Virtual Screening of Conjugated Polymers for Organic Photovoltaic Devices Using Support Vector Machines and Ensemble Learning

Herein, we report virtual screening of potential semiconductor polymers for high-performance organic photovoltaic (OPV) devices using various machine learning algorithms. We particularly focus on support vector machine (SVM) and ensemble learning approaches. We found that the power conversion efficiencies of devices prepared with the polymer candidates can be predicted with their structure fingerprints as the only inputs. In other words, no preliminary knowledge about material properties was required. Additionally, the predictive performance could be further improved by “blending” the results of the SVM and random forest models. The resulting ensemble learning algorithm might open up a new opportunity for more precise, high-throughput virtual screening of conjugated polymers for OPV devices.


Introduction
Organic photovoltaic (OPV) devices have been attracting much attention because of their advantageous properties, including light weight, mechanical flexibility, low material and fabrication cost, and short energy payback times [1][2][3][4]. Apart from traditional solar panels, possible applications of OPV devices also include power generators for wearable electronics, portable devices, and the Internet of Things (IoT) [5][6][7][8][9][10]. State-of-the-art OPV devices are based on the concept of the "bulk heterojunction." Although some novel nonfullerene acceptors have been identified as promising candidates [3], the photoactive layers of the most efficient OPV devices reported so far consist of a conjugated polymer as the electron donor and a fullerene derivative as the electron acceptor, forming interpenetrating p-n heterojunction networks [1]. The high interfacial area between the donor and acceptor, which helps overcome the high binding energy of excitons, ensures effective generation of free charge carriers upon solar irradiation, thereby resulting in a high photocurrent and a decent power conversion efficiency (PCE) [1][2][3][4].
The design of organic materials is one of the main focuses of OPV-related research. Unfortunately, the large size of chemical space, which was recently estimated to be on the order of 10^6 molecules, makes rational material search very challenging [11][12][13][14][15]. For instance, the Harvard Clean Energy Project explored the molecular space through basic combination rules from an initial collection of 26 molecular fragments, which resulted in the calculation of material properties for 3.5 million materials [15]. Because the combination of just 26 building blocks leads to such a great number of molecules for OPV applications, a distributed computing framework was required to carry out these calculations [15]. Accordingly, there remains a need for an effective virtual screening method that is capable of rapidly screening potential organic materials in parallel with experimental selection and validation. In addition to high precision, an ideal screening platform also requires a high level of generality and should be able to adapt to the rapid development of new materials.
Recently, machine learning (ML) has been extensively employed for accelerating the virtual screening of organic materials in various fields, such as organic light-emitting diodes (OLEDs) and OPV devices [12][13][14][15][16][17]. In particular, multilayer perceptrons have been used to yield highly accurate predictions of many properties to accelerate material discovery for OPV devices [15]. More recently, Nagasawa et al. collected experimental results for more than 1200 conjugated polymers from ~500 publications; the parameters included band gaps (E_g), molecular weights (M_w), energy levels, and fingerprints of the chemical structures [17]. The authors conducted supervised ML based on random forest (RF) and artificial neural network (ANN) models for screening potential polymer donors for bulk heterojunction OPV devices. As a result, they were able, for the first time, to design a conjugated polymer using machine learning algorithms.
The support vector machine (SVM) is one of the most common supervised machine learning algorithms [18]. It creates a hyperplane which separates the data into classes. The SVM is usually considered effective for datasets in which the number of feature dimensions is greater than the number of samples. Meanwhile, ensemble learning can improve the performance of machine learning by combining the outcomes of several models [19]. Aggregating the predictions of multiple models usually leads to better predictive performance than a single model. Indeed, ensemble methods have been widely used to solve many real-world problems. For example, the "Netflix Prize" was aimed at improving the accuracy of predictions of how customers rate a movie based on their previous preferences [20]. Many winners adopted ensemble learning in their recommendation engines, thereby achieving substantially improved performance.
Herein, we report results of ML for OPV applications using SVMs and compare the performance with that of the RF model. More importantly, the device PCEs were predicted based only on the fingerprints of the chemical structures. We found that the prediction performance of the models, which were trained only with chemical fingerprints, was comparable with previous results [17]. In other words, the PCE values of the potential polymer candidates can be predicted without knowing any preliminary material properties. Furthermore, the accuracy of PCE prediction from the chemical structures was improved by "blending" the results of the SVM and RF models. The resulting ensemble learning algorithm might open up a new opportunity for more precise, high-throughput virtual screening of conjugated polymers for organic solar cells.

Experiment
The simplified molecular input line entry system (SMILES) codes and average PCE values were obtained from the previous results of Nagasawa et al. [17]. This dataset consists of 1203 polymers and the corresponding material and device properties. As illustrated in Figure 1, the chemical structures of the polymers were first reduced to their repeating units. Then, the units were converted to SMILES codes. We used RDKit with its Python API to generate the chemical fingerprints from the SMILES codes [21]. There are many types of fingerprints for representing chemical structures, including the molecular access system (MACCS) [17,22] and the extended connectivity fingerprint (ECFP) [17,23]. In this work, we chose circular fingerprints, built by applying the Morgan algorithm, to set bits for the SMILES codes [21]. The length of each fingerprint was 2048 bits. When converting the chemical structures to Morgan fingerprints, the radius of the fingerprint (r) is one of the important parameters, because it determines how much connectivity information is taken into account. As the number "4" in ECFP4 corresponds to the diameter of the atom environments, Morgan fingerprints generated by RDKit with a radius of 2 are roughly equivalent to ECFP4. During the ML processes, the whole dataset was split into training and testing subsets; 25% of the dataset was included in the testing split. Note that the same split subsets were used for the different models in order to obtain consistent evaluation results. Figure 1 displays the typical process flow of the whole ML prediction. After the Morgan fingerprints were obtained from the original chemical structures, the models were fitted to the training subset together with the corresponding PCE values. Then, the model accuracies were evaluated by applying the models to the testing subset and comparing with the corresponding PCE values.
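The splitting step can be illustrated with a minimal sketch. It is not the code used in this work; the helper name split_indices and the seed value are assumptions made only for illustration. The point it shows is that a seeded split yields the same training and testing indices every time, so several models can be trained and evaluated on identical subsets.

```python
import random


def split_indices(n_samples, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) for a reproducible random split.

    Hypothetical helper: the name, signature, and seed are illustrative
    assumptions, not part of the original workflow.
    """
    rng = random.Random(seed)           # local generator, independent of global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# 1203 polymers with 25% held out, as described in the text.
train_idx, test_idx = split_indices(1203, test_fraction=0.25, seed=42)
print(len(train_idx), len(test_idx))    # 902 301

# Calling the helper again with the same seed reproduces the same split,
# which is what allows every model to see identical subsets.
assert split_indices(1203, 0.25, seed=42) == (train_idx, test_idx)
```

With 1203 samples and a 25% test fraction, rounding gives 301 test samples, which agrees with the validation set size quoted in the Results and Discussion.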

Results and Discussion
We initially chose a Morgan radius of 2, which is similar to ECFP4, to generate the chemical fingerprints. A radius of 2 indicates that the maximum diameter of the circular neighborhoods in the molecule is 4. This value is usually sufficient for similarity searching. Figure 2(a) shows the initial results of the optimized SVM model when the radius of the Morgan fingerprint was set at 2. The validation of the model was performed with 301 samples (25% of the dataset). A correlation coefficient (R) was used to evaluate the model performance. The maximum R value is +1, which indicates a perfect match between the predicted and experimental PCE values [17]. The R value of this model was 0.587. The major problem with this result was that the PCE values were overestimated for samples with low experimental PCE; the PCE was also underestimated when the experimental PCE was larger than ~7%.
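For reference, the evaluation metric can be sketched as follows, assuming that R denotes the Pearson correlation coefficient between experimental and predicted PCE values (the text does not name the variant explicitly). The function name pearson_r and the toy numbers are illustrative assumptions only.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: this is the "R" used in the text; the name is hypothetical.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values (not data from this work): predictions that are a perfect
# linear function of the experimental values give R = 1.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r_value = pearson_r(experimental, predicted)
```

Note that R measures linear association only, so systematic overestimation at low PCE and underestimation at high PCE, as observed in Figure 2(a), is not by itself penalized if the predictions remain linearly related to the experimental values.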
In order to further improve the accuracy, we mapped the effect of the Morgan radius on the model accuracy. We performed ten simulation runs and obtained the correlation coefficient for each run. Figure 2(b) displays the influence of the Morgan radius on the performance of the SVM model. We could see that the accuracy depended slightly on how the dataset was split. The standard deviation (σ) of the ten runs was as small as 0.003 when the Morgan radius was set at 2. On the other hand, the average R value roughly improved with increasing radius, although we still observed some fluctuations (Figure 2(b)). In order to obtain a robust model, we arbitrarily selected a Morgan radius of 5, which corresponds to circular atom neighborhoods with a maximum diameter of 10 in the molecules, for the following work. As a result, the average R value of the SVM model reached 0.633 ± 0.018.
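The "mean ± σ" summary over repeated runs can be sketched as below. This is an illustration under stated assumptions: the helper name summarize_runs is hypothetical, the sample standard deviation (n − 1 denominator) is assumed because the text does not specify which estimator was used, and the listed R values are placeholders rather than results from this work.

```python
import statistics


def summarize_runs(r_values):
    """Return (mean, sample standard deviation) of per-run R values.

    Hypothetical helper; statistics.stdev uses the n - 1 denominator,
    which is an assumption about how sigma was computed.
    """
    return statistics.mean(r_values), statistics.stdev(r_values)


# Placeholder R values for five runs with different random splits
# (illustrative only, NOT results reported in this work).
toy_r_values = [0.60, 0.62, 0.64, 0.66, 0.68]
mean_r, std_r = summarize_runs(toy_r_values)
print(f"R = {mean_r:.3f} ± {std_r:.3f}")
```

A small σ relative to the differences between radii suggests that a change in the average R reflects the fingerprint setting rather than the particular split.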
After the parameters for the Morgan fingerprint were determined, another ten calculations were performed; for each run, the same split dataset was also used for the other two models, which will be discussed later. Figure 3(a) displays a typical result of the SVM; the correlation coefficient was 0.627. This R value is comparable with that reported earlier, indicating acceptable results for the SVM model [17]. The performance of the SVM model was also compared with that of the RF algorithm. Figure 3(b) shows the result of the RF model using the same split dataset; the R value was 0.640. The RF model thus appeared to be slightly better than the SVM algorithm. One should note that the prediction was based only on the SMILES information of the polymers. In other words, we predicted the PCEs solely from the chemical structures of the polymers. Previously, Nagasawa et al. applied digital keys, either MACCS (166 digital keys) or ECFP6 (1064 bits), together with information about the energy levels of the highest occupied molecular orbital (HOMO), E_g, and M_w of the polymers, to the RF model [17]. In our case, we adopted only Morgan fingerprints with 2048 bits as the input, and very similar performance was achieved. These results are not surprising. Including the other information would add only three additional features to our inputs, which would have only a trivial effect on the prediction results. Therefore, our work suggests that one can possibly predict the PCE of a particular polymer even without any preliminary knowledge about material properties.
As we indicated in the introduction, ensemble learning is one way to improve the model with the current dataset. In fact, RF itself is one kind of ensemble learning; it builds a number of decision trees and aggregates the results to obtain a more accurate and stable prediction [17]. Because we have two models, SVM and RF, with similar prediction ability, we considered "blending" the predictions from the two models to increase the predictive accuracy. We used the same training set to train the models individually and simply averaged the PCE values predicted by the two models for the same test set. The resulting normalized PCE values of the ensemble learning are illustrated in Figure 4(a). The typical performance of the ensemble learning was better than those of the SVM and RF models. In particular, the number of data points deviating strongly from the ideal diagonal line (R = 1), which represents perfect prediction, was reduced after the outputs of the individual models were averaged. As a result, the R value from ten runs increased to 0.653 ± 0.015, indicating that the predictive performance was indeed improved using such ensemble learning. The R values of the three models are summarized in Table 1.
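The blending step described above amounts to an unweighted average of the two models' predictions on the same test set. The sketch below illustrates this with toy numbers; the function names blend and pearson_r, as well as all values, are assumptions for illustration and are not taken from this work. The toy predictions are deliberately constructed so that the errors of the two models partly cancel.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (assumed to be the R of the text)."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def blend(pred_a, pred_b):
    """Unweighted average of two prediction lists for the same samples."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


# Toy values only (NOT data from this work).
experimental = [1.0, 2.0, 3.0, 4.0]
svm_pred = [1.5, 1.5, 3.5, 3.5]   # stand-in for SVM predictions
rf_pred = [0.5, 2.5, 2.5, 4.5]    # stand-in for RF predictions
blended = blend(svm_pred, rf_pred)

r_svm = pearson_r(experimental, svm_pred)
r_rf = pearson_r(experimental, rf_pred)
r_blend = pearson_r(experimental, blended)
print(f"SVM {r_svm:.3f}, RF {r_rf:.3f}, blend {r_blend:.3f}")
```

Averaging helps only to the extent that the errors of the two models are not strongly correlated; with real data the gain is typically much smaller than in this constructed example.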
In order to demonstrate the function of the model, we applied it to a dataset from a previous report, which contains 316 polymers [24]. Using the above ensemble model, we screened the compounds and selected the six compounds with the highest predicted PCE values; the repeating units are depicted in Figure 5. Note that we employed all 1203 polymers for training in order to obtain the best performance. The compounds are ordered by decreasing predicted PCE value. For instance, compound 1 was predicted to exhibit the highest PCE value, followed by compound 2.
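The selection step can be sketched as a simple ranking of candidates by predicted PCE. The helper name top_candidates, the candidate labels, and the predicted values below are hypothetical placeholders and do not correspond to the compounds or predictions in this work.

```python
def top_candidates(names, predicted_pce, k=6):
    """Return the k (name, predicted PCE) pairs with the highest predictions.

    Hypothetical helper for illustration; results are sorted in order of
    decreasing predicted PCE, matching the ordering used in the text.
    """
    ranked = sorted(zip(names, predicted_pce), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Placeholder candidates and predictions (NOT values from this work).
names = ["P-a", "P-b", "P-c", "P-d", "P-e", "P-f", "P-g", "P-h"]
predicted = [3.1, 5.4, 4.8, 2.2, 6.0, 5.9, 1.7, 4.1]

for rank, (name, pce) in enumerate(top_candidates(names, predicted, k=6), start=1):
    print(rank, name, pce)
```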
From the structures, one common feature of these six compounds is the presence of fluorine atom(s). Fluorination of the conjugated polymer backbone has been considered a promising approach for the development of high-performance polymers for OPV devices [25]. The addition of fluorine atoms lowers the HOMO levels of the donor polymers, thereby increasing the open-circuit voltage. Moreover, fluorination also leads to planarization of the backbone. The resulting better morphology of the polymer blends possibly improves charge transport [25]. Therefore, OPVs prepared with these compounds should exhibit high PCEs. Fluorination has indeed been a very important research trend in this field over the past few years.
Based on the chemical features identified in the above results, we propose one new structure (structure 7). According to the model, the predicted PCE value of this compound was even higher than that of compound 1. Further synthetic work should be done to confirm our prediction. The source code of the models as well as of the predictions can be found in reference [26].
In view of the current predictive results, the performance is still far from satisfactory for virtual screening. From our current knowledge of OPV devices, the power conversion efficiency is very sensitive to the device processing conditions, including polymer purities, processing solvents [27], additives [28,29], annealing methods [30,31], device structures [32,33], and interfacial materials [34]. The measurement conditions, for example, whether the device is tested in a N2-filled glove box or after encapsulation, as well as the ambient temperature, will also affect the PCE values. These variations make accurate prediction very difficult. Therefore, it is very important to standardize these conditions to improve the quality of the raw data. Another approach is to set up a protocol to digitize these variations, which would allow the machine to systematically learn the parameters. We expect that the predictive performance can be substantially improved once we improve the quality of the data and/or dramatically increase the number of data entries.

Conclusion
In this work, we demonstrated that the PCE values of bulk-heterojunction OPV devices can be predicted using only information about the chemical structure of the polymer donor in the device. A correlation coefficient higher than 0.60 could be obtained using the SVM and RF models. The predictive performance was comparable with that of the RF algorithm using inputs that include other properties, such as band gaps and molecular weights. The results of these two models were further combined in an ensemble to generate more accurate predictions. Because the ML reported herein does not require large computational resources, we anticipate that such an ensemble learning algorithm can pave a new avenue for high-throughput virtual screening with even higher prediction accuracies in the near future. Although the accuracies of the models reported in this work are still far from satisfactory for virtual screening, we believe that the results reported herein have pushed the virtual screening of organic materials for solar applications one step further toward precise prediction and design of high-performance materials using artificial intelligence.

Figure 2: (a) The predictive result of the SVM model; the radius (r) of the Morgan fingerprint was two. Note that the experimental and predicted PCE values have been normalized. The diagonal line represents perfect positive correlation (R = 1). (b) Effect of the Morgan radius on the performance of the SVM model. Note that the diameter of each bubble represents the relative magnitude of the standard deviation of the correlation coefficients.

Figure 3: (a) The predictive result of the SVM model; the radius of the Morgan fingerprint was five. (b) The predictive result of the RF model using the same split dataset.

Figure 5: The chemical structures (1-6) of the repeating units that we predicted from the other dataset consisting of 316 compounds. Structure 7 is the one that we proposed; it was predicted to have the highest PCE value among the seven compounds in this figure.

Table 1: Summary of the three models in this work. Note that the same split dataset was used for model training and testing.