Discovery of Crystallizable Organic Semiconductors with Machine Learning

Crystalline organic semiconductors are known to have improved charge carrier mobility and exciton diffusion length in comparison to their amorphous counterparts. Certain organic molecular thin films can be transitioned from initially prepared amorphous layers to large-scale crystalline films via abrupt thermal annealing. Ideally, these films crystallize as platelets with long-range-ordered domains on the scale of tens to hundreds of microns. However, other organic molecular thin films may instead crystallize as spherulites or resist crystallization entirely. Organic molecules that have the capability of transforming into a platelet morphology feature both high melting point (Tm) and crystallization driving force (ΔGc). In this work, we employed machine learning (ML) to identify candidate organic materials with the potential to crystallize into platelets by estimating the aforementioned thermal properties. Six organic molecules identified by the ML algorithm were experimentally evaluated; three crystallized as platelets, one crystallized as a spherulite, and two resisted thin film crystallization. These results demonstrate a successful application of ML in the scope of predicting thermal properties of organic molecules and reinforce the principles of Tm and ΔGc as metrics that aid in predicting the crystallization behavior of organic thin films.


SI-1.1 Dataset preparation for virtual screening
To enable virtual screening for crystallizable organic semiconductors (OSCs), we modeled two chemical properties independently: the melting temperature (Tm) and the enthalpy of melting (ΔHm). To achieve this, we compiled datasets from diverse sources and preprocessed them independently. The melting temperature dataset was assembled from United States Patent and Trademark Office (USPTO) patent records S1, containing measured Tm for approximately 220 thousand molecules, and from the Tm data of Tetko et al. S2 for approximately 47 thousand molecules. Experimentally measured fusion (melting) enthalpy data were sourced from Acree et al. S3, S4, S5, leading to a dataset of approximately 5 thousand molecules. The data for each thermal property (Tm and ΔHm) were combined across their sources independently, and duplicated records were treated according to Fourches et al. S6. Molecular structures, represented in SMILES format, were standardized using MolVS S7.
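The duplicate-handling step can be illustrated with a minimal sketch. It assumes one common reading of such curation protocols: replicate measurements for the same standardized SMILES are averaged when they agree and the molecule is discarded when they conflict. The function name `merge_duplicates`, the `tolerance` parameter, and the example values are hypothetical and are not taken from this work.

```python
from collections import defaultdict
from statistics import mean


def merge_duplicates(records, tolerance):
    """Collapse replicate measurements per standardized SMILES string.

    records:   iterable of (smiles, value) pairs; SMILES are assumed to be
               standardized already, so identical strings mean identical molecules.
    tolerance: largest accepted spread (max - min) among replicates.
               Illustrative parameter, not a value from the text.
    Returns {smiles: mean value}; molecules whose replicates disagree by
    more than `tolerance` are dropped.
    """
    grouped = defaultdict(list)
    for smiles, value in records:
        grouped[smiles].append(value)

    merged = {}
    for smiles, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            merged[smiles] = mean(values)
    return merged


# Hypothetical melting points in K (not data from this work).
tm_records = [
    ("c1ccccc1", 278.6),
    ("c1ccccc1", 278.8),
    ("c1ccc2ccccc2c1", 353.0),
    ("c1ccc2ccccc2c1", 410.0),  # conflicting replicate, molecule is dropped
    ("Oc1ccccc1", 314.0),
]
clean = merge_duplicates(tm_records, tolerance=5.0)
```

Averaging consistent replicates keeps one record per molecule, while dropping conflicting ones avoids training on a value that is known to be uncertain.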

SI-1.2 Model development
In this study, two machine learning models were developed independently, one on the Tm dataset and one on the ΔHm dataset. Molecules were featurized by computing descriptors available through AlvaDesc 2.0 S8. The molecular descriptors were preprocessed using the Scikit-learn library S9 according to the following protocol, applied independently to each dataset. First, each descriptor was scaled with a MinMax scaler. Second, scaled descriptors with a variance of less than 0.01 were filtered out. Third, a pairwise Spearman correlation matrix was computed for the remaining descriptors, and only one descriptor was kept from each cross-correlated pair (absolute Spearman correlation above 0.9), yielding a preprocessed dataset suitable for modelling. The preprocessed datasets consisted of 476 descriptors for Tm (full list in Table S1) and 272 descriptors for ΔHm (full list in Table S2). The Gradient Boosting Decision Trees (GBDT) algorithm from the XGBoost library S10 was used to build both ML models. The GBDT family of methods is known to be among the best-performing ML methods for tabular data S11. The hyperparameters of each model were optimized with the Optuna library S12 in a nested 5-fold cross-validation loop. The performance of each model was measured in the outer loop of the nested 5-fold cross-validation, giving a mean absolute error (MAE) of 32 ± 1 K for Tm and 6.7 ± 0.2 kJ/mol for ΔHm (for details see Figures S4 and S5). Both models were then refitted with their best hyperparameter sets on the full preprocessed Tm and ΔHm datasets for the virtual screening.
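The three-step descriptor filter can be sketched in plain Python to make the thresholds concrete. This is an illustration only: the protocol above was run with Scikit-learn, whereas the helper names here (`minmax_scale`, `spearman`, `select_descriptors`) are hypothetical. Which member of a correlated pair is kept is not stated in the text; this sketch assumes a greedy rule that keeps the first descriptor encountered.

```python
from statistics import pvariance


def minmax_scale(col):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(col), max(col)
    if hi == lo:
        return [0.0] * len(col)
    return [(x - lo) / (hi - lo) for x in col]


def ranks(col):
    """Average ranks: tied values share the mean of their positions."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    result = [0.0] * len(col)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and col[order[j + 1]] == col[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def select_descriptors(table, var_min=0.01, rho_max=0.9):
    """table: {descriptor name: list of values}. Returns the kept names."""
    scaled = {name: minmax_scale(col) for name, col in table.items()}
    # Step 2: drop descriptors whose scaled variance is below var_min.
    varied = [n for n, col in scaled.items() if pvariance(col) >= var_min]
    # Step 3 (assumed greedy): keep a descriptor only if it is not strongly
    # rank-correlated with any descriptor already kept.
    kept = []
    for name in varied:
        if all(abs(spearman(scaled[name], scaled[k])) <= rho_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table (not data from this work).
table = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],  # monotonic in A, so |rho| = 1 and B is dropped
    "C": [5, 1, 4, 2, 3],
    "D": [7, 7, 7, 7, 7],   # constant, so variance 0 and D is dropped
}
kept = select_descriptors(table)
```

Because Spearman correlation depends only on ranks, the MinMax scaling does not change step 3; it matters for step 2, where it puts the 0.01 variance threshold on a common scale for all descriptors.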

SI-1.4 t-SNE analysis
The focused library, the ML library (see Figure 1), and previously characterized materials S14 were featurized by computing the AlvaDesc 2.0 S8 descriptors used for ML modelling of Tm and ΔHm (full lists in Table S1 and Table S2). Each descriptor was then scaled individually with a MinMax scaler, and a 2D t-SNE projection was computed using the Scikit-learn library S9.
Table S3: List of 44 molecules in Focused ML Library with their functional annotation.
POM and XRD measurements. There was no exposure to atmosphere between deposition and annealing.

SI-1.7 Equipment and characterization
Images of the annealed crystalline films were taken with a polarized optical microscope (POM), model Olympus BX60F5. Ellipsometry was performed using a J.A. Woollam M2000 variable-angle spectroscopic ellipsometer, and the resulting data were analyzed using CompleteEASE software; this technique was employed for tooling the thermal evaporator and determining thin film thickness. The empirical thermal properties of the organics studied here were obtained using a TA Instruments differential scanning calorimeter (DSC) 2500 equipped with an RCS90 cooler. The furnace chamber of the DSC was continuously flushed with nitrogen. For each material, approximately 5 mg of the as-received material was loaded into a hermetically sealed aluminum pan with a small hole in the sealed lid; the reference pan also had a small hole to match the conditions of the sample pan. The DSC was calibrated for temperature using a melting temperature measurement of indium and for heat capacity using sapphire standards. For the secondary DSC scan of CZBDF seen in Figure S3, a PerkinElmer DSC-8500 differential scanning calorimeter was used with an empty aluminum pan as reference. The scan rate for all DSC scans was 10 °C/min. The thermal properties obtained by DSC and reported here were taken at the onset of each thermal event. X-ray diffraction (XRD) was performed using a Bruker D8 Discover X-ray diffractometer with a copper source (wavelength 1.54 Å).
Table S4: A summary of the optimized fabrication conditions for each crystallized material in this work. All underlayer materials are 5 nm thick and deposited between the ITO and the organic film to be crystallized.

≥ 550 K and ΔGc^pred < −7.5 kJ/mol were selected according to the heuristic rule for the crystallization driving force at Tc as a function of Tm established for the previously developed materials S14. That led to 44 candidate molecules (see Table S3) suitable for expert assessment.
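The two-threshold selection can be written as a simple filter. This sketch assumes that the 550 K bound applies to the predicted melting temperature, and it takes the predicted values as given rather than computing them. The constant names, the `passes_screen` helper, and the candidate values are hypothetical.

```python
TM_MIN_K = 550.0         # lower bound on predicted Tm (assumed quantity), in K
DGC_MAX_KJ_MOL = -7.5    # upper bound on predicted ΔGc, in kJ/mol


def passes_screen(tm_pred, dgc_pred):
    """True when both thresholds are met: Tm >= 550 K and ΔGc < -7.5 kJ/mol."""
    return tm_pred >= TM_MIN_K and dgc_pred < DGC_MAX_KJ_MOL


# Hypothetical predictions (not data from this work).
candidates = [
    {"name": "mol_A", "tm_pred": 610.0, "dgc_pred": -9.2},
    {"name": "mol_B", "tm_pred": 540.0, "dgc_pred": -10.1},  # Tm too low
    {"name": "mol_C", "tm_pred": 580.0, "dgc_pred": -6.0},   # driving force too weak
    {"name": "mol_D", "tm_pred": 550.0, "dgc_pred": -7.6},   # exactly at the Tm bound
]
selected = [c["name"] for c in candidates
            if passes_screen(c["tm_pred"], c["dgc_pred"])]
```

The Tm bound is inclusive and the ΔGc bound is strict, matching the inequality signs in the text.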

Figure S1: (a) POM image of CZBDF after post-deposition annealing and (b) XRD of CZBDF grown on ITO with post-deposition annealing. The scan is featureless with the exception of a peak at 2θ = 30.2° from the ITO substrate.

Figure S3: DSC scan of CZBDF showing initial heating (a) and the following cooling (b). A glass transition and melting peak are observed upon heating, and a crystallization peak is seen upon cooling.

Figure S4: Tm ML model performance measured over 5-fold cross-validation.

Table S1: List of AlvaDesc molecular descriptors used for building the Tm ML model.

Table S2: List of AlvaDesc molecular descriptors used for building the ΔHm ML model.

number of rotatable bonds (NumRotableBonds ≥ 3). The resulting focused library of 7742 molecules reflects the relevant chemical space of primarily aromatic molecules. All properties were computed with the RDKit library S13; Aromatic Proportion is defined as the ratio of the heavy atom count in an aromatic state to the total heavy atom count. For each molecule in the focused library, ΔH pred