Imputation of missing gas permeability data for polymer membranes using machine learning

Polymer-based membranes have the potential for use in energy-efficient gas separations. The successful exploitation of new materials requires accurate knowledge of the transport properties of all gases of interest. Open-source databases of gas permeabilities are of significant potential benefit to the research community. The Membrane Society of Australasia (https://membrane-australasia.org/) hosts a database of experimentally measured and reported polymer gas permeabilities. However, the database is incomplete, limiting its potential use as a research tool. Here, missing values in the database were imputed (filled) using machine learning (ML). The ML model was validated against gas permeability measurements that were not recorded in the database. Through imputing the missing data, it is possible to re-analyse historical polymers and look for potential "missed" candidates with promising gas selectivity. In addition, for systems with limited experimental data, ML using sparse features was performed, and we suggest that once the permeability of CO₂ and/or O₂ for a polymer has been measured, most other gas permeabilities and selectivities, including those for CO₂/CH₄ and CO₂/N₂, can be quantitatively estimated. This early insight into the gas permeability of a new system can be used at an initial stage of experimental measurements to rapidly identify polymer membranes worth further investigation.


Introduction
Membranes with polymers as the selective layer have been widely used for the separation of gas mixtures, including those of key relevance to energy and the environment [1][2][3][4]. The development of new polymers with improved gas permeability and selectivity would enhance the efficiency of membrane gas separations of industrial interest [5]. Polymers have been developed for various purposes, including hydrogen recovery during ammonia preparation (H₂ from N₂) [6,7], oxygen or nitrogen enrichment of air (O₂ from N₂) [8,9], and natural gas sweetening or biogas upgrading (CO₂ from CH₄) [10][11][12]. Rising concern about global warming caused by greenhouse gas emissions has also focused attention on pre-combustion and post-combustion carbon capture (mainly H₂ from CO₂, and CO₂ from N₂, respectively) [13,14]. Membranes with high permeability are desired for industrial application at large scales; however, there is a well-known trade-off between gas permeability and gas selectivity for a gaseous mixture, with an upper bound for each gas pair quantified by Robeson in 1991 [15] and updated in 2008 [16]. Subsequent effort in polymer design and synthesis has pushed the Robeson upper bound towards polymers with both higher permeability and better selectivity, resulting in recently revised upper bounds [17,18]. However, since experimental analysis of the transport properties of novel materials can be time consuming and accurate studies require specialized equipment, many studies are limited to a single gas pair [19] or to a few gases [20]. It is therefore likely that opportunities have been missed, where a polymer has promising selectivity and permeability for a gaseous mixture other than those tested. Conversely, for rapid screening of potential polymers, it would be advantageous to assess the full potential based on fewer gas permeability measurements, helping focus experimental effort on the most promising systems.
The Membrane Society of Australasia (MSA) hosts the public Polymer Gas Separation Membrane Database, which was launched online in 2012, and allows access to gas permeability data for a large number of polymers published from 1950 to 2018 [21]. Initially, the resource consisted of data collated by Robeson, who empirically observed and characterized the upper bound phenomenon in 1991 [15] and again in 2008 [16], reflecting the growing interest in energy-efficient separations using membranes. The database now contains over 1500 data points. The philosophy of the database is for it to be open, with anyone able to freely add or edit the database, but the content is checked regularly to ensure the data-points are correctly referenced. Gas permeability measurements originally included hydrogen, oxygen, nitrogen, carbon dioxide and methane. Later the measurements were extended to vapours such as ethylene, ethane, propene, propane, butene, butane, carbon tetrafluoride, hexafluoroethane and octafluoropropane. The membrane materials included cover a range of rubber and glassy polymers, carbon sieves, zeolites and mixed composites. However, not every entry in the database contains the experimentally reported values for every gas listed above. Due to the widespread use of the Polymer Gas Separation Membrane Database by researchers in academia and industry (approximately 1,000 views per month in 2019 and 2020), imputation of the database is desirable. In statistics, imputation refers to the process of replacing missing data with substituted values. With an accurate imputation model, one can not only retrieve candidates with good gas selectivity that were not measured at the time of publication, but also get a more complete database for future experimental and theoretical study. 
In addition, experimental measurement of the gas permeability of previously reported polymers would be time consuming and expensive, especially when the likelihood of publishing such studies in a formal journal article is small. It is thus highly desirable to develop an easily accessible computational model to estimate the permeability of certain gases when the original experimental data was not reported.
Machine learning (ML) methods have been developed and applied to polymers for predicting properties including glass transition temperature [22], dielectric constants [23] and gas permeability [24], and for the discovery of novel functional polymers [25]. One of the main models for predicting polymer membrane performance is group contribution theory, where the chemical structure of a polymer is divided into smaller fragments and the fragments are used as input features in various ML models [26][27][28]. Recently, hierarchical methods for fingerprinting polymers for property prediction have also been reported [29]. Such models were built upon the chemical structures of polymers and are of great value for identifying structure-property relationships. However, the gas permeability of the same polymer is often measured under different conditions, for example, after different solvent treatments or degrees of aging, and ML models based upon polymer fingerprints cannot distinguish between these conditions. The Polymer Gas Separation Membrane Database often holds data for the same polymer tested under different conditions, in different laboratories with different instruments, and an ML model relying on chemical structure alone would not be sufficient for filling the missing values for gas permeability.
An alternative way of imputing the database is to predict the permeability of unknown gases based on data for gases with known permeability. As suggested by Alentiev et al., the logarithms of the gas permeability coefficients P_i and P_j of gases i and j are strongly correlated [30]; thus it is plausible to predict the permeability of gas i using the permeability data for other gases without requiring any information on the molecular structure of the polymers or the experimental conditions. In this paper, we developed both linear and non-linear ML models to "learn" the relationship between the permeabilities of different gases recorded in the Polymer Gas Separation Membrane Database and imputed the missing gas permeabilities in the database using the ML models. An overview of the approach is shown in Scheme 1. In this way, it is possible to uncover additional, previously unknown, properties of existing polymers in the database. We do not aim to discover any novel gas selective polymers in this paper; however, the open-source ML model we present could be used in the future to impute the gas permeability data of novel polymers at an early stage of experimental measurements and thus help to accelerate the identification of polymer membranes worth further experimental investigation.

Methods
The Polymer Gas Separation Membrane Database was downloaded from the online portal of the Membrane Society of Australasia (MSA) on 11/06/2020 at https://membrane-australasia.org/msa-activities/polymer-gas-separation-membrane-database/. We focused on data for the commonly measured gases He, H₂, O₂, N₂, CO₂ and CH₄ and removed entries that contained no permeability data for any of these gases. We were left with a database of 1,378 entries, and the number of missing values for the permeability of each gas in the target database is shown in Table 1. The gas permeability of polymers was recorded in Barrer (1 Barrer = 10⁻¹⁰ cm³(STP)·cm·cm⁻²·s⁻¹·cmHg⁻¹). In this study, the gas permeabilities were converted to their base-10 logarithms, since the logarithmic values are used to define the empirical Robeson upper bounds of gas selectivity [15,16].
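As an illustration of this preprocessing step, the following minimal sketch drops entries with no data for any of the six target gases and converts the remaining permeabilities to base-10 logarithms. The entry layout (a dictionary per polymer), the function name `to_log10_rows` and the example values are hypothetical and are not taken from the database or from the published code.

```python
import math

GASES = ["He", "H2", "O2", "N2", "CO2", "CH4"]

def to_log10_rows(entries):
    """Keep entries with at least one measured permeability and return log10(Barrer) rows.

    A missing measurement is an absent key (or None) on input and NaN on output.
    """
    rows = []
    for entry in entries:
        values = [entry.get(gas) for gas in GASES]
        if all(v is None for v in values):
            continue  # no permeability data for any target gas: drop the entry
        rows.append([math.log10(v) if v is not None else float("nan") for v in values])
    return rows

# Hypothetical entries in Barrer; the numbers are illustrative only.
entries = [
    {"name": "polymer A", "O2": 10.0, "N2": 1.0, "CO2": 100.0},
    {"name": "polymer B"},                        # nothing measured, so it is removed
    {"name": "polymer C", "H2": 1000.0, "CH4": 0.1},
]
log_rows = to_log10_rows(entries)
```

Each returned row follows the gas order in `GASES`, so the resulting table can be passed directly to an imputation routine that treats NaN as missing.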
Missing value imputation of the Polymer Gas Separation Membrane Database was performed using the Multivariate Imputation by Chained Equations (MICE), which 'fills in' the missing data in a dataset through an iterative procedure of predictive models [31]. In each iteration, the missing values of a specific variable are predicted with the predictive model using other variables in the dataset. The pseudo-code of the MICE algorithm is shown in Algorithm 1 in the Supporting Information.
Scheme 1. Overview of our workflow. We imputed the existing Polymer Gas Separation Membrane Database using machine learning, where previously reported polymers in the database that miss gas permeability values can be re-analysed and these gaps filled. An imputed database opens the potential for identifying promising polymers and the developed machine learning model has the potential to take incomplete datasets for novel polymers and impute them in seconds to allow the evaluation of which systems should be the focus of continuing experimental effort.
Here, a linear model and a non-linear model were selected as the predictive model in the MICE algorithm, namely Bayesian Linear Regression (BLR) [32] and Extremely Randomized Trees (ERT) [33], respectively. The predictive performance of these two models on the test set was compared. BLR is an approach to linear regression in which the statistical analysis is undertaken with Bayesian inference, assuming that the errors of the regression model are normally distributed. ERT implements a meta-estimator that fits a number of randomized decision trees on various subsamples of the dataset and uses averaging to improve the prediction accuracy and control over-fitting. In this study, the ERT model was composed of 100 decision trees. The missing value imputation of the Polymer Gas Separation Membrane Database was performed using Python 3.7.1 and Scikit-learn 0.21.2 [34]. The code for imputing the database is available at github.com/qyuan7/polymer_permeability_imputation.
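A minimal sketch of this kind of chained-equation imputation is given below. It assumes that Scikit-learn's `IterativeImputer` is used as the MICE implementation, with `BayesianRidge` standing in for BLR and `ExtraTreesRegressor` with 100 trees for ERT; these choices, the synthetic data and the helper name `impute` are assumptions made for illustration and are not necessarily those of the published code.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the log10 permeability table: 200 "polymers" x 6 gases,
# constructed so that the columns are strongly correlated with each other.
base = rng.normal(loc=1.0, scale=1.0, size=(200, 1))
slopes = np.array([0.8, 0.9, 1.0, 1.1, 1.0, 1.2])
offsets = np.array([1.5, 1.6, 0.5, -0.2, 1.2, -0.3])
X_full = base * slopes + offsets + rng.normal(scale=0.05, size=(200, 6))

# Hide 20% of the values at random to mimic unreported permeabilities.
mask = rng.random(X_full.shape) < 0.2
X_missing = X_full.copy()
X_missing[mask] = np.nan

def impute(X, estimator):
    """Run chained-equation imputation with the given per-column predictive model."""
    imputer = IterativeImputer(estimator=estimator, max_iter=10, random_state=0)
    return imputer.fit_transform(X)

X_blr = impute(X_missing, BayesianRidge())
X_ert = impute(X_missing, ExtraTreesRegressor(n_estimators=100, random_state=0))

# RMSE on the hidden cells only, where the "true" synthetic value is known.
rmse_blr = float(np.sqrt(np.mean((X_blr[mask] - X_full[mask]) ** 2)))
rmse_ert = float(np.sqrt(np.mean((X_ert[mask] - X_full[mask]) ** 2)))
```

Observed values are returned unchanged; only the cells that were NaN are filled, which is why the error is evaluated on `mask` alone.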
The test set in this work was selected from papers published in 2019 and 2020 reporting the gas permeability of polymers of intrinsic microporosity (PIMs) [18,35,36] and polyimides [37][38][39][40][41][42], none of which had been recorded in the Polymer Gas Separation Membrane Database. Performance of the ML models on the test sets was measured in a round-robin manner with "dense features": for example, to test the model on the prediction of the H₂ permeability, the permeability data of H₂ was dropped from the test database and modelled as a function of the data for the other gases in the test database. To examine the ability of the imputation models for cases where only limited permeability data is available, test sets with "sparse features" were also used, where the permeability data of only one gas was used to predict the permeability of all other gases, for example, predicting the gas permeability of He, O₂, N₂, CH₄ and CO₂ using the gas permeability data of H₂. The performance of the ML model on the test set was measured by the root mean squared error (RMSE) between the logarithm gas permeability obtained by ML prediction and the experimentally reported values, as defined in equation (1), where $n$ is the number of data points, $p_i$ is the experimentally reported logarithm gas permeability of polymer $i$, and $\hat{p}_i$ is the logarithm gas permeability of polymer $i$ predicted using the ML model:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - \hat{p}_i\right)^{2}} \qquad (1)$$
The ability of the ML models to predict the gas selectivity of polymers was measured by a classification problem, where the ML models were used to predict whether polymers in the test set had gas selectivity beyond the Robeson 2008 upper bound. Polymers with gas selectivity above the Robeson 2008 upper bound were regarded as "positive", while those below the Robeson 2008 upper bound were regarded as "negative". The gas permeabilities of polymers were evaluated using the ML models to determine if they were predicted "positive" or "negative" in the Robeson diagram.
"True positive" (TP) represents polymers that were positive from both experimental measurements and ML prediction; "False positive" (FP) represents polymers that were positive from ML prediction but negative from experimental measurements; "True negative" (TN) represents polymers that were negative from both experimental measurements and ML prediction; and "False negative" (FN) represents polymers that were negative from ML prediction but positive from experimental measurements. We computed the accuracy, precision, and recall scores for identifying the polymers with gas selectivity above the Robeson 2008 upper bound. In this study, accuracy refers to the fraction of correct predictions from all predictions made, precision refers to the fraction of "true positive" values from values that were predicted as "positive", and recall refers to the fraction of "true positive" values from all values that were "positive" experimentally. The accuracy, precision and recall scores are defined in equations (2)-(4):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

Comparison of the BLR and ERT imputation results
A comparison of the BLR and ERT imputation results is shown in Fig. 1. The BLR and ERT imputation results are highly correlated apart from a few outliers, and no systematic error between the two imputation methods is observed, with neither method giving consistently larger or smaller predictions than the other. As shown in Fig. 1, the RMSE between the logarithm gas permeabilities obtained from the BLR and ERT imputations ranged from 0.07 to 0.26, with the largest disagreement observed for the CH₄ data. This is possibly because the data for CH₄ has a relatively weak correlation with the data for other gases, as shown in Fig. S1, which is in part due to the relatively low permeability of CH₄ in most glassy polymers, so that the measurement may have a lower accuracy than that of other gases. Furthermore, CH₄ has the largest effective diameter of the gases considered in this work, and is thus more affected by variations in the sample history, physical aging and measurement conditions [43]. The fact that both the linear BLR model and the non-linear ERT model produced highly correlated imputation results indicates that the MICE algorithm is relatively robust against the choice of the predictive model type. We have provided the imputed database obtained from both the BLR and ERT models in the Supporting Information and at github.com/qyuan7/polymer_permeability_imputation. In addition, the standard deviation of the BLR imputation is provided to give prediction confidence intervals.

Validation of the imputation models on the test set
We selected publications with experimental data not recorded in the Polymer Gas Separation Membrane Database for PIMs [18,35,36] and polyimides [37][38][39][40][41][42]. Representative molecular structures of the PIMs and polyimides are shown in Fig. 2. The test set contained experimental gas permeabilities of 50 PIM entries and 37 polyimide entries. As can be seen from Fig. 2, there is structural diversity in the test sets. In addition, polymers in the test set exhibit a wide range of gas selectivity, as shown in Table S1. For example, the range of CO₂/CH₄ selectivities in the test set is 3.2-75.0 and the range of CO₂/N₂ selectivities is 6.8-36.5.
Performance of the BLR and ERT imputation models was compared by computing the RMSE between the "predicted" logarithm gas permeability and the experimental logarithm gas permeability reported in the literature, as shown in Table 2. The BLR model was more accurate than the ERT model in predicting the gas permeability of PIMs, while the performance of the two models was comparable for polyimides, except that the ERT model had significantly larger errors for the H₂ permeability. The BLR model is in general more accurate than the ERT model on the test set with "dense features", where the permeability of one gas was predicted using the permeabilities of all other gases, and the discussion in this study for validation with "dense features" is primarily based on the predictions of the BLR model. Correlation of the experimentally reported gas permeability and the BLR model predictions is shown in Fig. 3. According to Table 2 and Fig. 3, the BLR model had the largest error in predicting the CH₄ and CO₂ permeability, and the smallest in predicting the O₂ permeability. From Fig. 3 it can be seen that the BLR model systematically underestimated the CO₂ permeability for almost all the entries in the test set, while no obvious systematic error is observed for the other gases. The most likely explanation for the model underestimating the CO₂ permeability is that researchers have been working towards improving the gas permeability by increasing the amount of free volume (or microporosity) of the polymers.
According to the solution-diffusion model of gas transport [44], greater free volume enhances both gas diffusivity and solubility, with the latter being particularly high for PIMs relative to conventional polymers. Thus, the pairwise relationship between different gases has changed over time, and the samples from the test set belong to the latest generation of polymers with relatively high CO₂ permeability. The Robeson diagrams showing the position of polymers in the Polymer Gas Separation Membrane Database for the selectivity of CO₂/CH₄ and CO₂/N₂ are shown in Fig. S4. A chronological increase can be observed in the gas selectivity, especially when comparing polymers reported after 2010 with those reported before 2000. To remove the error incurred by the time-dependent nature of the database, a time-series analysis was performed in which data points in the Polymer Gas Separation Membrane Database were split into smaller datasets by decade of publication, and each smaller dataset was imputed and validated against the test set. However, due to the existence of missing values and the inconsistent number of data points per decade in the database, the imputation results were not improved. As a result, we used the entries in the database as provided, without performing any time-based corrections, and the uncertainty in predicting the CO₂ permeability is represented by the standard deviation of the BLR prediction, as provided in a raw data file as additional Supporting Information.
Table 1. Number of missing values for the gas permeability in the Polymer Gas Separation Membrane Database of each gas. The total number of data points for the permeability of each gas was 1,378 in this study.
Fig. 2. Representative molecular structures of the PIMs and polyimides in the test set: … [37]; (e) Imidazole containing polyimide [42]; (f) Polyimides based on the diethyltoluenediamine isomer mixture [38].
The most important property for gas separation membranes is a high permeability in combination with a high selectivity for the gas pair of interest, which can be examined from the Robeson diagram. We measured the performance of the imputation models using a two-class classification task: polymers with gas selectivity above the Robeson 2008 upper bound were regarded as "positive", and those below the Robeson 2008 upper bound were regarded as "negative". For both the BLR and ERT models, the gas permeabilities of interest were calculated using the permeability of other gases (the prediction using "dense features"), and the positions of the calculated values in the Robeson diagram were computed. The model performance was then evaluated by whether the correct label was assigned to the polymers in the test set. Two of the most reported gas pairs, CO₂/CH₄ and CO₂/N₂, were considered, and we simulated three cases of missing gas permeability for each gas pair. For the CO₂/CH₄ selectivity, for example, we applied our imputation model to the test set under three parallel assumptions: the permeabilities of both CO₂ and CH₄ are missing; only the permeability of CH₄ is missing; and only the permeability of CO₂ is missing. For all three cases, we evaluated the missing gas permeabilities using the permeabilities of all other gases (the "dense features"). The accuracy, precision and recall scores for the BLR prediction of CO₂/CH₄ and CO₂/N₂ selectivity are shown in Table 3, and the scores for the ERT prediction of CO₂/CH₄ and CO₂/N₂ selectivity are shown in Table S2 in the Supporting Information.
The accuracy scores of the BLR model for both gas pairs in all three cases are higher than 0.8. It should be noted, however, that for cases where the permeabilities of both CO₂ and CH₄ (similarly, of both CO₂ and N₂) are missing, the precision and recall scores are rather imbalanced: the precision for almost all predictions in Table 3 is close to perfect, while the recall scores were 0.76 and 0.59 for CO₂/CH₄ and CO₂/N₂, respectively. Such an imbalance indicates that the imputation models are "useful" but not "complete" for cases where the permeability data for both gases of interest is missing: polymers predicted to have good gas selectivity are highly likely to be gas selective following experimental measurements; however, a considerable percentage of the polymers with good gas selectivity are misclassified as "negative" by the BLR model. For cases where the permeability of only one gas (CO₂ or CH₄ for the selectivity of CO₂/CH₄) is missing, the BLR model is much more robust than in the cases where the permeability data of both gases is missing, with accuracy, precision and recall scores ranging from 0.80 to 1.00. It should also be noted that in Table 3, for CO₂/CH₄ and CO₂/N₂, the accuracy, precision and recall scores were all higher than 0.90 for cases where the only missing data was the CH₄ or N₂ permeability. For such cases, the imputation models are both "useful" and "complete": robust predictions about the gas selectivity can be made if the permeability for only CH₄ or N₂ is missing.
Fig. 3. Correlation of BLR prediction and the experimental report of the gas permeability of PIMs (orange data points) and polyimides (blue data points) in the test set. The same comparison using the raw gas permeability in Barrer is shown in Fig. S3 on a linear scale. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
Table 3. Accuracy, precision, and recall score for the BLR in predicting the polymers with gas selectivity above the 2008 Robeson upper bound with permeabilities of different gases missing: the accuracy, precision and recall scores are in the range of 0-1, where the closer a number is to 1, the better the model.
The experimentally measured and BLR-predicted positions of data points in the test set for cases where only the CH₄ or N₂ permeability is missing are shown in Fig. 4. The data cloud of the BLR prediction for both CO₂/CH₄ and CO₂/N₂ overlapped closely with the experimental reports, which is in agreement with the high accuracy, precision and recall scores for the corresponding cases. It is thus possible to identify future polymers with high gas selectivity when not all the gas permeability data is available, or to evaluate the gas selectivity of a previously reported polymer when the gas permeability data is missing for one or more gases.
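The scoring used in this validation (the RMSE of the predicted logarithm permeability, followed by classification against an upper bound) can be sketched as follows. The sketch assumes an upper bound of the form P = k·αⁿ with negative n, where P is the permeability of the more permeable gas and α the selectivity. The values of `k` and `n` below are illustrative placeholders rather than the published Robeson 2008 parameters, which should be taken from Ref. [16], and the four example polymers are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reported and predicted log10 permeabilities."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def above_upper_bound(log_p_fast, log_p_slow, k, n):
    """True if a polymer lies above an upper bound of the form P_fast = k * alpha**n.

    log_p_fast and log_p_slow are log10 permeabilities (Barrer) of the more and less
    permeable gas; alpha = P_fast / P_slow is the selectivity and n is negative.
    """
    log_alpha = log_p_fast - log_p_slow
    log_alpha_bound = (log_p_fast - math.log10(k)) / n  # selectivity on the bound at this permeability
    return log_alpha > log_alpha_bound

def classification_scores(truth, predicted):
    """Accuracy, precision and recall; a score with a zero denominator is reported as 0.0."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    accuracy = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

K, N = 1.0e6, -2.5  # placeholder bound parameters, NOT the published values

# Hypothetical log10 permeabilities: CO2 measured, CH4 measured vs. imputed.
log_co2 = [3.0, 2.0, 4.0, 1.0]
log_ch4_measured = [1.5, 1.0, 3.0, 0.0]
log_ch4_imputed = [1.6, 0.9, 3.3, 0.1]

truth = [above_upper_bound(c, m, K, N) for c, m in zip(log_co2, log_ch4_measured)]
predicted = [above_upper_bound(c, m, K, N) for c, m in zip(log_co2, log_ch4_imputed)]
rmse_ch4 = rmse(log_ch4_measured, log_ch4_imputed)
scores = classification_scores(truth, predicted)
```

In this toy case one polymer that is experimentally above the bound is predicted to fall below it, which lowers the recall while leaving the precision at 1, the same kind of imbalance discussed for Table 3.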
The CO₂/CH₄ and CO₂/N₂ selectivity of polymers has been studied extensively, and it is believed that mobility and sorption both favour the permeation of CO₂, so making predictions for these pairs is probably relatively easy. We therefore also used the test set to investigate the selectivity of H₂/CO₂, a gas pair for which sorption and mobility selectivity are opposed. The accuracy, precision and recall scores for identifying polymers against the Robeson 2008 upper bound are shown in Table 4. These scores were all above 0.83 for prediction when the H₂ permeability is missing, while for cases where the CO₂ or both the CO₂ and H₂ permeability data is missing, the precision of the imputation model decreased considerably. Therefore, our imputation model can be used to evaluate the H₂/CO₂ selectivity without experimentally measuring the H₂ permeability. This may become of considerable practical relevance given the rapidly increasing interest in H₂ as a fuel and given that H₂ production methods require a H₂/CO₂ separation step, for instance from syngas.

Identifying promising candidates in the Polymer Gas Separation Membrane Database
The Polymer Gas Separation Membrane Database contains entries for which some or all of the permeability data for CO₂, CH₄ and N₂ was missing. Upon imputation of the database, the gas selectivity of the candidates with missing values was examined using the imputed gas permeability to identify potential candidates with good CO₂/CH₄ and CO₂/N₂ selectivity. First, before seeking potential candidates whose selectivity was not reported in the database, we asked the question: would it have been possible to identify PIM-1 as a promising separation membrane by applying our imputation model to limited preliminary data? PIM-1 is the archetypal PIM system that initiated the current research interest in PIM separation performance [16]. We revisited the gas selectivity of PIM-1 by separately removing either the experimental N₂ or CH₄ permeability data for PIM-1 from our database and then imputing whichever of the two gases was missing using the BLR model. The comparison of the imputation result and the experimental report is shown in Table 5. The CO₂/N₂ and CO₂/CH₄ selectivities obtained from the BLR model were 25 and 18, respectively, which are close to the experimental measurements (23 and 17, respectively). In addition, the position of PIM-1 obtained via BLR imputation in the Robeson diagram is shown in Fig. 5a and b. It can be seen that PIM-1 lies close to the Robeson 2008 upper bound for both CO₂/N₂ and CO₂/CH₄ selectivity, indicating that our imputation ML model would have been able to identify PIM-1 as a promising separation membrane from limited initial experimental data.
Next, we moved on to explore whether our imputed database could reveal promising selectivities that were not originally reported in the Polymer Gas Separation Membrane Database. As shown in Fig. 5a and b, most of the candidates with missing values had potentially limited gas selectivity for CO₂/CH₄ and CO₂/N₂. However, KAUST-PI-1 reported by Pinnau et al. [17], for which the CO₂ permeability was not reported in the database, was found to have a predicted CO₂/CH₄ selectivity above the Robeson 2008 upper bound and a predicted CO₂/N₂ selectivity close to the Robeson 2008 upper bound. The molecular structures of KAUST-PI-1 and PIM-1 are shown in Fig. 5c and d. Based purely on the ML predictions from existing data in the Polymer Gas Separation Membrane Database, we identified that KAUST-PI-1 has potentially high CO₂/CH₄ selectivity and good CO₂/N₂ selectivity. Our prediction for KAUST-PI-1 was confirmed by further review of the literature, where we found another report on KAUST-PI-1 by Pinnau et al. [45], which was not included in the Polymer Gas Separation Membrane Database. The permeability of KAUST-PI-1 for CO₂, CH₄ and N₂ was reported as an average value from two films. It was found that KAUST-PI-1 exhibited excellent CO₂/CH₄ selectivity, which was above the Robeson 2008 upper bound (as we predicted), while the CO₂/N₂ selectivity was good but just below the Robeson 2008 upper bound (we predicted it to be close to the upper bound). The comparison of the CO₂/CH₄ and CO₂/N₂ selectivity of our prediction and the experimental measurement is shown in Table 5. This cross-validation between the ML prediction and experimental measurements that are not recorded in the Polymer Gas Separation Membrane Database indicates that it is possible to re-analyse historical data and identify potentially "missed" polymers with promising gas selectivity using our ML imputation model.
Fig. 4. BLR prediction and experimental reports of the CO₂/CH₄ and CO₂/N₂ selectivity in the Robeson diagram, for the cases where a) the permeability data of CH₄ is missing; b) the permeability data of N₂ is missing.
Table 4. Accuracy, precision, and recall score for the BLR in predicting the polymers with H₂/CO₂ selectivity above the 2008 Robeson upper bound with permeabilities of different gases missing: the accuracy, precision and recall scores are in the range of 0-1, where the closer a number is to 1, the better the model.

Prediction of gas permeability from a single measurement
During the experimental testing of gas selectivity of new polymers, the gas permeability is usually measured sequentially, and these measurements take considerable time and effort. We gave the BLR and ERT predictors a more challenging, yet rewarding, task to impute the test set with sparse features by removing the gas permeability data of all but one gas and using the permeability of that one gas to predict the permeability for all the other gases.
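The sparse-feature task described above can be sketched as follows: every test-set value except one gas is hidden, the test rows are appended to the training table, and the combined table is imputed. Stacking the test rows with the training data, the use of `IterativeImputer` with `BayesianRidge`, the synthetic data and the helper names are all assumptions made for illustration and are not taken from the published code.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

GASES = ["He", "H2", "O2", "N2", "CO2", "CH4"]
rng = np.random.default_rng(1)

def synthetic_log_permeability(n_rows):
    """Correlated synthetic log10 permeabilities, used only as stand-in data."""
    base = rng.normal(1.0, 1.0, size=(n_rows, 1))
    slopes = np.array([0.8, 0.9, 1.0, 1.1, 1.0, 1.2])
    offsets = np.array([1.5, 1.6, 0.5, -0.2, 1.2, -0.3])
    return base * slopes + offsets + rng.normal(scale=0.05, size=(n_rows, 6))

train = synthetic_log_permeability(300)
train[rng.random(train.shape) < 0.2] = np.nan   # incomplete "database"
test_full = synthetic_log_permeability(40)      # fully measured "test set"

def sparse_feature_prediction(known_gas):
    """Impute all gases of the test rows from the permeability of a single known gas."""
    keep = GASES.index(known_gas)
    test_sparse = np.full_like(test_full, np.nan)
    test_sparse[:, keep] = test_full[:, keep]    # only one gas is retained
    stacked = np.vstack([train, test_sparse])
    imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=50, random_state=0)
    completed = imputer.fit_transform(stacked)
    return completed[len(train):]                # imputed test rows only

predicted = sparse_feature_prediction("CO2")
rmse_by_gas = {
    gas: float(np.sqrt(np.mean((predicted[:, j] - test_full[:, j]) ** 2)))
    for j, gas in enumerate(GASES) if gas != "CO2"
}
```

Repeating the call for each gas in `GASES` gives one row of RMSE values per retained gas, which is the layout one would expect for a comparison of how informative each single measurement is.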
The imputation of the test set was performed following the MICE algorithm using the BLR and ERT model as shown in Algorithm 1 and the RMSE for the predictions is shown in Table 6. The correlation between gas permeability of pairs of gases can be observed from the RMSE results in Table 6. For example, it can be observed that the permeability of H 2 and He are strongly correlated, since the permeability of H 2 solely is a strong feature in predicting the permeability of He, with RMSE of 0.05 and 0.10 for the BLR and ERT model, respectively. The permeability of He, on the other hand, is a rather weak feature in predicting the permeability of other gases. This is purely due to the lack of sufficient experimental data for He permeability in the membrane database, and therefore in our test set. Indeed, 48% of the polymers in the test set lack Table 5 Comparison of the CO 2 /CH 4 and CO 2 /N 2 selectivity of the ML prediction and experimental report for KAUST-PI-1 [45] and PIM-1 [46]. CO No  Yes  Experimental measurement  2398  22  23  No  Yes  PIM-1  BLR prediction  2300 c  23 c  17 c  No  No  Experimental measurement  2300  25  18 n/a n/a a Whether or not the CO 2 /N 2 and CO 2 /CH 4 selectivity is above the Robeson 2008 upper bound. b The CO 2 permeability was calculated using our BLR model, the N 2 and CH 4 permeabilities were collected from the Polymer Gas Separation Membrane Database.
The permeability data is in Barrer. c The CO 2 permeability of PIM-1 was reported by Ref. [16] and the CH 4 and N 2 permeability of PIM-1 was imputed using our BLR model. the experimental He permeability, thus permeability of He is a weak feature for a machine learning model. With more data points for the permeability of He experimentally measured and reported in the future, it would be possible to improve the predictive power using He permeability as a feature in the imputation model. With the imputation using sparse features, O 2 and CO 2 permeability was the strongest indicator of the permeability of the other gases. According to Table 6, the average RMSE of the BLR model for predicting permeability of other gases using data for O 2 and CO 2 are 0.25 and 0.27; and the RMSE of the ERT model using data for O 2 and CO 2 are 0.28 and 0.23, respectively. The order of reliability of prediction from permeability of a single gas for BLR model is O 2 > CO 2 > N 2 > CH 4 > He, and the order of reliability for the ERT model is CO 2 To simulate the scenario where the experimental permeability of a new polymer for only one gas has been measured and one wants to evaluate the gas selectivity of the polymer without experimentally measuring the gas permeability of the other gases, we examined specifically the performance of CO 2 permeability in predicting whether the polymer is above the Robeson 2008 upper bound for CO 2 /CH 4 and CO 2 / N 2 . The accuracy, precision and recall scores for the BLR and ERT prediction of CO 2 /CH 4 and CO 2 /N 2 selectivity using only CO 2 permeability are shown in Table 7. The ERT model outperformed the BLR model for both the selectivity of CO 2 /CH 4 and CO 2 /N 2 in the "sparse feature" case. It should be noted that for the BLR model, the recall scores are very low, and the precision and recall for CO 2 /CH 4 are both 0.00, which indicates that according to the BLR model, all polymers in the test set are "negative". 
The ERT model, on the other hand, yields robust prediction scores for both the CO2/CH4 and CO2/N2 selectivity, although its recall score for the CO2/CH4 selectivity is only moderate. The ERT model may outperform the BLR model in the "sparse feature" case because the linear BLR model learned a stricter relationship between the pairwise gas permeabilities from the Polymer Gas Separation Membrane Database. This enabled accurate prediction of gas permeability in the "dense feature" case, but limited the generalizability of the model in the "sparse feature" case.
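The all-"negative" behaviour noted for the BLR model can be illustrated with a minimal sketch of how accuracy, precision and recall respond when a classifier never predicts a positive. The labels below are hypothetical and are not taken from the test set; the function name is illustrative, and reporting precision as 0.0 when no positives are predicted is an assumed convention, since the ratio is formally undefined in that case.

```python
def classification_scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for boolean labels.

    True means "above the upper bound". Precision and recall are reported
    as 0.0 when their denominator is zero (assumed convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical test set: 2 of 10 polymers lie above the upper bound.
y_true = [False] * 8 + [True] * 2
# A model that labels every polymer "negative".
all_negative = [False] * 10

scores = classification_scores(y_true, all_negative)
print(scores)
```

Because polymers above the upper bound are rare, such a model can still reach a high accuracy while its precision and recall are both zero, which is why the three scores need to be read together.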
It should be noted that the ERT model is not deterministic and may give slightly different results from run to run if different random seeds are used. In this study, we built the ERT model from a combination of 100 decision trees, which reduced the probability of high variance in the predictions. In addition, parallel ERT tests with different random seeds were performed, and the RMSE across the ERT models with different seeds in the "sparse feature" case was smaller than 0.02. Thus, we believe that the ERT model is robust in predicting the CO2/CH4 and CO2/N2 selectivity from the permeability of CO2. We suggest that once the CO2 permeability of a polymer has been measured, researchers can use the ERT model to estimate the N2 and CH4 permeabilities quantitatively and so gain preliminary insight into the CO2/CH4 and CO2/N2 selectivity of that polymer. Similarly, if only one gas pair (CO2/CH4 or CO2/N2) has been tested, this method is of high predictive value for the other gas pair. This may save time in future work, because fewer experiments will be needed to screen the potential performance of new materials, and it may be particularly helpful in evaluating existing materials outside the application field for which they were originally developed. For instance, many polymers were studied for carbon capture from flue gas, where CO2/N2 separation is relevant, but they may be equally interesting for the rapidly emerging application field of biogas upgrading, where CO2/CH4 separation is important.
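The seed check and the single-gas estimate described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes scikit-learn's `IterativeImputer` as a stand-in for the MICE procedure of Algorithm 1 and `ExtraTreesRegressor` with 100 trees as the ERT model, it assumes base-10 logarithms, and it trains on synthetic, made-up permeabilities rather than on the database. The helper names `synthetic_log_permeability` and `impute_from_co2` are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

GASES = ["He", "H2", "O2", "N2", "CH4", "CO2"]


def synthetic_log_permeability(n_rows, rng):
    """Made-up, correlated log10 permeabilities (NOT database values)."""
    base = rng.normal(1.0, 1.0, size=(n_rows, 1))
    offsets = np.array([1.2, 1.3, 0.3, -0.3, -0.2, 1.0])
    data = base + offsets + rng.normal(0.0, 0.1, size=(n_rows, len(GASES)))
    # Knock out about 20% of the entries to mimic an incomplete database.
    data[rng.random(data.shape) < 0.2] = np.nan
    return data


def impute_from_co2(train, log_p_co2, seed):
    """Fit a chained-equation imputer, then fill a row holding only CO2."""
    imputer = IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=100, random_state=seed),
        max_iter=5,
        random_state=seed,
    )
    imputer.fit(train)
    query = np.full((1, len(GASES)), np.nan)
    query[0, GASES.index("CO2")] = log_p_co2
    return imputer.transform(query)[0]


rng = np.random.default_rng(0)
train = synthetic_log_permeability(200, rng)

# Repeat the imputation with different seeds to gauge run-to-run variation.
predictions = np.array(
    [impute_from_co2(train, 2.0, seed) for seed in (0, 1, 2)]
)
seed_spread = predictions.std(axis=0)

# Ideal selectivity is the ratio of permeabilities, i.e. a difference in log10.
co2, n2 = GASES.index("CO2"), GASES.index("N2")
selectivity_co2_n2 = 10.0 ** (predictions[:, co2] - predictions[:, n2])
print(seed_spread, selectivity_co2_n2)
```

The measured CO2 value is passed through unchanged, so any spread between seeds appears only in the imputed gases, which is the quantity the seed comparison is meant to monitor.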
Although they do not have the full predictive power of other methods [24,29], the models presented in this work have the advantage that they do not require any knowledge of the polymer structure and that they work for polymers measured under different conditions (such as aging and solvent treatment), which makes this a fast and versatile approach. For the rapid screening of polymers, especially those produced via high-throughput techniques, predicting the full range of gas permeabilities from a single rapid measurement could be highly beneficial to researchers, especially as the gas to be measured can be chosen to avoid stringent local safety regulations (e.g. for H2 or CH4) or high costs (e.g. for He). Our ML model for this purpose is open-source and thus available to all experimental researchers in the field. Our methodology must be used with caution when evaluating polymers that may have non-standard solubility selectivity due to enhanced interaction (e.g. amines with CO2) or poor interaction (e.g. fluorinated polymers with CH4) with a particular gas.

Conclusions
The missing values for the permeability of He, H2, O2, N2, CH4 and CO2 in the online Polymer Gas Separation Membrane Database of the Membrane Society of Australasia were imputed using the MICE algorithm combined with Bayesian Linear Regression and Extremely Randomized Trees. Based on the imputed database, we suggested that KAUST-PI-1 has potentially high CO2/CH4 selectivity and good CO2/N2 selectivity, which was confirmed by experimental work that was not recorded in the database. The imputed database can serve as the training set for future gas separation polymers, and the gas permeability and selectivity of newly synthesized polymers can be predicted using the ML models in this work. Such models rely purely on experimental measurements of the permeability of one or more gases and are applicable across different experimental conditions. Validation of the imputation model against unseen data suggests that the gas permeability can be modelled with reasonable accuracy. Furthermore, it is possible to evaluate the gas selectivity of polymer membranes for natural gas sweetening or biogas upgrading (CO2/CH4), carbon capture (H2/CO2 and CO2/N2), and clean fuel production (H2/CO2).
Our results for ML models using "sparse features" suggest that the permeability of He, H2, O2, N2 and CH4 can be quantitatively estimated from the gas permeability of O2 and/or CO2. Specifically, the ERT model is robust in predicting the CO2/CH4 and CO2/N2 selectivity from the permeability of CO2. For cases with "dense features", where the permeability of multiple gases has already been measured, we suggest that the BLR model can provide accurate imputation of the remaining gas permeabilities. For cases with "sparse features", on the other hand, the ERT model is recommended for making quantitative predictions of the permeability of untested gases, given that the CO2 permeability has been measured. In summary, preliminary insight into the gas permeability of polymers can be gained at the initial stage of experimental measurements, and our model has the potential to rapidly identify polymer membranes worth further investigation, both for the separations of primary interest and for separations other than those for which they were originally designed.

Table 6. RMSE of the BLR and ERT predicted gas permeability in logarithm Barrer against the experimental reports in the test set. Each column corresponds to a completed imputation with the MICE algorithm using the permeability of only the gas in that column as input. The RMSE values in bold show the best 'feature' for predicting the gas permeability of the corresponding 'target'.

Table 7. Accuracy, precision, and recall scores for the BLR and ERT models in predicting the polymers with gas selectivity above the 2008 Robeson upper bound using only the permeability of CO2 (the "sparse feature"). The accuracy, precision and recall scores are in the range 0-1; the closer a score is to 1, the better the model.
As more data points are added to the Polymer Gas Separation Membrane Database, particularly for rarely reported sorbents and novel polymers, there will eventually be sufficient data for the ML prediction of the permeability of further gases, such as ethylene, ethane, propylene, propane, and CF4, based only upon initial measurements of CO2 and O2. In addition, as larger experimental datasets become available, it would be possible to develop additional ML models using data from different groups of polymers. For example, given sufficient data for both groups, separate ML imputation models could be developed for rubbery polymers and glassy polymers, where gas transport is dominated by solubility-selectivity and size-selectivity, respectively. This would be of significant advantage to researchers, greatly accelerating the assessment of new polymer membranes at much lower experimental cost. We strongly encourage researchers to report all measured permeability data for membranes in their papers and to upload these to the Polymer Gas Separation Membrane Database; this open data effort would benefit the whole polymer membrane community.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.