1 Introduction

Perovskite solar cells are currently one of the best alternatives to replace silicon solar cells due to their high absorption coefficient, low cost, and ease of manufacture [1, 2]. As a result, there are numerous publications attempting to cover all the different issues associated with these cells.

However, the performance of Perovskite solar cells (PSCs) is highly sensitive to the physicochemical properties of the Perovskite material, including the crystal structure, composition, and morphology of the Perovskite film. Information about these properties (under real rather than simulated conditions) can be extracted from scientific articles and stored in data sets, as already done in [3,4,5,6]. Such information is then encoded into variables called descriptors or characteristics. It is of considerable importance for answering questions of interest in Perovskite materials science, e.g., determining the most relevant descriptors and, therefore, the most influential synthesis conditions from the point of view of solar cell performance.

It is important to take into account that the traditional way to develop new materials is usually based on trial and error, which is time-consuming and expensive [7]; thus, knowing which variables are the most relevant could have a significant impact. In fact, researchers can develop more efficient solar cells by understanding how different processing techniques, parameters, and elements affect their performance [8,9,10], with consequent savings in time and resources during the perovskite cell synthesis process.

On the other hand, interest in machine learning has recently increased among materials science researchers [11,12,13,14]. From a machine learning perspective, knowing the most relevant descriptors (those most statistically related to performance) makes it possible to build more precise and accurate models [14, 15] by reducing model complexity and avoiding overfitting [16]. In addition, identifying the key features closely related to the target properties before model construction yields simpler and more explainable models.

There are several methods for selecting relevant features. Wrapper methods, a widely used family in machine learning, measure the relevance of descriptors through the predictive power of a model fitted to the available data [17]. They can also find relevant subsets of features by estimating the predictive power of each candidate subset. These methods are usually computationally expensive, and their results are valid only with respect to the particular model used [18]. Filter methods, on the other hand, tend to be faster and computationally cheaper, their results are easier to interpret, and those results are independent of any model because no model is fitted. These methods are typically based on statistical association measures between random variables.

Among those statistical association measures, Pearson’s correlation coefficient offers simplicity and ease of interpretation, but it assumes that the relationship between the random variables involved is linear, which may not hold for relationships as complex as those found in materials science. Even so, the Pearson coefficient was used in [19] as a filter method to select input descriptors for machine learning algorithms, in [20, 21] to quantify the linear correlation between synthesis descriptors of Perovskite solar cells, and in [9, 22] to assess the quality of the predictions of five machine learning algorithms. This metric, however, reveals only linearly correlated variables and ignores the rest. Rank correlation measures work for non-linear phenomena, but only if the underlying relationship is monotonic [23]. In contrast, mutual information (MI) measures are able to capture general non-linear dependencies between solar cell descriptors, which is more appropriate. Furthermore, mutual information is relatively resistant to noise and outliers, and it requires no assumptions about the probability distributions of the variables involved. These properties make MI a suitable measure of statistical association and, therefore, of the relevance of the descriptors with respect to important variables associated with the performance of Perovskite solar cells.

Identifying relevant descriptors for Perovskite solar cell synthesis is crucial for advancing the field. Identifying the key factors that influence performance can streamline the development process, leading to more efficient and cost-effective solar cell production [24]. This knowledge could reduce the need for time-consuming and resource-intensive trial-and-error methods and allow for targeted experimentation and optimization. Scientists can accelerate progress towards high performance and stable Perovskite solar cells by understanding the meaning of these descriptors.

In the present work, it is proposed to use the measure of mutual information in order to quantify the amount of information contained in the descriptive variables of the synthesis process with respect to physicochemical properties of Perovskite solar cells. These properties measure the Perovskite solar cell performance and correspond to Open Circuit Voltage (Voc), Short Circuit Current Density (Jsc), Fill Factor (FF), and Power Conversion Efficiency (PCE).

2 Method

2.1 Data

The data used in the present study were taken from the dataset published in [25], which consists of 43,239 records with 411 descriptors: 262 inputs (characteristics) that describe the synthesis process of the Perovskite solar cell and 149 outputs including the performance values \(V_{oc}, \,\, J_{sc},\,\, FF,\,\, PCE \). These data were extracted by manual review from 16,000 scientific articles published from 2008 (the first studies) to 2020. The descriptors can be Boolean, categorical, or numeric (integer or float).

We chose to analyze the performance parameters \(V_{oc}\), \(J_{sc}\), FF, and PCE. During the pre-processing stage, input variables with zero variance were discarded because they do not provide any information. Additionally, categorical variables with more than 100 categories were discarded because they could increase the uncertainty of the mutual information estimate. Finally, observations corresponding to Perovskites with more than one layer were removed in order to reduce the complexity of the analysis.
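The two column-level filters described above can be sketched as follows. This is only an illustration under stated assumptions: the column names, the `kinds` labels, and the helper `filter_descriptors` are hypothetical, and the toy threshold of 3 categories stands in for the threshold of 100 used in the study.

```python
def filter_descriptors(columns, kinds, max_categories=100):
    """Keep descriptors that vary and, if categorical, have few enough levels.

    columns: dict mapping name -> list of values (None marks a missing entry)
    kinds:   dict mapping name -> "numeric", "boolean" or "categorical"
             (hypothetical labels, not taken from the dataset)
    """
    kept = {}
    for name, values in columns.items():
        levels = {v for v in values if v is not None}
        if len(levels) <= 1:
            continue  # zero variance: a constant column carries no information
        if kinds[name] == "categorical" and len(levels) > max_categories:
            continue  # too many categories for a reliable MI estimate
        kept[name] = values
    return kept


# Toy data with hypothetical descriptor names.
columns = {
    "thickness": [300.0, 450.0, None, 500.0],
    "always_true": [True, True, True, True],
    "solvent": ["DMF", "DMSO", "DMF", None],
    "free_text": ["recipe-0", "recipe-1", "recipe-2", "recipe-3"],
}
kinds = {
    "thickness": "numeric",
    "always_true": "boolean",
    "solvent": "categorical",
    "free_text": "categorical",
}
kept = filter_descriptors(columns, kinds, max_categories=3)
print(sorted(kept))
```

In this toy case the constant column and the high-cardinality column are dropped, while the two informative columns are kept.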

Of the 49 variables resulting from the pre-processing, 9 are categorical, 23 are numeric, and 17 are Boolean. Categorical and Boolean variables were converted to numeric variables using the LabelEncoder tool from the sklearn Python library. In addition, the data were encoded to represent the Perovskite material in terms of the proportions of the elements in the Perovskite layer: MA, methylammonium; FA, formamidinium; Cs, cesium; Pb, lead; Sn, tin; I, iodine; Br, bromine; Cl, chlorine. This representation is easier to interpret. Three further variables were created to represent the A, B, and X ions of the Perovskite structure, where \(A = MA - FA -Cs\), \(B = Pb - Sn\), and \(X = I - Br - Cl\).
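As a minimal sketch of the encoding step, the snippet below applies `LabelEncoder` from scikit-learn to a single categorical column. The solvent values are hypothetical and serve only to show the mapping from categories to integer codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical descriptor (not taken from the dataset).
solvents = ["DMF", "DMSO", "DMF", "GBL"]

encoder = LabelEncoder()
codes = encoder.fit_transform(solvents)  # one integer code per category

print(list(encoder.classes_))  # categories in sorted order
print(codes.tolist())          # integer code assigned to each record
```

The integer codes are arbitrary labels, so their ordering carries no physical meaning; the mapping can be reversed with `encoder.inverse_transform`.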

2.2 Mutual information

The mutual information \(I(X, Y)\) is a measure of statistical dependence between two random variables X and Y. It is symmetric in X and Y, that is, \(I(X, Y) = I(Y, X)\); it is non-negative, \(I(X, Y) \ge 0\); and it is equal to zero if and only if X and Y are independent random variables. The MI between two random variables X and Y, with joint density \(f_{X, Y}(x, y)\), is defined as [26],

$$\begin{aligned} I(X, Y) = \int \int f_{X, Y}(x, y) \log \frac{f_{X, Y}(x, y)}{f_X(x) f_Y(y)} \, dx \, dy \end{aligned}$$
(1)

Mutual information (MI) can also be expressed in terms of the entropy \(H(\cdot )\), which is a measure of the uncertainty of a random variable. \(I(X, Y)\) is the reduction in the uncertainty of one random variable due to knowledge of the other. In particular, \(I(X, Y) = H(X) - H(X \mid Y)\) is the reduction in the uncertainty of X due to knowledge of Y, and \(I(X, Y) = H(Y) - H(Y \mid X)\) is the reduction in the uncertainty of Y due to knowledge of X. Equivalently, \(I(X, Y) = H(X) + H(Y) - H(X, Y)\).
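The identity \(I(X, Y) = H(X) + H(Y) - H(X, Y)\) can be checked numerically in the discrete case. The sketch below assumes a known joint probability table; the two 2x2 tables are illustrative choices, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) for joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]           # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]     # marginal distribution of Y
    pxy = [p for row in joint for p in row]    # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)


independent = [[0.25, 0.25], [0.25, 0.25]]  # P(x, y) = P(x) P(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

For the independent table the MI vanishes, whereas for the fully dependent table it equals \(H(X) = \log 2\), i.e., knowing Y removes all uncertainty about X.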

\(H(\cdot )\) can also be viewed as the amount of information required, on average, to describe the random variable. For a continuous random variable, the term differential entropy is typically used instead of entropy because not all the properties of discrete entropy carry over to the continuous case. The differential entropy H(X) of a continuous random variable X with density \(f_X(x)\) is defined as [26],

$$\begin{aligned} H(X) = - \int f_X(x) \log f_X(x) \, dx \end{aligned}$$
(2)

Although mutual information is able to detect and quantify non-linear relationships between random variables, the interpretation of its value is less intuitive than that of the Pearson correlation \(\rho \). Pearson’s correlation provides standardized values between -1 and 1 that indicate the strength and direction of the relationship. In contrast, MI takes only non-negative values, and they are not standardized.

A transformation of the MI value, called the informational correlation coefficient (ICC), is proposed in [27]. It provides a standardized version (zero as the minimum value and one as the maximum value) that allows comparisons with Pearson’s correlation \(\rho \). For a bivariate normal distribution, the ICC is equal to \(\rho \). Recently, a modified version of \(\rho _{ICC}\), denoted \(\rho _{MICC}\) (modified informational correlation coefficient, MICC), was proposed in [28] in order to improve the performance of the ICC and to reduce its bias. This transformation is given by

$$\begin{aligned} \rho _{MICC}(x, y) = \sqrt{ 1 - \frac{2}{\mathcal {W}(2 \exp {2(1 + I(x, y))})} } \end{aligned}$$
(3)

where \(\mathcal {W}(\cdot )\) is the Lambert W function and \(I(x, y)\) is the estimated mutual information value. For example, an MI value of 0.2 would be comparable to a Pearson’s correlation value of approximately 0.34.
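Equation (3) can be evaluated with a few lines of code. The sketch below is an assumption-laden illustration: it computes the principal branch of the Lambert W function with a plain Newton iteration (a library routine could be used instead), and the function names `lambert_w` and `micc` are hypothetical. It assumes a non-negative MI value expressed in nats.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch of Lambert W for x >= e, via Newton's method on w*exp(w) = x."""
    w = math.log(x)  # for x >= e this start lies at or above the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def micc(mi):
    """Eq. (3): standardized coefficient in [0, 1) from an MI value in nats (mi >= 0)."""
    w = lambert_w(2.0 * math.exp(2.0 * (1.0 + mi)))
    return math.sqrt(max(0.0, 1.0 - 2.0 / w))  # clip tiny negative round-off


for mi_value in (0.0, 0.07, 0.2, 1.0):
    print(mi_value, round(micc(mi_value), 3))
```

An MI of zero maps to a coefficient of zero, since \(\mathcal {W}(2e^{2}) = 2\), and an MI of 0.2 maps to a value close to the one quoted above.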

MI is useful to determine those descriptors with the highest statistical association with respect to performance parameters. However, due to the interaction between input descriptors, the MI by itself still does not answer the question about which is the best set of descriptors that as a whole provides the highest information [29]. Conditional mutual information, as a concept, could be used to deal with that question. It is defined as the reduction in the uncertainty of X due to knowledge of Y when Z is given or provided [30]:

$$\begin{aligned} I(X, Y\mid Z)= & {} H(X, Z) + H(Y, Z) \nonumber \\{} & {} - H(X, Y, Z) - H(Z) \end{aligned}$$
(4)
$$\begin{aligned} I(X, Y\mid Z)= & {} \int \int \int f_{X, Y, Z}(x, y, z) \nonumber \\{} & {} \log \left[ \frac{ f_{X, Y\mid Z}(x, y\mid z) }{ f_{X\mid Z}(x\mid z)\cdot f_{Y\mid Z}(y\mid z) } \right] dx \, dy \, dz \end{aligned}$$
(5)
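To illustrate why conditional mutual information captures interactions that pairwise MI misses, the following sketch evaluates Eq. (4) with plug-in (frequency-based) entropies for discrete samples. The XOR example and all function names are illustrative assumptions; this is not the estimator used for the results in this work.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (nats) from a list of hashable observations."""
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in Counter(samples).values())


def mutual_info(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(list(xs)) + entropy(list(ys)) - entropy(list(zip(xs, ys)))


def conditional_mutual_info(xs, ys, zs):
    """Eq. (4): I(X, Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(list(zs)))


# X is the exclusive-or of Y and Z: Y alone says nothing about X,
# but Y becomes fully informative once Z is known.
ys = [0, 0, 1, 1]
zs = [0, 1, 0, 1]
xs = [y ^ z for y, z in zip(ys, zs)]

mi_xy = mutual_info(xs, ys)
cmi_xy_given_z = conditional_mutual_info(xs, ys, zs)
print(mi_xy, cmi_xy_given_z)
```

Here the pairwise MI between X and Y is zero while the conditional MI equals \(\log 2\), which is the kind of joint effect that a descriptor-by-descriptor ranking cannot reveal.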

2.3 Mutual information estimation

As observed in Eq. (1), calculating MI is straightforward if the underlying joint probability distribution \(f_{X, Y}(x, y)\) is known; however, it is typically unknown, and our knowledge of the distribution comes from the data itself. For discrete variables, estimating the joint probability mass function from the data is straightforward; this is not the case for continuous variables, for which non-parametric methods are required. They make use of the geometry of the underlying sample to estimate the local probability density \(f(x_i, y_i)\) from the data \((x_i, y_i)\) [31]. The most popular method is the non-parametric estimator introduced in [32], which estimates MI from k-nearest neighbour statistics: it estimates the density of data points in the feature space and uses this information to compute the mutual information. For the case in which there are both discrete and continuous variables, improved versions have been proposed, such as the one in [33].

We used the scikit-learn Python library to estimate MI; its implementation is based on the k-nearest neighbour methods of [32] and [33]. Although this tool can process all the input variables at once, we carried out the estimation variable by variable: the high number of missing values in the input variables of the dataset reported in [4] reduces the amount of complete data available, which leads to the curse of dimensionality.
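A minimal sketch of the variable-by-variable estimation is given below. It uses `mutual_info_regression` from scikit-learn and, for each descriptor, keeps only the records in which both that descriptor and the target are observed. The synthetic data, the descriptor names, the helper `mi_per_descriptor`, and the choice of three neighbours are assumptions made for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression


def mi_per_descriptor(X, y, names, n_neighbors=3, seed=0):
    """Estimate MI (nats) between each column of X and y, one column at a time.

    Missing entries are NaN; each column uses its own set of complete records.
    Returns a dict: name -> (MI estimate, number of records used).
    """
    results = {}
    for j, name in enumerate(names):
        observed = ~np.isnan(X[:, j]) & ~np.isnan(y)
        mi = mutual_info_regression(
            X[observed, j].reshape(-1, 1),
            y[observed],
            n_neighbors=n_neighbors,
            random_state=seed,
        )
        results[name] = (float(mi[0]), int(observed.sum()))
    return results


# Synthetic example: one descriptor with a quadratic effect, one unrelated.
rng = np.random.default_rng(0)
n = 600
X = rng.uniform(-1.0, 1.0, size=(n, 2))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=n)
X[rng.random(n) < 0.3, 1] = np.nan  # about 30% missing in the second column

results = mi_per_descriptor(X, y, ["quadratic", "unrelated"])
for name, (mi, used) in results.items():
    print(f"{name}: MI = {mi:.3f} nats from {used} records")
```

Because each descriptor is handled separately, a column with many missing values reduces only its own sample size rather than that of every other descriptor.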

Fig. 1 Mutual information values for different performance parameters

Fig. 2 Perovskite layer thickness vs PCE

We performed cross-validation and bootstrapping to obtain an MI estimate with less uncertainty, together with its associated standard error. In particular, 10-fold cross-validation yields a \(10\times 1\) vector of MI estimates. The average of this vector is taken as one instance of a 10-times bootstrapping process; the average of these ten values is reported as the final MI estimate, and their standard deviation as the standard error of the MI estimate.
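One possible reading of this procedure is sketched below. Several details are assumptions, since the text does not fix them: each bootstrap instance resamples the records with replacement, each fold estimate is computed on the data with one fold left out, and a simple plug-in estimator for discrete pairs stands in for the k-nearest neighbour estimator. The names `plugin_mi` and `cv_bootstrap_mi` are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def plugin_mi(pairs):
    """Plug-in MI (nats) for discrete (x, y) pairs; a stand-in estimator."""
    n = len(pairs)
    cxy = Counter(pairs)
    cx = Counter(x for x, _ in pairs)
    cy = Counter(y for _, y in pairs)
    return sum(c / n * math.log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())


def cv_bootstrap_mi(pairs, estimate_mi, n_folds=10, n_boot=10, seed=0):
    """Return (final MI estimate, standard error) from n_boot bootstrap instances."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        # Assumption: one bootstrap instance = resampling with replacement.
        sample = [rng.choice(pairs) for _ in pairs]
        fold_estimates = []
        for k in range(n_folds):
            # Assumption: estimate on the data with fold k left out.
            kept = [p for i, p in enumerate(sample) if i % n_folds != k]
            fold_estimates.append(estimate_mi(kept))
        boot_means.append(statistics.mean(fold_estimates))
    return statistics.mean(boot_means), statistics.stdev(boot_means)


# Toy data: Y is a copy of a balanced binary X, so the true MI is log 2.
pairs = [(i % 2, i % 2) for i in range(200)]
mi_mean, mi_se = cv_bootstrap_mi(pairs, plugin_mi)
print(f"MI = {mi_mean:.3f} +/- {mi_se:.3f} nats")
```

Any estimator with the same call signature, such as a wrapper around a k-nearest neighbour estimator, could be passed in place of `plugin_mi`.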

3 Results and discussion

MI estimates with respect to each of the performance variables are shown in Fig. 1, which displays the 20 most relevant variables among the 49 included in this study. In general, ion X is the factor with the greatest impact on the performance of the solar cell. Variables such as band gap, Perovskite layer thickness, and A and B ions are also important.

Iodine concentration consistently appears as the most relevant descriptor from the perspective of PCE, \(V_{oc}\), \(J_{sc}\), and FF, as observed in Fig. 1. The presence of iodine in the absorber layer contributes to the adjustment of the bandgap, thus improving \(V_{oc}\) [34]. Moreover, it helps to obtain films with larger grain sizes and fewer defects [35, 36]. Bromine concentration is another feature with one of the highest MI values. This shows the importance of the anion X in the perovskite structure; bromine is used to improve the solar cell performance and to reduce the effects of iodine in the cells.

MA, FA, and Cs concentrations are also variables with remarkable relevance. This result seems to be in agreement with [20], which concludes that A-site cations have the most significant influence on PCE. In that same work, regression techniques such as XGBoost were used to determine the most relevant descriptors. In the present work, although MA is not the descriptor that contributes the most information, it appears among the most important. In [20], a Pearson correlation matrix shows considerable correlation between PCE and A-site cations. On the other hand, Eg is relevant with respect to PCE, \(V_{oc}\), and \(J_{sc}\). It is important to take into account the intrinsic relationship between Eg and \(V_{oc}\).

A matrix of Pearson correlation values was obtained in [20] in order to detect interactions between descriptors. Similarly, but for the case of non-linear relationships, in the present work we obtained a matrix of statistical associations between descriptors (see Figure S1 in the supporting information). In [20], as well as in the present work, considerable correlations between PCE and A-site cations are observed.

Results on correlations with Tperovskite are not reported in [20]. When we estimate the Pearson correlation coefficient between Tperovskite and PCE, we obtain a value of \(-0.0005\), indicating that there is no linear relationship between the two variables; the mutual information, however, gives a value of 0.07, indicating that a relationship does exist. A scatter plot of Tperovskite against PCE is shown in Fig. 2.
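The contrast between the two measures can be reproduced on synthetic data. The sketch below uses a purely quadratic relationship, for which the Pearson coefficient vanishes by symmetry, and a simple histogram-based MI estimate. Both the data and the binned estimator are illustrative assumptions; they are not the Tperovskite data nor the k-nearest neighbour estimator used in this work.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def to_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    return [min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1) for v in values]


def binned_mi(xs, ys, n_bins=5):
    """Histogram (plug-in) MI estimate in nats; a simple stand-in estimator."""
    bx, by = to_bins(xs, n_bins), to_bins(ys, n_bins)
    n = len(xs)
    cxy, cx, cy = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (cx[i] * cy[j]))
               for (i, j), c in cxy.items())


# Symmetric, non-monotonic relationship: y = x^2 on [-1, 1].
xs = [i / 50 for i in range(-50, 51)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
mi = binned_mi(xs, ys)
print(f"Pearson r = {r:.6f}, binned MI = {mi:.3f} nats")
```

The Pearson coefficient is numerically zero even though y is a deterministic function of x, whereas the MI estimate is clearly positive.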

Figure S1 (included in the supporting information) shows that there are interactions between descriptors. For example, there is a strong association between FA (formamidinium ratio) and MA (methylammonium ratio), suggesting that only one of the two should be included in a feature set. The same is true for iodine content I and bromine content Br. Eg and Tperovskite also show a considerable relationship. In this scenario, it is appropriate to apply feature selection methods based on partial information measures, in particular, conditional mutual information.

Fig. 3 Perovskite storage relative humidity vs performance parameters

Regarding conditional mutual information (CMI), the results for the first-order case are shown in Table 1. According to these results, if we take Tperovskite as the best variable (as shown in Fig. 1), then the variable that best complements it, that is, the one that provides the most additional information beyond that already provided by Tperovskite, is I.

Table 1 Conditional mutual information for the 10 most important features

It is important to clarify that several descriptors contained in the data set were not taken into account because too little data was available for them. That is, variables with few observations were discarded in order to provide a more reliable estimate of the mutual information. Although the dataset consists of more than 40,000 observations, it is plagued by missing data. In particular, pairing relative humidity with PCE (see Fig. 3) yields only 67 observations out of 42,000. The graph shows lower PCE values at very low and at high relative humidity; the best PCE values occur for relative humidity between 30 and \(40\%\). In other words, statistical association measures are needed that can detect non-linear relationships in the presence of missing data.

4 Conclusions and future work

We introduce a method that quantifies the degree of statistical association between descriptors of Perovskite solar cells. It is able to measure this association even when the relationships between descriptors are non-linear; moreover, since no model is fitted, the estimate does not depend on any model, thus achieving a general quantification of input feature relevance. With this study, we have found that ion X is the factor with the greatest impact on the performance of the solar cell. Variables such as band gap, Perovskite layer thickness, and A and B ions are also important.

Regarding future work, the amount of missing data and the curse of dimensionality make it difficult to estimate the joint mutual information between sets of input descriptors and performance measures in order to establish the optimal set of features. It is important to take into account that mutual information estimation involves estimating N-dimensional probability density functions. Feature selection by means of fitted models (wrapper methods) would encounter similar problems: as the dimension of the model increases, the number of available complete observations decreases, so fewer observations remain as input descriptors are added. We would be discarding information as we increase the model complexity. Therefore, feature selection methods suited to missing data problems would be necessary. Another strategy is to apply techniques such as MICE (Multiple Imputation by Chained Equations) to impute missing values before estimation.
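The loss of complete observations as descriptors are added can be made concrete with a short sketch. The records, descriptor names, and the helper `complete_case_counts` below are hypothetical and serve only to illustrate the effect.

```python
def complete_case_counts(rows, order):
    """For each prefix of `order`, count the rows with no missing value in it."""
    counts = []
    for k in range(1, len(order) + 1):
        cols = order[:k]
        counts.append(sum(all(row[c] is not None for c in cols) for row in rows))
    return counts


# Hypothetical records; None marks a value not reported in the source article.
rows = [
    {"thickness": 300, "band_gap": 1.60, "humidity": None},
    {"thickness": 450, "band_gap": None, "humidity": 35},
    {"thickness": 500, "band_gap": 1.55, "humidity": 40},
    {"thickness": None, "band_gap": 1.60, "humidity": None},
]

counts = complete_case_counts(rows, ["thickness", "band_gap", "humidity"])
print(counts)
```

Each added descriptor can only shrink the set of complete records, which is why joint estimates over many descriptors quickly run out of data.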