Chemometrics: Theory and Application


Introduction
This chapter presents chemometrics as an important area of chemistry that helps in handling the large amounts of data obtained in chemical analysis. The term chemometrics was introduced in the early 1970s by Svante Wold (Sweden) and Bruce Kowalski (USA). According to the International Chemometrics Society, founded in 1974, the accepted definition of chemometrics is the chemical discipline that uses mathematical and statistical methods (i) to design or select optimal measurement procedures and experiments and (ii) to provide maximum chemical information by analyzing chemical data [1]. When a study involves many variables it becomes a multivariate analysis, so it is necessary to build a data matrix, and it is usual to pre-process it. Pre-processing adjusts factors that are measured in different units and have different magnitudes, so that each factor has the same chance to contribute to the model. The next step is usually a pattern recognition method, used to find similarities in the data. Pattern recognition methods fall into an unsupervised group, which includes HCA and PCA, and a supervised group, which includes KNN. Hierarchical Cluster Analysis (HCA) examines the distances among the samples in a two-dimensional plot (dendrogram) and clusters similar samples (Figure 1). Principal Component Analysis (PCA) reduces the size of the data set without losing information about the samples (Figure 2), and KNN classifies samples using previously known clusters [2].
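The pre-processing step can be illustrated with a minimal sketch. It assumes autoscaling (mean-centering each column and dividing by its sample standard deviation), which is one common choice and not necessarily the one the authors used. The function name `autoscale` and the data are hypothetical.

```python
from statistics import mean, stdev

def autoscale(matrix):
    """Mean-center each column and divide it by its sample standard deviation."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in matrix
    ]

# Hypothetical data: 3 samples (rows) x 2 variables (columns) in different units.
data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = autoscale(data)
print(scaled)
```

After scaling, both columns span the same range even though the raw values differ by two orders of magnitude, so neither variable dominates a distance calculation.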

Pattern recognition
In analytical chemistry, once a data set is available, it is important to find similarities and differences between samples based on the measurements. The method is chosen according to the information available about the samples, and can be unsupervised (HCA and PCA) or supervised (KNN).

Unsupervised methods
This group comprises two methods: Hierarchical Cluster Analysis (HCA) and Principal Components Analysis (PCA). The goal is to evaluate whether there is any clustering in the data set without using the class membership of the samples.

Hierarchical Cluster Analysis (HCA)
Hierarchical Cluster Analysis is a technique that evaluates the distances between the samples and groups them in a plot called a dendrogram. These distances can be calculated in different ways, for example as the Euclidean, Mahalanobis or Manhattan distance. The Euclidean distance is given by equation 1, the Mahalanobis distance by equation 2 and the Manhattan distance by equation 3:
$$ d_{XY} = \sqrt{\sum_{n} (X_n - Y_n)^2} \tag{1} $$
where $X_n$ and $Y_n$ are the coordinates of samples X and Y in the nth dimension of row space;
$$ d_{ij} = \sqrt{(X_i - Y_j)^{T} C^{-1} (X_i - Y_j)} \tag{2} $$
where $X_i$ and $Y_j$ are column vectors for objects i and j, respectively, and C is the covariance matrix;
$$ d_{XY} = \sum_{i} |X_i - Y_i| \tag{3} $$
where $X_i$ and $Y_i$ are the ith coordinates of the vectors X and Y.
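The three distances named above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the Mahalanobis distance is computed by solving the linear system C z = (x - y) with a small Gaussian elimination helper instead of inverting C explicitly.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - known) / aug[r][r]
    return z

def mahalanobis(x, y, cov):
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)  # z = C^-1 (x - y), without forming the inverse
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))

# Hypothetical two-dimensional samples and covariance matrix.
x, y = [0.0, 0.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean(x, y), manhattan(x, y), mahalanobis(x, y, identity))
```

With the identity matrix as covariance, the Mahalanobis distance reduces to the Euclidean one. A non-trivial covariance matrix rescales each direction by its variance.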
Once the distances have been estimated, the dendrogram can be plotted. A general dendrogram is shown below (Figure 3). It displays the samples (letters) and the distances (numbers). Samples belonging to cluster A are at a distance of 0.2 from one another, while sample B is at a distance of 0.5 from cluster A. The distance values change according to the distance measure used in the calculation.
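How a dendrogram is built from the distances can be sketched as follows. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance. The source does not state which linkage rule was used, and the sample names and coordinates are hypothetical.

```python
import math

def single_linkage(points):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = {name: [name] for name in points}
    merges = []
    while len(clusters) > 1:
        best = None
        names = sorted(clusters)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                # Single linkage: distance between the two closest members.
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((a, b, d))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges

# Hypothetical samples in a two-variable space.
samples = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (5.0, 0.0)}
merges = single_linkage(samples)
print(merges)
```

Each entry of `merges` corresponds to one junction of the dendrogram, and the recorded distance is the height at which the two branches join.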

Principal Components Analysis (PCA)
Principal Components Analysis (PCA) aims to evaluate the distances between the points using few axes in row space. Each row of the data matrix is a point in the plot (Figure 2), and the aim is to study the relationship between these samples in order to find their similarities and differences. This general example uses two principal components (PC1 and PC2). The first PC (PC1) lies along the direction in which the points are most spread out and therefore describes the maximum amount of variance, while PC2 describes the largest part of the remaining variance. The percentages of variance described by the PCs should add up to close to 100%. Another property of the PCs concerns their position: they are always perpendicular to one another.
PCA can also be used to define which variables are most important in a process. This analysis uses the factors (columns of the matrix) and the objects (rows of the matrix). Loadings are used to determine which variables are most important for the process, and scores are used to study the relationship between objects.
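The properties described above (variance ordered by component, percentages adding up to 100%, perpendicular axes) can be checked on a small example. The sketch below is limited to two variables so that the covariance matrix can be diagonalized in closed form. The function name `pca_2d` and the data are hypothetical, and real applications would use a numerical library.

```python
import math

def pca_2d(rows):
    """PCA of a two-column data matrix via the closed-form
    eigen-decomposition of its 2x2 covariance matrix."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    centered = [(r[0] - mx, r[1] - my) for r in rows]
    a = sum(u * u for u, _ in centered) / (n - 1)  # variance of variable 1
    c = sum(v * v for _, v in centered) / (n - 1)  # variance of variable 2
    b = sum(u * v for u, v in centered) / (n - 1)  # covariance
    half_gap = math.hypot((a - c) / 2, b)
    eigenvalues = [(a + c) / 2 + half_gap, (a + c) / 2 - half_gap]
    theta = 0.5 * math.atan2(2 * b, a - c)  # angle of the PC1 axis
    loadings = [(math.cos(theta), math.sin(theta)),    # PC1
                (-math.sin(theta), math.cos(theta))]   # PC2, perpendicular to PC1
    scores = [tuple(u * lx + v * ly for lx, ly in loadings) for u, v in centered]
    explained = [ev / (a + c) for ev in eigenvalues]
    return scores, loadings, explained

# Hypothetical data: 4 objects (rows) x 2 variables (columns).
data = [(2.0, 1.0), (3.0, 3.0), (4.0, 2.0), (5.0, 6.0)]
scores, loadings, explained = pca_2d(data)
print(loadings, explained)
```

The rows of `scores` place each object on the new axes, and `loadings` show how much each original variable contributes to each PC.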

Supervised methods
Supervised methods are used to construct a model from samples of known class membership, which is then applied to future samples. In this group, KNN is a widely used technique.

K-Nearest Neighbor (KNN)
The KNN technique uses samples or clusters of known class to identify other samples or clusters. It requires the distances between them to be calculated, using for example the Euclidean, Mahalanobis or Manhattan distance. The minimum distances are found and the object is assigned to the corresponding class. The classification depends on the number of objects in each class.
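A minimal KNN sketch follows, assuming Euclidean distance and a simple majority vote among the k nearest training samples. The function name, the value k = 3 and the training data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(sample, training, k=3):
    """training: list of (coordinates, class_label) pairs."""
    neighbours = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two known classes.
training = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(knn_classify((0.3, 0.2), training))
print(knn_classify((4.9, 5.0), training))
```

Because the vote counts neighbours, a class with many more training objects than the others can dominate the neighbourhood, which is why the classification depends on the number of objects in each class.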

The QSAR principle: Hansch analysis
The development of new drugs is a continuous challenge, given the countless diseases that lack an adequate pharmaceutical approach. Modern medicinal chemists are especially concerned with methods based upon rational and quantitative procedures, aiming to focus on potentially efficient candidates. In that context, chemometric methods are very important in quantitative structure-activity relationship (QSAR) studies, which presuppose that the biological activity (BA), measured through a biological response (BR), is related to the chemical structure (CS):
$$ BA = f(CS) \tag{4} $$
The first attempt to relate chemical structure quantitatively to chemical behavior in a series of structurally related compounds dates back to the 1940s, with Hammett [3] who, studying meta- and para-substituted benzoic acids at 25°C, established linear relationships between the ionization constant of the R = X substituted benzoic acid ($K_X$) and the ionization constant of the non-substituted benzoic acid (R = H, $K_H$):
$$ \sigma_X = \log K_X - \log K_H \tag{5} $$
The σ constant is group-specific and represents the electronic effect (of inductive and resonance type) exerted by the R group. In 1964, Corwin Hansch [4] combined the electronic constants with the lipophilic parameter (π), which represents the contribution of each R group to the overall lipophilicity:
$$ \pi_X = \log P_X - \log P_H \tag{6} $$
where $P_X$ is the octanol-water partition coefficient of the X-substituted compound and $P_H$ is the partition coefficient of the non-substituted compound. Thus, a QSAR equation involves some kind of BR, for example the negative logarithm of the minimal inhibitory concentration (MIC) for a series of antimicrobial compounds, -log(MIC), and the electronic (σ) and lipophilic (π) effects of the R groups that distinguish the members of the series. It can be expressed as
$$ -\log(\mathrm{MIC}) = a\sigma + b\pi + c \tag{7} $$
where a, b and c are the multiple regression coefficients.
Hansch's hypothesis that the BR may be related to physico-chemical properties specific to each substituent present in the basic skeleton of a congeneric series with similar BA led to the proposition of numerous descriptors of different kinds, useful for identifying the principal effects involved in drug action.
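The quantities used in a Hansch analysis can be sketched numerically. The sketch assumes the usual definitions σ = log K_X - log K_H and π = log P_X - log P_H, and a linear Hansch equation -log(MIC) = aσ + bπ + c. All function names, constants and coefficients below are illustrative placeholders, not measured values.

```python
import math

def hammett_sigma(k_x, k_h):
    """Electronic constant from the ionization constants of the
    substituted (k_x) and non-substituted (k_h) benzoic acids."""
    return math.log10(k_x / k_h)

def hansch_pi(p_x, p_h):
    """Lipophilic constant from octanol-water partition coefficients."""
    return math.log10(p_x) - math.log10(p_h)

def hansch_response(sigma, pi, a, b, c):
    """Predicted biological response, e.g. -log(MIC)."""
    return a * sigma + b * pi + c

# Illustrative placeholder values only.
sigma = hammett_sigma(k_x=1.0e-3, k_h=1.0e-4)
pi = hansch_pi(p_x=100.0, p_h=10.0)
print(sigma, pi, hansch_response(sigma, pi, a=0.5, b=0.3, c=2.0))
```

In practice the coefficients a, b and c are not chosen by hand but fitted by multiple regression over the whole series of compounds.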

Physico-chemical descriptors
Several physico-chemical descriptors are useful in QSAR studies. They can be divided into categories: constitutional, topological, stereochemical and electronic descriptors, besides the so-called indicator variables.

Constitutional descriptors
This kind of descriptor is related to the presence of structural characteristics that can affect the BA, such as the number of unsaturated bonds, the number of hydrogen-bond donors, the average ring size, etc.

Topological descriptors
These descriptors represent shape and connectivity, such as branching, spacer groups, unsaturations, etc. The Kier [5] and Wiener [6] descriptors are typical.

Steric (or stereochemical) descriptors
Steric descriptors describe effects related to the size of chemical groups and to steric hindrance. The Taft steric descriptor, Es [7], is a common example.

Electronic descriptors
These variables are related to molecular electron densities and are usually calculated by quantum methods. Examples include dipole moments, atomic partial charges, the energy of the highest occupied molecular orbital (HOMO) and the energy of the lowest unoccupied molecular orbital (LUMO).

Indicator variable and Taylor analysis
Indicator variables are a useful way to convert qualitative information into quantitative information, such as the occurrence of some structural feature: the variable is set to 1 when the feature is present and to 0 otherwise. The Taylor QSAR approach [8] employs indicator variables.
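The 0/1 encoding can be sketched as follows. The feature names and compounds are hypothetical and serve only to show how qualitative structural information becomes rows of a descriptor matrix.

```python
def indicator_row(features_present, feature_list):
    """Return 1 for each listed feature the compound has, 0 otherwise."""
    return [1 if feature in features_present else 0 for feature in feature_list]

# Hypothetical structural features and compounds.
features = ["nitro", "hydroxyl", "halogen"]
compounds = {
    "cmpd_1": {"nitro"},
    "cmpd_2": {"hydroxyl", "halogen"},
    "cmpd_3": set(),
}
matrix = {name: indicator_row(present, features)
          for name, present in compounds.items()}
print(matrix)
```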

Chemometric methods applied to drug design
Chemometric statistical methods find a large field of application in QSAR, since multivariate problems are inherent to it.

Discriminatory and classificatory methods
These methods aim to group and classify compounds and variables into classes or categories that share resemblances. They are very useful in pattern recognition and in reducing the dimensionality of complex systems.

Principal Component Analysis (PCA)
Principal component (PC) methods aim to combine correlated variables by projecting them into a new coordinate system, so that fewer variables are obtained, without any intercorrelation. The original coordinates are projected onto a new axis system in which the variability of the system is maximum along PC1 and decreases along the other axes (PC2, PC3...), all of them orthogonal to each other, which allows one to deal with just the first components (usually PC1, PC2 and PC3). Thus, from a multi-variable universe, commonly multicollinear, one can obtain a simpler system with almost the same amount of information. Naming X the data matrix, with dimension I×J (I molecules and J descriptors), a PCA generates two matrices, T and L, so that
$$ X = TL \tag{8} $$
The matrix T is the score matrix and represents the position of the compounds in a new coordinate system in which the components are the axes, and L is the loading matrix.
Plotting the PCs instead of the original descriptors, one obtains groups governed by the similarities among the data.

Hierarchical Cluster Analysis (HCA)
This analysis is also useful for the classification of compounds, allowing patterns and clusters to be distinguished visually. The plot, which resembles a tree and is called a dendrogram, presents similar compounds on the same branches. Those branches are plotted based upon a similarity matrix, S, each element of which is given by the similarity index between two samples k and l, $S_{kl}$:
$$ S_{kl} = 1 - \frac{d_{kl}}{d_{\max}} \tag{9} $$
In this expression, $d_{kl}$ is the Euclidean distance between k and l, and $d_{\max}$ is the maximum distance. Ferreira [9] describes a PCA/HCA analysis for a series of 25 1,4-naphthoquinones with antitumour activity. Using electronic descriptors, it was possible to distinguish active from inactive compounds (Figure 4). The loading values indicate that the presence of high-density groups in side-chain and terminal positions favours activity. The same profile arises from the dendrogram analysis.
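The similarity matrix can be sketched as below. It assumes the usual form of the similarity index, S_kl = 1 - d_kl / d_max, with Euclidean distances, so that identical samples have S = 1 and the two most distant samples have S = 0. The function name and coordinates are hypothetical.

```python
import math

def similarity_matrix(points):
    """Similarity index S_kl = 1 - d_kl / d_max for every pair of samples."""
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist)
    return [[1.0 - d / d_max for d in row] for row in dist]

# Hypothetical samples described by two descriptors.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
S = similarity_matrix(points)
print(S)
```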

Multivariate regression
To construct a QSAR equation of the Hansch type, it is necessary to adopt some kind of multivariate fitting method in order to correlate the descriptors with the BR. The main methods are multilinear regression (MLR), principal component regression (PCR) and partial least squares (PLS).

Multilinear regression (MLR)
The objective of this method is to obtain a relationship between the BR and a number of descriptors limited to 1/5 of the number of compounds, as an equation of the form:
$$ BR = \sum_{i} (a_i \pm \varepsilon_i) D_i + \varepsilon \tag{10} $$
in which $a_i$ are the regression coefficients, $D_i$ are the descriptors, $\varepsilon_i$ are the confidence intervals of the coefficients and $\varepsilon$ is the independent term. Statistical validation of the model is very important, and it requires consistency in the units of the descriptors $D_i$ as well as in the magnitude of their values. Statistical parameters such as the fitting coefficient (r), the sample standard deviation (s), the cross-validation coefficient (q²) and the Fisher test (F) are used in this task. MLR is quite sensitive to multicollinearity: intercorrelated variables (typically with r² > 0.6) must not be used together. This is a common problem in multi-descriptor systems, and it may be dealt with by other regression methods.
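The two rules of thumb stated above (at most one descriptor per five compounds, and no pair of descriptors with r² > 0.6) can be checked before fitting. The sketch below is a hypothetical helper, not part of any QSAR package, and the descriptor values are illustrative.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation coefficient between two descriptors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def check_descriptors(descriptors, n_compounds, r2_limit=0.6):
    """descriptors: dict mapping a descriptor name to its list of values."""
    problems = []
    if len(descriptors) > n_compounds / 5:
        problems.append("too many descriptors for the number of compounds")
    names = list(descriptors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson_r2(descriptors[a], descriptors[b]) > r2_limit:
                problems.append(f"{a} and {b} are intercorrelated")
    return problems

# Illustrative descriptor values for 10 hypothetical compounds.
logp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mw = [2.0 * value for value in logp]          # perfectly correlated with logp
dipole = [1.0, -1.0] * 5                      # nearly uncorrelated with logp
print(check_descriptors({"logP": logp, "dipole": dipole}, n_compounds=10))
print(check_descriptors({"logP": logp, "mw": mw}, n_compounds=10))
```

Descriptor pairs flagged here are candidates for removal, or for treatment with PCR or PLS instead of MLR.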

Principal component regression (PCR)
In order to avoid multicollinearity, it is possible to carry out the regression not with the descriptors themselves but with their principal components (PCs), generated in a PCA treatment. The main advantage of this approach is the assurance that all variables are independent and uncorrelated, although it is necessary to analyze the loading matrix (L). In this kind of regression, the variables are defined so as to maximize the variance of the descriptor matrix, without forcing a correlation with the BR.

Partial least squares (PLS)
As in PCR, PCs are employed, but in this case they are built to capture the maximum variability of the BR matrix, so that each component of the loading matrix (L) is a good predictor of each component of the BR matrix. This is the most widely used regression method, and it is suitable for 3D-QSAR problems, in which a set of previously aligned compounds is placed within a grid of points where the interaction with a molecular probe is calculated. The energy at each point is a variable in the QSAR equation, and these variables are in turn correlated with the BR to produce a three-dimensional profile of the critical sites that favour or disfavour the interaction with a hypothetical biological receptor.

Design of experiments
The search for new sources of energy such as biodiesel, and for their production processes, is of great importance today. Factorial design is an important tool for reducing search time, waste of reagents and hence operating costs [10]. A factorial design is performed to determine which experimental variables, and which interactions between variables, have a significant influence on the responses of interest [11]. After the significant variables have been selected, the experimental methodology and the influence of each variable on the yield of the reaction are evaluated with a statistical experimental design of the full factorial type, in which the independent variables are the nature and concentration of the catalyst, the temperature and the molar ratio between alcohol and oil, and the dependent variable is the yield of esters produced. The variables that were not selected must be kept fixed throughout the experiment [12]. In a subsequent step, a design must be chosen that estimates the effects of the different variables with a reduced number of experiments. In the screening study, the main effects of the variables and their second-order interactions are usually obtained from full or fractional factorial designs. These experiments identify the best experimental conditions, as well as the simultaneous effects that influence the yield of the reaction, and are therefore extremely important for understanding the behavior of the system [13]. Values of p less than or equal to 0.05 indicate that the factors variable (1), variable (2), variable (3), variable (4) and their interactions are statistically significant at the 95% confidence level. These parameters, evaluated at a low level (-1) and a high level (+1), affect the process significantly in a positive or negative manner. Figure 6 shows the profile of the Pareto chart [7].
Multivariate optimization consists of a preliminary assessment of the experimental variables (fractional factorial design) followed by a response surface methodology (central composite design), built from the screening of the variables that may affect the synthesis of biodiesel. The generated model and the set of significant effects can be evaluated through the response surfaces shown in Figures 7 and 8, together with their influence on the response, i.e. the yield of the reaction; the dark area indicates the conditions under which the process has the highest yield. Thus, statistical analysis is an important tool to evaluate, select and propose new technological routes, whether through the choice of raw materials or through the evaluation of the process parameters that most influence the transesterification reaction used to obtain biofuels.
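The way effects are estimated from a two-level full factorial design can be sketched as follows. The sketch uses a 2² design with coded levels -1 and +1, and hypothetical yields; the factor assignments in the comments are illustrative and the numbers are not taken from the study described above.

```python
from itertools import product

def full_factorial(n_factors):
    """All runs of a two-level design, with levels coded as -1 and +1."""
    return list(product((-1, 1), repeat=n_factors))

def main_effects(design, responses):
    """Effect of each factor: mean response at +1 minus mean response at -1."""
    half = len(design) / 2
    return [sum(run[j] * y for run, y in zip(design, responses)) / half
            for j in range(len(design[0]))]

def interaction_effect(design, responses, i, j):
    """Two-factor interaction effect between factors i and j."""
    half = len(design) / 2
    return sum(run[i] * run[j] * y for run, y in zip(design, responses)) / half

design = full_factorial(2)          # e.g. factor 0: temperature, factor 1: molar ratio
yields = [60.0, 70.0, 80.0, 90.0]   # hypothetical ester yields (%), one per run
print(design)
print(main_effects(design, yields), interaction_effect(design, yields, 0, 1))
```

Ranking the absolute values of these effects is what a Pareto chart displays. Judging their significance additionally requires an estimate of the experimental error, for example from replicate runs.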

Conclusion of chapter
This chapter aimed to show the versatility of chemometric tools in several areas. Applications of chemometric theory were shown in drug design and natural products chemistry, but chemometrics is not limited to these areas. We hope to have broadened the reader's view of the range of chemometrics.