STRUCTURE-AFFINITY MODELING OF AZO DYE ADSORPTION ON CELLULOSE FIBRE BY MLR

Quantitative structure-affinity relationships were derived by multiple linear regression (MLR) analysis for a series of 21 monoazo dyes. Calculated 0D, 1D and 2D structural dye features were correlated to the dye affinity for cellulose. Variable selection was performed by a genetic algorithm. Good correlations with dye affinity and models with predictive power were obtained. Electrostatic interactions are favorable and hydrophobic interactions unfavorable for dye binding on cellulose.


INTRODUCTION
Several computational methods have been employed in the study of textile dye adsorption on cellulose fibre [1,2].
The classical QSAR methods rely principally on the mathematical technique of multiple linear regression (MLR). This allows an easy interpretation of the results, especially when the fibre affinities of the dye molecules are related to simple and clearly defined physico-chemical parameters, but it carries some risk of chance correlation. This risk can be reduced by introducing several criteria during variable selection. The number of parameters potentially important for the dye-fibre interaction can be large, which leads to the use of multivariate statistical methods such as principal component analysis, principal component regression or projection to latent structures (PLS). These methods successfully handle large matrices of predictor variables, although sometimes at the cost of clarity and of physical and chemical interpretability.
This paper presents a quantitative structure-affinity relationship study for a series of azo dyes by the multiple linear regression (MLR) method. Structural dye features obtained by molecular modeling techniques were correlated to the dye affinity for cellulose. Variable selection was performed by a genetic algorithm and several MLR models were obtained. They give information on the mechanism of dye adsorption on the fibre.

METHODS AND MATERIALS
Molecular descriptors
A series of 21 dyes was considered, with the affinity for cellulose fibre, taken from the literature [3-5], as the dependent variable (see table 1).
The molecular dye structures were built with the ChemOffice package [6] and energetically optimized by molecular mechanics calculations. The optimized structures were then used to derive structural dye descriptors. Several types of 0D, 1D and 2D descriptors were calculated with the Dragon software [7]: constitutional descriptors (e.g. MW, molecular weight; AMW, average molecular weight; Mp, mean atomic polarizability (scaled on carbon atom); Me, mean atomic Sanderson electronegativity (scaled on carbon atom); Ss, sum of Kier-Hall electrotopological states; nS, number of sulfur atoms; SCBO, sum of conventional bond orders (H-depleted)), functional group counts (e.g. nCp, number of terminal primary C(sp3) atoms; nHBonds, number of intramolecular H-bonds (with nitrogen, oxygen, fluorine); nThiazoles, number of thiazoles; nSO2OH, number of sulfonic (thio-/dithio-) acids) and molecular properties (e.g. ALOGP, Ghose-Crippen octanol-water partition coefficient; TPSA(Tot), topological polar surface area using nitrogen, oxygen, sulphur and phosphorus polar contributions). The descriptors included in the final MLR models are presented in table 2.

Multiple Linear Regression (MLR)
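Multiple linear regression relates one experimental variable $y_k$ to one or several structural variables $x_i$ by the equation [8]:

$$y_k = b_0 + \sum_i b_i \cdot x_{ik} + e_k \qquad (1)$$

where $b$ represents the regression coefficients and $e$ the deviations and residuals. MLR calculations were performed with the STATISTICA package [9].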
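As an illustration of the MLR model, in which $y_k$ is the sum of an intercept $b_0$, the weighted descriptors $b_i \cdot x_{ik}$ and a residual $e_k$, the least-squares estimation of the coefficients can be sketched in a few lines of Python. This is a minimal sketch assuming NumPy; the function name `fit_mlr` and the synthetic data are illustrative assumptions and reproduce neither the STATISTICA calculations nor the dye data of table 1.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least-squares fit of y_k = b_0 + sum_i b_i * x_ik + e_k.

    X has one row per compound and one column per descriptor.
    Returns the intercept b_0, the coefficient vector b and the residuals e.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first fitted coefficient is b_0.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef  # e_k in the MLR equation
    return coef[0], coef[1:], residuals


# Synthetic, noise-free illustration (hypothetical values, not the dye data).
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
y_demo = 2.0 + 0.5 * X_demo[:, 0] - 1.5 * X_demo[:, 1]
b0, b, e = fit_mlr(X_demo, y_demo)
```

Because the synthetic response is an exact linear function of the two columns, the fit should recover the generating coefficients and leave residuals that vanish to numerical precision.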

Model validation
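In order to test the predictive power of the model, the following statistical measures were used [10]: 1) the correlation coefficient R between the predicted and observed activities; 2) the coefficients of determination for linear regressions with intercepts set to zero, i.e. $R_0^2$ (predicted versus observed activities) and $R_0'^2$ (observed versus predicted activities); 3) the slopes k and k' of the above-mentioned two regression lines. For an acceptable predictive model, these quantities should satisfy the conditions of equations (2) to (6).
In addition to these criteria, $Q^2_{ext}$ values were calculated with the MobyDigs software [11] to test the predictive power of the model obtained from the training set compounds. The external validation technique uses a test set to perform a further check on the predictive capabilities of a model obtained from a training set and with predictive power optimized by an evaluation set. Using the selected model, the values of the response for the test objects are calculated, and the quality of these predictions is expressed in terms of $Q^2_{ext}$, which is defined as:

$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{ext}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{ext}} (y_i - \bar{y})^2}$$

where the sum runs over the test set objects ($n_{ext}$), and $y$, $\hat{y}$ and $\bar{y}$ are the experimental values, the predicted values and the average value of the training set responses, respectively.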
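The validation quantities named in this section (R, the two through-origin coefficients of determination, the slopes k and k', and the external Q2) can be computed as in the following sketch. It assumes NumPy, and it assumes that "predicted versus observed" means regressing the predicted values on the observed ones through the origin (and the reverse for the primed quantities); the function names are hypothetical. The acceptance thresholds of equations (2) to (6) are deliberately not encoded; only the quantities themselves are evaluated, on made-up numbers.

```python
import numpy as np


def origin_regression(a, b):
    """Regress a on b with the intercept fixed at zero.

    Returns the slope and the coefficient of determination of that line.
    """
    slope = np.sum(a * b) / np.sum(b * b)
    r0_sq = 1.0 - np.sum((a - slope * b) ** 2) / np.sum((a - a.mean()) ** 2)
    return slope, r0_sq


def external_validation(y_obs, y_pred, y_train_mean):
    """Validation statistics for a test set.

    y_obs, y_pred : experimental and predicted responses of the test objects
    y_train_mean  : average response of the training set
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    # Assumed convention: unprimed = predicted versus observed.
    k, r0_sq = origin_regression(y_pred, y_obs)
    # Assumed convention: primed = observed versus predicted.
    k_prime, r0_prime_sq = origin_regression(y_obs, y_pred)
    # External Q2: prediction error relative to deviations from the training mean.
    q2_ext = 1.0 - np.sum((y_pred - y_obs) ** 2) / np.sum((y_obs - y_train_mean) ** 2)
    return {"R": r, "R0_sq": r0_sq, "R0_prime_sq": r0_prime_sq,
            "k": k, "k_prime": k_prime, "Q2_ext": q2_ext}


# Hypothetical test-set values, for illustration only.
stats = external_validation(
    y_obs=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.1, 1.9, 3.2, 3.8],
    y_train_mean=2.5,
)
```

Note that the denominator of the external Q2 uses the training set mean, as stated in the definition above, so the statistic depends on how the training and test sets were split.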

RESULTS AND DISCUSSIONS
The series of 21 dyes was studied by molecular mechanics calculations, and the optimized structures thus derived were used to calculate the dye descriptors. A training set and a test set (see table 1) were considered. The test set compounds were selected by consulting the score scatter plots of the first three principal components (82.1 % of the variance explained) of the principal component analysis (PCA) model built from the matrix of all descriptor variables for the 21 analyzed compounds. We included in the test set one of two similar compounds (grouped together) positioned on opposite sides of the plot origin in the four quadrants of the respective plots. PCA was performed with the SIMCA-P+ software [12].
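The score plots used for this selection can be approximated outside SIMCA-P+ with a plain singular value decomposition. The sketch below assumes NumPy and autoscaled descriptors (mean-centred, unit variance), which is an assumption about the preprocessing and not a statement of what was done in the original work. The helper `nearest_neighbour_pairs` is a hypothetical, automated stand-in for the visual identification of similar compounds; it is not the procedure applied to the dye set, and the data are random placeholders.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """PCA scores and explained-variance fractions via SVD.

    Columns of X are autoscaled first (assumed preprocessing);
    a column with zero variance would need to be removed beforehand.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S ** 2 / np.sum(S ** 2)
    return scores, explained[:n_components]


def nearest_neighbour_pairs(scores, n_pairs=5):
    """Index pairs of the compounds lying closest together in score space."""
    dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)
    rows, cols = np.triu_indices(len(scores), k=1)
    order = np.argsort(dist[rows, cols])[:n_pairs]
    return [(int(rows[j]), int(cols[j])) for j in order]


# Placeholder descriptor matrix: 21 compounds x 6 descriptors (random numbers).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(21, 6))
scores, explained = pca_scores(X_demo, n_components=3)
pairs = nearest_neighbour_pairs(scores, n_pairs=5)
```

One member of each close pair could then be moved to the test set, which keeps the test compounds inside the descriptor space spanned by the training set.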
Variable selection was carried out by the genetic algorithm included in the MobyDigs program [11], using the RQK function [13] as fitness function. Leave-one-out cross-validation and bootstrapping techniques were used for the internal validation of the obtained MLR models. Two MLR models were found to be predictive; they are presented in table 3. The best correlation with dye affinity and the best statistical results were obtained for model 1.
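The idea of genetic-algorithm variable selection combined with leave-one-out cross-validation can be sketched as follows. This is not the MobyDigs implementation: the fitness used here is the plain leave-one-out Q2, swapped in for the RQK function, and the population size, mutation rate, number of generations and the limit `max_vars` are arbitrary assumptions. All function names are hypothetical and the data are synthetic; bootstrapping is not shown.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 of an MLR model with intercept."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        A = np.column_stack([np.ones(n - 1), X[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        pred = coef[0] + X[i] @ coef[1:]  # prediction for the left-out object
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1, max_vars=3, seed=0):
    """Tiny genetic algorithm over binary descriptor masks (fitness: Q2 LOO)."""
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]

    def fitness(mask):
        # Empty or oversized models are rejected outright.
        if not mask.any() or mask.sum() > max_vars:
            return -np.inf
        return q2_loo(X[:, mask], y)

    pop = rng.random((pop_size, n_var)) < 0.5
    for _ in range(generations):
        fit = np.array([fitness(m) for m in pop])
        pop = pop[np.argsort(fit)[::-1]]      # best individuals first
        children = [pop[0].copy()]            # elitism: keep the best mask
        while len(children) < pop_size:
            i, j = rng.integers(0, pop_size // 2, size=2)  # parents from better half
            cross = rng.random(n_var) < 0.5                # uniform crossover
            child = np.where(cross, pop[i], pop[j])
            flip = rng.random(n_var) < p_mut               # bit-flip mutation
            children.append(child ^ flip)
        pop = np.array(children)
    fit = np.array([fitness(m) for m in pop])
    best = int(np.argmax(fit))
    return pop[best], fit[best]


# Synthetic data: 21 objects, 6 descriptors, response driven by columns 0 and 3.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(21, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 3] + rng.normal(scale=0.1, size=21)
best_mask, best_q2 = ga_select(X_demo, y_demo)
```

Capping the model size is one simple guard against chance correlation with only 21 objects; a fitness function with built-in constraints, such as the RQK function cited above, addresses the same concern in a different way.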
The predictive power of the best MLR model was then checked against the criteria stated by A. Tropsha et al. [10] (see equations (2) to (6)). All the calculated criteria indicated a model with predictive power. In table 3, $Q^2_{ext}$ denotes the external $Q^2$ (for the test set), a(r2) and a(q2) the Y-scrambling parameters [14], F the Fisher test and s the standard deviation.
Hydrogen bonds between dye and cellulose are expected to have the highest contribution to the dye affinity. Dye sulfonic acid groups and dye hydrophobicity are detrimental to dye binding. The dye polar surface area decreases the dye affinity, which is probably related to the hydrophobic interactions at the dye surface-dyebath solution interface.

Table 1. The studied compounds and their affinities (A, in kJ/mole)

Table 2. Calculated dye descriptors: average molecular weight (AMW), number of terminal

Table 3. Final MLR models for the series of 21 dyes*