Construction of Analytical Curve Fit Models for Simvastatin using Ordinary and Weighted Least Squares Methods

Analytical methods require adequate curve-fitting models to express reliability. The ordinary or weighted least squares methods (OLSM or WLSM, respectively) were used to determine the most adequate mathematical model for the analytical curve, starting from the simplest (linear) model and proceeding to the quadratic one. The normality and homoscedasticity of the model residuals were evaluated. Analytical curves were built by injecting 1, 5, 10, 15 and 20 μL of simvastatin 40 μg mL⁻¹ (40, 200, 400, 600 and 800 ng) or 10 μL of simvastatin 4, 20, 40, 60 and 80 μg mL⁻¹, using high-performance liquid chromatography with diode array detection (λ 238 nm). The best models were the linear and the quadratic, observed for the mass and concentration datasets, respectively. In the working range considered, WLSM proved more appropriate than OLSM. The different behaviors point to the need for a judicious choice of the most adequate model to express the analytical curve and to ensure the reliability of the method used.


Introduction
[...][3][4][5][6][7][8][9] Pharmaceutical companies must comply with these requirements to ensure the efficacy, safety and quality of drugs [10]. The parameters to be evaluated during a method validation depend on the assay purpose [2,6,7,11]. [...][14][15] The analytical curve expresses the relationship between the concentrations of the analyte and the detected responses within a range as a monotonic mathematical function (linear or non-linear) [16]. Linearity is the ability of the analytical procedure to obtain results directly proportional (or proportional by means of well-defined mathematical transformations) to the concentrations of the analyte in a specified range [2,6,14]. Until recently, the terms analytical curve and linearity have been used confusingly because the function that expresses the response depends on the method [12,15,17]. However, when these methods cover a wide dynamic range, more complex or weighted models (quadratic, logarithmic, etc.) may be required [12,16]. Regardless of the curve behavior, linear regression by the ordinary least squares method (OLSM) is the statistical method most often applied to analytical procedures [18]. Nevertheless, OLSM has been used indiscriminately, without evaluating the model and the assumptions related to its residuals [19]. OLSM requires the treatment of outliers, for instance by the Jackknife test, as well as the verification of the assumptions of normality, homoscedasticity and independence of the residuals by the Ryan-Joiner or Jarque-Bera tests, the Levene test as modified by Brown-Forsythe or the Cook-Weisberg test, and the Durbin-Watson test, respectively [16,18,20]. At least two hypotheses should be satisfied: the normality of the response at every concentration level and the homogeneity of the variances of the responses (homoscedasticity) over the concentration interval [16].
When a linear (non-weighted) model is not adequate for the selected range, the function of the analytical curve should be adjusted by testing mathematical models using either transformations of the responses or the weighted least squares method (WLSM) [12,13]. WLSM is recommended when heteroscedastic data are found, for instance, in a wide analytical range [13,14]. Regardless of the least squares method, a minimum of five concentration levels, in triplicate, is recommended to determine the linearity function, since the uncertainty varies with the number of replicates [7,15]. [...][33][34] The adequacy of the equation that represents the analytical curve is the most important way to assure low uncertainty in UV analytical measurements [35]. Thus, in this work, different strategies were employed to define the best analytical curve for simvastatin (SIM) quantitation. Beginning with the simplest linear model fitted by OLSM and moving to more complex models fitted by WLSM, different injection volumes [36][37][38] or different concentrations were evaluated by HPLC-UV.

Standard solutions
SIM RS stock solution at 200 μg mL⁻¹ was prepared by accurately weighing 20 mg of SIM in a 100 mL volumetric flask, followed by dilution in methanol (50 mL), sonication for 10 min and addition of the same solvent to complete the volume. Aliquots of 0.5, 2.0, 5.0, 7.5 and 10.0 mL were transferred from a precise burette to volumetric flasks (25 mL) in order to obtain SIM standard solutions at 4, 20, 40, 60 and 80 μg mL⁻¹ (n = 3) in methanol. All solutions were filtered before the injections (n = 3) in the chromatograph.

Liquid chromatography
Separation was performed in a non-endcapped RP-8 column (LiChroCART® 250-4 LiChrospher® 100, 5 μm, Merck, Darmstadt, Germany) maintained at 30 °C, using methanol:0.1% phosphoric acid (80:20 v/v) as mobile phase at a flow rate of 1.5 mL min⁻¹. The backpressure was kept at about 125 bar. The UV detector was set at λ 238 nm. [41]

Analytical curves
SIM standard solutions (in triplicate), either in variable injection volumes of 1, 5, 10, 15 and 20 μL from a 40 μg mL⁻¹ solution in methanol or in five concentrations, were used to build the analytical curves (on two days or one day, respectively) in the ranges of 40-800 ng and 4-80 μg mL⁻¹ (independent variable, X) vs. SIM chromatographic peak areas (dependent variable, Y, expressed in mAU) [7,15].

Vol. 24, No. 9, 2013

Statistical analysis
Data analyses were performed using the R statistical environment and its functions [42]. Increasingly complex models were proposed, starting from the simplest straight-line model fitted by OLSM (lm(y ~ x)) and then by WLSM (lm(y ~ x, weights = w)) [18]. The quadratic model is given by equation 1:

$$ y_i = b_0 + b_L x_i + b_Q x_i^2 + \varepsilon_i, \quad i = 1, \ldots, n_T \qquad (1) $$

where $n_T$ is the total number of observations used to estimate the regression coefficients $b_0$, $b_L$ and $b_Q$ (for the respective models), and $\varepsilon_i$ is the model error. The linear model corresponds to $b_Q = 0$.
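The linear and quadratic fits described above can be illustrated with a minimal sketch in Python (standard library only). It is not the authors' R script: the helper names `solve` and `fit_poly`, the synthetic noise-free data and the 1/x² weights are assumptions chosen only to show how the weighted normal equations reduce to OLSM when all weights are equal.

```python
def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def fit_poly(x, y, degree, weights=None):
    """Least-squares polynomial fit; returns [b0, bL] (degree 1) or [b0, bL, bQ] (degree 2).

    weights=None gives every observation weight 1, i.e. ordinary least squares.
    """
    if weights is None:
        weights = [1.0] * len(x)
    k = degree + 1
    # Weighted normal equations (X' W X) b = X' W y for design columns 1, x, x^2
    xtwx = [[sum(w * xi ** (r + c) for w, xi in zip(weights, x)) for c in range(k)]
            for r in range(k)]
    xtwy = [sum(w * yi * xi ** r for w, xi, yi in zip(weights, x, y)) for r in range(k)]
    return solve(xtwx, xtwy)


# Hypothetical, noise-free data on the concentration scale (not taken from Table 1)
conc = [4.0, 20.0, 40.0, 60.0, 80.0]
area = [1.0 + 2.0 * c + 0.01 * c ** 2 for c in conc]
w = [1.0 / c ** 2 for c in conc]  # assumed weights, for illustration only

ols_quadratic = fit_poly(conc, area, degree=2)
wls_quadratic = fit_poly(conc, area, degree=2, weights=w)
```

With exact data both fits recover the generating coefficients; with real replicate data the two estimates differ, and the weighted fit gives less influence to the levels with larger variance.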
The outliers were evaluated using the standardized (also called studentized) residuals (rstudent()). Observations whose residuals were greater than 3.0 were considered outliers and then removed from the dataset (limited to 22% of the dataset) [18,20,43]. The Shapiro-Wilk (shapiro.test()) and Levene (leveneTest()) tests were used to check the assumptions of the model error related to normality and homoscedasticity, respectively [16,18,19]. In addition, the normal quantile-quantile plot (Q-Q plot, qqnorm()) and the Bartlett test (bartlett.test()) were also used to verify normality and homoscedasticity, respectively [19,20]. The goodness-of-fit of the model was evaluated by the coefficient of determination (R²) and the mean quadratic error of prediction (MEP) [20], and by observing the plot of residuals vs. mass or concentration [13,18]. The predictions for the masses (or concentrations), which are obtained from the chromatographic responses through the model equation, were evaluated using the mean relative error (MRE) defined in equation 2:

$$ \mathrm{MRE} = \frac{1}{n_T} \sum_{i=1}^{n_T} \frac{\left| x_i - \hat{x}_{(i)} \right|}{x_i} \qquad (2) $$

where $x_i$ is the i-th observation for the mass (or concentration) and $\hat{x}_{(i)}$ is the prediction for $x_i$ using the model adjusted to all data except the i-th observation. Similar to MEP, the idea of using MRE is to have a global measure of the prediction error (in relative terms) of the model. Measurements similar to MRE are also used to evaluate the goodness-of-fit of regression models [44]. The R script to adjust and check the quadratic model in equation 1 for a single day, using WLSM, is presented in Figure S1 in the Supplementary Information (SI) section.
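The leave-one-out idea behind MRE can be sketched as follows for a straight-line calibration. This is an illustrative Python sketch, not the script in Figure S1: the function names, the use of the absolute relative error expressed as a fraction, and the example data are assumptions.

```python
def fit_line(x, y, w):
    """Weighted least-squares straight line; returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def mean_relative_error(x, y, w):
    """Leave-one-out mean relative error of the back-calculated x values."""
    n = len(x)
    total = 0.0
    for i in range(n):
        # Refit the curve without the i-th observation
        b0, bl = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:], w[:i] + w[i + 1:])
        x_hat = (y[i] - b0) / bl  # inverse prediction from the response
        total += abs(x[i] - x_hat) / x[i]
    return total / n


# Hypothetical masses (ng) and peak areas; equal weights reproduce the OLSM case
masses = [40.0, 200.0, 400.0, 600.0, 800.0]
areas = [2.0 + 2.3 * m for m in masses]
areas_perturbed = areas[:2] + [areas[2] + 10.0] + areas[3:]
unit_w = [1.0] * len(masses)

mre_exact = mean_relative_error(masses, areas, unit_w)
mre_perturbed = mean_relative_error(masses, areas_perturbed, unit_w)
```

A perfectly linear dataset gives an MRE close to zero, while a single perturbed response raises it, which is why MRE serves as a global measure of prediction error.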
Parallelism for SIM mass obtained in the two-day dataset was verified by adding two terms to the linear form of equation 1, resulting in equation 3.

Results
The data of the analytical curves obtained by injecting variable volumes (on days 1 and 2) or a fixed volume with different concentrations (on one day), in terms of the mean, relative standard deviation (RSD in %) and variance (s²), are shown in Table 1. The residual plots for SIM masses (ng) and SIM concentrations (μg mL⁻¹) after adjusting the ordinary linear model to the dataset are depicted in Figure 1.
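The per-level summary statistics reported in Table 1 (mean, RSD and variance) can be computed as in the following Python sketch. The triplicate peak areas are hypothetical values chosen for illustration, and `summarize_level` is an assumed helper name.

```python
from statistics import mean, variance


def summarize_level(replicates):
    """Return (mean, RSD in %, sample variance) for the replicates of one level."""
    m = mean(replicates)
    s2 = variance(replicates)  # sample variance, n - 1 in the denominator
    rsd = 100.0 * s2 ** 0.5 / m
    return m, rsd, s2


# Hypothetical triplicate peak areas (mAU) for a single level
level_mean, level_rsd, level_var = summarize_level([915.0, 918.0, 921.0])
```

Comparing the variance across levels in this way is what reveals whether the spread of the responses grows with the mean response.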
The regression models to determine SIM (mass or concentration) calculated by OLSM or WLSM (after outlier removal, if necessary) that showed statistically significant results are presented in Table 2. Q-Q plots for the residuals of the non-weighted and weighted models are shown in Figure 2.
A scatter plot of the variance and mean response calculated at each level of SIM mass (or concentration) using the data in Table 1 is shown in Figure 3. The residual plots for SIM masses (ng) after adjusting the weighted linear model to the dataset obtained for both days, and for SIM concentrations (μg mL⁻¹) after adjusting the weighted quadratic model, are depicted in Figure 4.

Discussion
The identification and removal of the outliers are important because they inflate the estimated variance, increase the probability of type II error (accepting the null hypothesis as true when it is false), influence the significance tests for the model parameters and frequently cause violation of the assumption of constant error variance [18]. The simple linear model was initially proposed for the SIM dataset assayed by HPLC-UV (Table 1). Studentized residual plots allowed the identification of one outlier (point 5 in bold, Table 1), which represented 6.6% of the dataset obtained by the injection of variable volumes of 40 μg mL⁻¹ SIM on day 1.
The studentized residual plots in Figure 1 tended to exhibit a conical behavior for either SIM mass or concentration, as described for SIM bioanalytical procedures [22]. This behavior was clearly evidenced by the increase in residual variance as the level of SIM mass (or concentration) increased, as can be seen in Table 1. Since the response was positively correlated with SIM mass (or concentration), the residual variance increased when the response mean increased, which violated the assumption of homoscedasticity.
After the removal of outliers (if necessary), the datasets adjusted to linear models yielded R² values greater than 0.99986 and RSD values less than 0.85%, which met the adequacy criteria (R² > 0.999 and RSD < 1.0%) [10,11]. The p-values obtained by the Shapiro-Wilk and Levene tests (> 0.06 and > 0.18, respectively) did not reveal problems of normality and homoscedasticity when the linear model was applied to the three datasets. However, problems were detected by the Bartlett test for the homogeneity of variances in the datasets regarding SIM mass for day 1 and SIM concentrations (Table 2). The Q-Q plots of the linear models showed points outside the confidence interval (in black) for all datasets, as in Figures 2a, 2b and 2c. Thus, the linear models could not meet the OLSM assumptions, which required other models to be fitted to the datasets.
The quadratic model calculated by OLSM was not statistically significant (p-value ≥ 0.53) when applied to the datasets obtained either from SIM mass for day 2 (after removal of one outlier) or from SIM concentrations. Although the values of R² (> 0.99995) and RSD (< 0.54%) met the adequacy criteria (greater than 0.999 and less than 1.0%, respectively), the quadratic model calculated by OLSM was statistically inconclusive (p-value ≥ 0.057) when applied to the dataset obtained from SIM mass for day 1 (after removal of one outlier) [10,11]. Furthermore, the quality of the method was inadequate due to the lack of normality and homoscedasticity of the model errors, as verified by the p-values of the Shapiro-Wilk (0.01) and Bartlett (0.03) tests, respectively. The normality problem was confirmed by the corresponding Q-Q plot, which was similar to that in Figure 2a.
Regardless of the applied model (linear or quadratic) and the dataset, there was at least one problem related to the quality of the method (homoscedasticity and/or normality). These variance problems are well-known reasons for not using OLSM [18,19]. A possible solution to these problems is to use weights in the estimation of the model coefficients; WLSM is advised for heteroscedastic cases [18]. [...][25][26] Alternatively, the variance of the responses can be leveled by transforming the response, for instance, by using the Box-Cox method [21]. Since the calculation of weights is a very important step for the success of the model adjustment, the relationship between the variance and mean of the responses for each SIM mass and concentration level was examined using scatter plots for all datasets. The aim was to find an appropriate weighting factor to make the residual variance more homogeneous.
In this case, the weights were defined as the inverse of the variance of the observations at the given level of SIM mass (or concentration). Thus, observations taken at levels with larger variances have smaller weights.
Since the variance of the responses is related to their mean, one can adjust a linear model to estimate the variance of the responses through the mean of the responses. The relationship between the variances and means was clearly nonlinear (Figure 3). Residual variance seemed to be an exponential function of the mean of the responses for all datasets, to a greater (day 1, SIM mass and SIM concentration) or lesser (day 2, SIM mass) degree. Therefore, the relationship between the natural logarithm of the variance and the mean would be linear, and as such, a linear regression model was adjusted to estimate the variances and weights. The dependent and independent variables were the natural logarithm of the variance of the responses and the mean of the responses, respectively. Hence, a weight was estimated for each level of SIM mass (or concentration). These weights were assigned to the three observations of each level of SIM mass (or concentration). Therefore, in order to correct the problems of the homogeneity of variance, the reciprocal of the variance of the responses was used as a weighting factor for the linear and quadratic models. It was calculated as depicted in the R script (Figure S3 in the SI section).

Figure 3. Estimates of the response variance vs. estimates of the mean response at each level of SIM masses (for days 1 and 2) and SIM concentrations (single day). Chromatographic conditions as in Table 1.

Figure 4. Residual plots for SIM (a) masses (40-800 ng) for two days and (b) concentrations (4-80 μg mL⁻¹), after adjusting the weighted linear and quadratic models, respectively. Chromatographic conditions as in Table 1.
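The weight estimation described in the text (regressing the natural logarithm of the level variance on the level mean and taking the reciprocal of the fitted variance) can be sketched in Python as follows. This is an illustrative reading of the procedure, not the algorithm in Figure S3; the function names and the example level means and variances are assumptions.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def estimate_weights(level_means, level_variances, n_replicates=3):
    """Estimate one weight per level and repeat it for each replicate of that level."""
    log_var = [math.log(v) for v in level_variances]  # dependent variable
    a, b = fit_line(level_means, log_var)             # ln(variance) = a + b * mean
    fitted_var = [math.exp(a + b * m) for m in level_means]
    level_weights = [1.0 / v for v in fitted_var]     # reciprocal of the fitted variance
    return [w for w in level_weights for _ in range(n_replicates)]


# Hypothetical level means and variances following an exact exponential trend
means = [100.0, 500.0, 1000.0, 1500.0, 2000.0]
variances = [math.exp(0.5 + 0.002 * m) for m in means]
weights = estimate_weights(means, variances)
```

Using the fitted variance rather than the raw sample variance of each triplicate smooths the weights, since a variance estimated from only three observations is itself very noisy.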
Initially, a weighted linear model was adjusted to the three datasets and no outlier was observed by the treatment of the studentized residuals (Table 2). The removal of outliers is not an ideal condition, since it further reduces the already scarce degrees of freedom of the sum of squared errors. The residual plots of the weighted linear models did not exhibit any conical behavior; instead, they showed a random distribution around zero without any trend for the SIM concentration dataset (data not shown) and for the SIM mass datasets for days 1 and 2 (Figure 4a). All models showed more adequate values of R² (> 0.99991) and RSD (< 0.18%) than those obtained with the respective non-weighted linear models (Table 2), thereby indicating the quality of the model [10,11,18]. For the datasets obtained from SIM mass, the p-values of the Shapiro-Wilk and Levene tests were greater than 0.17 and 0.88, respectively, which were also more appropriate than those observed when the linear model was adjusted using OLSM. The p-values of the Bartlett test confirmed the adequate homogeneity of the variance for all datasets. Q-Q plots showed an adequate distribution of the residuals. The results were better than those observed when the simplest linear model was adjusted to the datasets obtained from SIM mass, as in Figures 2d and 2e.
Lastly, the fitting of the weighted quadratic model was statistically significant (p-value 2.48 × 10⁻¹³) only for the dataset obtained from SIM concentration. The residual plot showed only minor discrepancies in the variances, suggesting homogeneity, as in Figure 4b. This result was confirmed by the adequate p-values obtained from the Levene (0.92) and Bartlett (0.72) tests. The best values of R² and RSD were observed using the weighted quadratic model (Table 2). The method also exhibited appropriate normality as evaluated by the Shapiro-Wilk test and the Q-Q plot, as shown in Figure 2f.
It is worth noting that Q-Q plots for the residuals of non-weighted models revealed problems in the assumption of normality, while the Q-Q plots for the weighted models (right column) did not show any such problems.
Further, the models using WLSM showed smaller MRE values than those obtained using OLSM (Table 2).This means that the errors in the predictions of mass (or concentration) were smaller when using WLSM than when using OLSM to estimate the regression equation.
When the weighted linear equations applied to SIM mass determination on days 1 and 2 were compared by ANOVA, no statistical difference was found between the beta coefficients, which means that parallelism was attested (Figure S2). Thus, a single weighted linear curve expressed by y = 1.885 + 2.285x, with R² = 0.99996 and RSD = 0.12%, could be used to determine SIM mass. This dataset also showed adequate normality (Shapiro-Wilk p-value 0.59) and homoscedasticity (Levene and Bartlett p-values 0.36 and 0.81, respectively).
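As a worked illustration of how the pooled curve y = 1.885 + 2.285x would be used, the following Python sketch converts between SIM mass and peak area. The coefficients are those reported above; the function names are assumptions, and x and y are taken to be mass in ng and peak area in mAU, as stated in the experimental section.

```python
B0 = 1.885  # intercept of the pooled weighted linear curve
BL = 2.285  # slope of the pooled weighted linear curve


def predict_area(mass_ng):
    """Peak area predicted by the calibration curve for a given SIM mass."""
    return B0 + BL * mass_ng


def mass_from_area(area):
    """Back-calculate SIM mass from a measured peak area (inverse prediction)."""
    return (area - B0) / BL


area_400 = predict_area(400.0)       # approximately 915.885
mass_back = mass_from_area(area_400)  # approximately 400 ng
```

The inverse step is the one applied in routine quantitation, and it is also the quantity whose leave-one-out error is summarized by MRE.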
In this study, the use of mass or concentration in the construction of analytical curves seemed to influence the model, since linear and quadratic fittings were observed, respectively. Regardless of whether mass or concentration was used, WLSM was adequate for curve fitting, unlike OLSM, due to the wide interval employed (which varied between 10 and 200% of the analytical range), as it had been intentionally designed for future drug dissolution studies [13,14]. Furthermore, the suitability of WLSM had already been suggested by the removal of fewer outliers in comparison with OLSM. The difference in the models can be explained by the error associated with the standard dilutions. This error was more relevant when working with distinct concentrations, as in the present case expressed by the quadratic fitting. Under variable volumes, the dilution error was less significant as it was replaced by the injection error, as exemplified by the linear fitting.

Conclusions
As long as good laboratory practice and validation criteria are fulfilled, the most adequate model to express the relationship between mass or concentration and its response should be selected considering the assumptions of the least squares method related to the variance of the residuals. The possibility that mass and concentration influence the models differently needs to be noted. Generally, the simplest linear model adjusted by OLSM should preferably be adopted. However, when OLSM is not adequate to express the relationship between the variables, the source of heteroscedasticity should be investigated before leveling the response variances by WLSM. The question then arises as to what alternative can be used for the adjustment of the model. In this case, the choice of an appropriate weighting factor is a better alternative than using response transformations (inversion, square root, square inversion, etc.), since these can make the model more difficult both to interpret and to apply.

$$ y_i = b_0 + b_L x_i + b_D D_i + b_{DL} D_i x_i + \varepsilon_i \qquad (3) $$

where $D$ is an indicator for the day of observation (if the observation is done on day 1, $D = 0$; if done on day 2, $D = 1$); for the quadratic model, the terms $b_Q x_i^2$ and $b_{DQ} D_i x_i^2$ are also included. Taking $b_D = 0$ and $b_{DL} = 0$ in the linear model means that the curves are the same for both days. If only $b_{DL} = 0$, the curves are parallel but have different intercepts. For the quadratic model, the curves are the same if $b_D = b_{DL} = b_{DQ} = 0$ and are parallel if $b_{DL} = b_{DQ} = 0$. The R script used to adjust and check the linear version of equation 3 with WLSM is depicted in Figure S2 (in the SI section). The results were considered statistically significant if the corresponding p-values were less than 0.05. The weights were calculated according to the algorithm shown in Figure S3 (in the SI section).
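To make the role of the day indicator explicit, the linear version of equation 3 can be written out for each day. The block below assumes that this linear version reads $y = b_0 + b_L x + b_D D + b_{DL} D x + \varepsilon$, which is consistent with the coefficient names used above but is a reconstruction rather than a quotation.

```latex
\begin{align*}
\text{Day 1 } (D = 0):\quad & y = b_0 + b_L\,x + \varepsilon \\
\text{Day 2 } (D = 1):\quad & y = (b_0 + b_D) + (b_L + b_{DL})\,x + \varepsilon
\end{align*}
```

Under this reading, $b_D$ is the difference between the intercepts and $b_{DL}$ is the difference between the slopes of the two days, so testing $b_{DL} = 0$ is a test of parallelism and testing $b_D = b_{DL} = 0$ is a test of coincidence of the two curves.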

Figure 2. SIM residual Q-Q plots by HPLC-UV injection of variable volumes (n = 5) on day 1 and on day 2, and for a fixed volume (10 μL) of five concentrations on day 1, obtained from OLSM-adjusted linear models (a, b and c), and obtained from WLSM-linear (d and e) and WLSM-quadratic (f) models. Points outside the confidence limits are in black. Chromatographic conditions as in Table 1.

Table 1. Results of peak area (Y) as a function of the variable (mass) or fixed (concentration) volumes of SIM injected in the chromatographic systemᵃ
ᵃ Chromatographic conditions: RP-8 (250 × 4 mm, 5 μm) column, 30 °C, λ 238 nm, methanol:0.1% phosphoric acid (80:20 v/v), 1.5 mL min⁻¹; ᵇ each value represents the average of three injections; ᶜ injection of 10 μL; RSD: relative standard deviation; s²: variance.

Figure 1. Studentized residual plots for SIM (a) mass (40-800 ng) and (b) concentrations (4-80 μg mL⁻¹) after adjusting the linear model using OLSM. Chromatographic conditions as in Table 1. The filled black circle is an outlier.

Table 2. Results obtained for linear and quadratic regression models calculated by OLSM or WLSM for SIM determination using HPLCᵃ
ᵃ Chromatographic conditions as in Table 1; ᵇ W: weights calculated as in Figure S3 (in the SI section); ᶜ MEP: mean quadratic error of prediction; ᵈ MRE: mean relative error; ᵉ N out: number of outliers; ᶠ statistically significant if less than 0.05.