Classification of Tablets containing Dipyrone, Caffeine and Orphenadrine by Near Infrared Spectroscopy and Chemometric Tools

The aim of this study was to classify samples of tablets containing dipyrone, caffeine and orphenadrine using near infrared spectroscopy and chemometric techniques. The data set comprised 300 spectra of samples, with three tablets per batch and four different manufacturers. Preprocessing was performed with the Savitzky-Golay algorithm using the first derivative, a 17-point window and a second-order polynomial. The tablets were classified using chemometric models based on principal component analysis (PCA), soft independent modeling of class analogies (SIMCA), genetic algorithm-linear discriminant analysis (GA-LDA) and successive projections algorithm-linear discriminant analysis (SPA-LDA). In the PCA analysis, clusters were observed for each set of tablets. For the SIMCA model, 15 and 30 spectral measurements were used for the training sets of the similar and reference drugs, respectively. For the GA-LDA model, 12 variables were used, whereas the SPA-LDA model selected only two wavelengths, 1572 and 1933 nm. The models correctly classified all samples. The methodology allowed rapid and non-destructive classification of the samples, without the need for conventional analytical determinations.


Introduction
Counterfeiting of medicines is part of a broader process involving the distribution of drugs that do not meet the standards of quality, safety and efficacy. According to the World Health Organization (WHO), spurious/falsely-labeled/falsified/counterfeit (SFFC) medicines are those wrongly labeled, deliberately or misleadingly, with respect to their identity or source. Counterfeiting can affect reference, similar and generic products, and may include products with correct, incorrect, insufficient or missing active ingredients and/or with fake packaging. 1-3 In most developed countries with effective systems of regulation and market control (i.e., United States, Australia, Canada, Japan, New Zealand and most European Union countries), the incidence of SFFC drugs is low, less than 1% of the market, according to estimates from the countries concerned. 3,4 In contrast, the highest incidence occurs in regions where the regulatory and supervisory systems are weak. In some developing countries, SFFC drugs reach an alarming 25% of the local market, which represents about 10% of the global pharmaceutical market. 4,5 Due to the high amount of fraud and the risk posed by these drugs, pharmaceutical products (medicines, cosmetics and related products) have been subjected to safety requirements and quality assurance through technical regulations set by government authorities. These regulations are supported by voluntary technical activities that contribute to the quality of products, such as the ISO 9001 standard. 5 In Brazil, the inspection agencies have seized SFFC drugs, among them contraceptives, antibiotics and painkillers, most often containing dipyrone in their composition. 6 These drugs continue to be subject to forgery because of their popularity and acceptance, reflected in high levels of marketing and consumption.
According to information publicly provided by the Boehringer Ingelheim Company, three main substances in the painkiller class, which are different molecules but share the purpose of pain relief, hold 95% of the Brazilian market. Drugs with dipyrone lead with 39%, followed by paracetamol with 30% and aspirin with 26% of the market.
According to Santos et al., 7 in samples with two or more active ingredients present in a single formulation, the quantification must be performed by high performance liquid chromatography (HPLC) with UV detection. Although accurate and precise, this official method is also laborious and expensive: it often requires pre-treatment of samples, ultra-pure reagents and specialized operators, it degrades the sample, and it produces organic waste harmful to the environment. 15 However, numerous other analytical techniques have been proposed for the analysis of drugs, among which is near infrared (NIR) spectroscopy, 16-20 a rapid and non-destructive technique based on the absorption of electromagnetic radiation between 14000 and 4000 cm-1 (780 to 2500 nm).
Multichannel analytical techniques such as NIR spectroscopy can be adequately exploited using multivariate analysis, which extracts as much information as possible from the data sets. In this context, pattern recognition techniques combined with NIR spectroscopy have been reported for the development of screening methodologies for quality control of various matrices such as fuel, 20-23 drink 24 and food. 25 Soft independent modeling of class analogies (SIMCA) 26 is a well-known supervised pattern recognition method that uses principal component analysis (PCA) to model the hyperspace of each class. PCA compresses a large data set by concentrating the variance in a few variables called principal components: for a set of k objects measured on j sensors, generating the matrix X (k × j), PCA decomposes X into the product of two matrices of lower dimensionality, T (k × A, scores) and L (j × A, loadings), so that X ≈ TLᵀ. The new variables in T have the advantage of being mutually orthogonal, and A is the number of new variables considered significant for the model of each class. New samples are classified by means of an F-test at a given significance level.
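The score/loading decomposition described above can be illustrated with a minimal sketch. It uses the NIPALS iteration, a common way to compute PCA in chemometrics, written in plain Python; this is an assumption made for illustration, not necessarily the routine used by the software in this study. The function name `nipals_pca` and the small synthetic matrix are hypothetical.

```python
import math

def nipals_pca(X, n_components, tol=1e-20, max_iter=1000):
    """Return (T, L, mean) such that the centered X is approximately T L^T.

    X is a list of k rows (objects) with j columns (sensors).
    Assumes the centered data have rank >= n_components.
    """
    k, j = len(X), len(X[0])
    mean = [sum(row[c] for row in X) / k for c in range(j)]
    E = [[row[c] - mean[c] for c in range(j)] for row in X]  # residual matrix
    T = [[0.0] * n_components for _ in range(k)]  # scores, k x A
    L = [[0.0] * n_components for _ in range(j)]  # loadings, j x A
    for a in range(n_components):
        # Start from the residual column with the largest sum of squares.
        col = max(range(j), key=lambda c: sum(E[r][c] ** 2 for r in range(k)))
        t = [E[r][col] for r in range(k)]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(E[r][c] * t[r] for r in range(k)) / tt for c in range(j)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]  # unit-length loading vector
            t_new = [sum(E[r][c] * p[c] for c in range(j)) for r in range(k)]
            diff = sum((u - v) ** 2 for u, v in zip(t_new, t))
            t = t_new
            if diff <= tol * sum(v * v for v in t):
                break
        for r in range(k):
            T[r][a] = t[r]
            for c in range(j):
                E[r][c] -= t[r] * p[c]  # deflate: remove the fitted component
        for c in range(j):
            L[c][a] = p[c]
    return T, L, mean

# Synthetic 5 x 3 matrix built from two orthogonal latent factors (not real spectra).
a = [-2.0, -1.0, 0.0, 1.0, 2.0]
b = [1.0, -1.0, 0.0, -1.0, 1.0]
X = [[10 + 3 * 0.6 * ai, 20 + 3 * 0.8 * ai, 30 + bi] for ai, bi in zip(a, b)]
T, L, mean = nipals_pca(X, 2)
```

With two components the synthetic matrix is reconstructed from `mean`, `T` and `L`, and the two score columns are orthogonal, which is the property the text refers to.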
The orthogonality of the new variables allows all the spectral information to be used in the construction of the SIMCA model, which is therefore known as a full-spectrum method. This characteristic permits the detection of anomalous samples, or outliers, present in the data set. 26,27 However, when the full spectrum is employed in the construction of mathematical models, many variables are redundant and/or non-informative, and their inclusion may affect the quality of the final model. Currently, a successful alternative to overcome this drawback is the use of variable selection techniques. 28-31
Among the various variable selection techniques, the genetic algorithm (GA), proposed in the 1960s by John H. Holland, is one of the most widespread. 32 The GA is mathematically based on the mechanisms of Charles R. Darwin's theory of natural selection and is used to optimize complex systems, 33,34 seeking to replicate the mechanism of biological evolution and to exploit its advantages.
Araújo et al. 35 proposed the successive projections algorithm (SPA) for variable selection in multiple linear regression (MLR). SPA is a forward selection algorithm, with the restriction that the variable selected in each iteration is the least collinear with the variables already selected. Pontes et al. 36 proposed a modification of SPA so that it could be coupled with linear discriminant analysis (LDA) for variable selection in classification problems; it has shown satisfactory performance in the classification of various matrices such as beer 37 and soils, 38 and in quantitative structure-activity relationship (QSAR) modeling. 39
Thus, the aim of this work was to develop a simple method for the classification of drugs containing dipyrone, orphenadrine and caffeine by identifying and grouping them by manufacturer, since replacing a more expensive product with a cheaper one is, in many cases, a clear case of falsification. This study used NIR spectroscopy combined with chemometric techniques for exploratory and classification analysis, with variable selection techniques.
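The "least collinear" restriction of SPA mentioned above can be sketched as a chain of orthogonal projections. The code below is a simplified illustration only: it builds a single chain from one starting variable, whereas the published SPA and SPA-LDA procedures also evaluate the candidate chains with a cost function on a validation set. The function name `spa_chain` and the toy matrix are hypothetical.

```python
def spa_chain(X, start, n_select):
    """Build one SPA chain of variable indices.

    X is a list of rows (samples x variables); columns are assumed to be
    already centered and to have non-zero norm. At each step, every remaining
    column is projected onto the subspace orthogonal to the last selected
    column, and the column with the largest remaining norm is selected.
    """
    n_rows, n_cols = len(X), len(X[0])
    cols = [[X[r][c] for r in range(n_rows)] for c in range(n_cols)]  # working copies
    selected = [start]
    for _ in range(n_select - 1):
        ref = cols[selected[-1]]
        ref_sq = sum(v * v for v in ref)
        best, best_norm = None, -1.0
        for c in range(n_cols):
            if c in selected:
                continue
            coef = sum(u * v for u, v in zip(cols[c], ref)) / ref_sq
            cols[c] = [u - coef * v for u, v in zip(cols[c], ref)]  # remove the collinear part
            norm = sum(v * v for v in cols[c])
            if norm > best_norm:
                best, best_norm = c, norm
        selected.append(best)
    return selected

# Toy data: column 1 is almost collinear with column 0; column 2 is orthogonal to it.
X_toy = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]
chain = spa_chain(X_toy, start=0, n_select=3)
```

In the toy matrix, the nearly collinear column 1 is passed over in favor of column 2 at the first step, which is the behavior the restriction is meant to enforce.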

Samples
Tablets containing dipyrone (300 mg), caffeine (50 mg) and orphenadrine (35 mg) were acquired in pharmacies in the states of Ceará and Paraíba (Brazil). The samples belong to four different brands, one reference (R) and three similar (S1, S2 and S3), with different excipients (Table 1) and manufacturing processes. Three tablets per batch were analyzed, from 20 batches of the reference brand and 10 batches of each similar brand.

NIR spectra acquisition
Diffuse reflectance spectral measurements were performed, without any previous sample treatment or use of chemical reagents, using the XDS Rapid Content™ Analyzer (FOSS), with 0.5 nm spectral resolution, equipped with a holographic grating and Si and PbS detection systems. Sample spectra were obtained on both sides of each tablet in the spectral range from 400 to 2500 nm.

Chemometric study
The spectra were first restricted, by a priori selection, to the interval between 1100 and 2500 nm as the working spectral region. To remove noise and correct the baseline, the spectra were then treated using the Savitzky-Golay algorithm 40 with first derivative, a window of 17 points and a second-order polynomial in the Unscrambler 9.8 software.
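The Savitzky-Golay settings given above (first derivative, 17-point window, second-order polynomial) can be illustrated with a minimal plain-Python sketch. It relies on the fact that, for a symmetric window and a quadratic fit, the first-derivative coefficient at the window center is i / Σi². It is not the Unscrambler implementation: the function name is hypothetical, and the choice to drop the eight points at each edge is an assumption, since the edge handling used in the study is not stated.

```python
def savgol_first_derivative(y, half_window=8, step=1.0):
    """First-derivative Savitzky-Golay filter with a quadratic local fit.

    The window has 2 * half_window + 1 points (17 by default). `step` is the
    spacing between consecutive points (e.g. 0.5 for a 0.5 nm grid). Points
    closer than `half_window` to either edge are dropped (an assumption).
    """
    m = half_window
    denom = sum(i * i for i in range(-m, m + 1)) * step
    out = []
    for k in range(m, len(y) - m):
        # Least-squares slope at the window center.
        out.append(sum(i * y[k + i] for i in range(-m, m + 1)) / denom)
    return out

# Check on a synthetic parabola, for which the filter is exact: d(x^2)/dx = 2x.
xs = [0.5 * k for k in range(40)]
ys = [x * x for x in xs]
deriv = savgol_first_derivative(ys, half_window=8, step=0.5)
```

Where SciPy is available, `scipy.signal.savgol_filter(y, window_length=17, polyorder=2, deriv=1, delta=0.5)` should give the same values away from the edges; only the edge treatment differs.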
The training set was used to obtain the model parameters, and the validation set was used to choose the best number of PCs for each class in the SIMCA model. In the GA-LDA and SPA-LDA models, the validation set was used to guide the variable selection, a strategy to avoid overfitting. 42 GA-LDA was used to select variables employing the G function as the cost function. The mutation and reproduction probabilities were kept constant at 10 and 60%, respectively. The initial population was 100 individuals, with 50 generations. SPA-LDA was used under the standard conditions, as previously described. 36,37

Results and Discussion
Figure 1a shows the raw diffuse reflectance spectra of the 150 samples, each the average of two spectra (one from each side of the same tablet), in the range 1100 to 2500 nm, obtained at 0.5 nm resolution. The baseline variation was corrected using the Savitzky-Golay filter with first derivative, a window of 17 points and a second-order polynomial. The preprocessed spectra are shown in Figure 1b.

Exploratory data analysis
An unsupervised pattern recognition study was conducted using PCA, which was used to evaluate the discriminating power of the spectra with respect to the drug manufacturers (similar or reference drugs). Figure 2a shows the scores of PC1 versus PC2 for the preprocessed NIR spectra. The cumulative variance in the first two PCs is 91%, and a separation with no overlap of the classes of drugs addressed in this case study can be observed.
Figure 2a shows that the S2, S3 and R classes are discriminated in PC1, although the S1 and S2 classes overlap. In Figure 2b, three wavelengths (1650, 1934 and 2139 nm) were observed to be the most informative in PC1.
The wavelength 1650 nm can be associated with the first overtone of aromatics, 31 certainly due to functional groups of the active ingredients. At 1934 nm, transitions such as the second overtone of carbonyl, OH of water, or RCO2H, RC2HR' and CONH2 groups take place. 31 The transitions at 2139 nm are associated with ROH and combination bands of CONH2(R). 31 The excipient compositions of the S2, S3 and R drugs differ by at least one compound, which explains why these classes do not overlap in PC1. However, all excipients in S1 are also present in S2 (Table 1); this similarity in composition is the probable reason for the overlap between S1 and S2 in PC1.
On the other hand, S2 presents more excipients than S1, and the differences are sodium starch glycolate, silicon dioxide, disodium edetate, lactose and sodium metabisulfite. The presence of these excipients explains the non-overlapping between S1 and S2 in PC2 (Figure 2a).
The PC2 loadings (gray dashed line in Figure 2b) present two more significant wavelengths: 1573 nm, which lies in the region of the first aromatic overtone, and 1396 nm, which carries information about the first overtone transitions of CH3, CH2 and CH, ArOH, ROH, H2O and NH. These wavelengths make it possible to discriminate between the S1 and S2 drugs.

SIMCA classification
After the data partition, a supervised pattern recognition study was performed using the SIMCA technique at the 75, 95 and 99% levels of statistical significance, using the full spectral range and test set validation. All samples were correctly classified at the three levels employed. The SIMCA models of R, S1 and S2 were constructed with 4 PCs and that of S3 with 5 PCs.
The results show the potential of NIR spectrometry to identify counterfeit drugs. However, the use of wide spectral ranges, as in this case, makes data modeling computationally expensive. Therefore, the possibility of obtaining results similar to those of the full spectrum using a representative subset of variables was investigated, in order to make the final models more parsimonious and interpretable.

GA-LDA classification
Using the genetic algorithm coupled with the LDA model, 12 variables were selected among the 2784 available; these are highlighted on the average spectrum of all samples (Figure 3a).
The variables selected by GA are spread throughout the spectrum, in regions such as the second overtone of CH around 1200 nm, the first overtone of CH and SH, and the CH combination region around 2200 nm.
Using the 12 selected wavelengths, Fisher scores were obtained for all samples of the data set (Figure 3b). The Fisher scores are linear combinations of the selected variables. The coefficient vectors of these linear combinations, obtained with the training set samples, minimize the intra-class variance and maximize the inter-class variance. 43 The classes appear even more homogeneous, and no misclassification was obtained using only the 12 wavelengths selected by GA in the LDA modeling.

SPA-LDA classification
Figure 4 shows the scree plot associated with the variable selection by SPA-LDA, whose cost function minimum was obtained with only two wavelengths. The wavelengths selected by SPA-LDA were 1572 and 1933 nm (Figure 5a). Around 1572 nm, the ArCH transition and the first overtone of NH and OH take place, 31 and a good inter-class separation can be observed with this variable alone (Figure 5b). The inter-class discrimination increases in the bivariate projection of 1572 versus 1933 nm; at 1933 nm, transitions of the second overtone of carbonyl, OH in water, and RCO2H, RC2HR' and CONH2 groups occur. 31 Ethanol (OH transition) and povidone (NH transition), present in the similar drugs and absent in the reference drug, certainly contribute to the discrimination between the samples. Similar drug S2 appears far from S1 and S3 in the subspace defined by the variables selected by the SPA-LDA algorithm, reflecting its greater complexity in terms of excipient composition (Table 1). On the other hand, the presence of sodium starch glycolate only in S2 and R brings these two classes closer together (Figure 5b). The S1 and S3 drugs are similar in terms of excipient composition and so appear as neighboring groups in Figure 5b. The difference between these drugs is the presence of talc in S3 and cellulose in S1.
These characteristics of the only two variables selected by SPA were sufficient to discriminate the groups of drugs in this study, yielding simple models with 100% correct classification.
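To make the two-variable discrimination described above more concrete, the following is a minimal sketch of an LDA classifier on two variables, written in plain Python. It uses the standard LDA decision rule (pooled within-class covariance, minimum Mahalanobis distance to the class means, equal priors), which is an assumption about the details rather than the exact implementation used in this study. The function names are hypothetical, and the numeric values are synthetic placeholders, not measured derivative intensities at 1572 and 1933 nm.

```python
def fit_lda_2d(samples, labels):
    """Estimate class means and the inverse pooled covariance for 2-D data."""
    classes = sorted(set(labels))
    means = {}
    sxx = sxy = syy = 0.0
    for c in classes:
        pts = [s for s, l in zip(samples, labels) if l == c]
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        means[c] = (mx, my)
        for x, y in pts:  # accumulate within-class scatter
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = len(samples) - len(classes)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    # Inverse of [[sxx, sxy], [sxy, syy]] stored as (a, b, d) for [[a, b], [b, d]].
    inv = (syy / det, -sxy / det, sxx / det)
    return means, inv

def classify(point, means, inv):
    """Assign the class whose mean is closest in Mahalanobis distance."""
    a, b, d = inv

    def dist2(c):
        dx, dy = point[0] - means[c][0], point[1] - means[c][1]
        return a * dx * dx + 2 * b * dx * dy + d * dy * dy

    return min(means, key=dist2)

# Synthetic two-variable training data for two hypothetical classes.
train = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (3.0, 2.0), (3.1, 2.2), (2.9, 1.9)]
train_labels = ["R", "R", "R", "S1", "S1", "S1"]
means, inv = fit_lda_2d(train, train_labels)
predicted = [classify(p, means, inv) for p in train]
```

A new sample would be classified with `classify((x1, x2), means, inv)`, where `x1` and `x2` stand for the preprocessed responses at the two selected wavelengths.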

Conclusions
A classification method based on the modeling of NIR spectra with PCA, SIMCA and LDA with SPA and GA variable selection allowed a successful differentiation of groups of the same drug belonging to different brands. Thus, this method can be applied to the identification of drugs containing multiple active ingredients in a single medicine, with use in the pharmaceutical industry as well as in the agencies that combat drug counterfeiting.
The use of the GA-LDA and SPA-LDA methods allowed a homogeneous visualization of the classes using only 12 and 2 variables, respectively, with 100% correct classification, giving results similar to those obtained with the full spectrum.