Geographical classification of Nanfeng mandarin by near infrared spectroscopy coupled with chemometrics methods

Xuan Zhang*, Yiping Du*‡, Peijin Tong*, Yuanlong Wei and Man Wang*
*Shanghai Key Laboratory of Functional Materials Chemistry and Research Center of Analysis and Test, East China University of Science and Technology, Meilong Rd 130, Shanghai, P. R. China 200237
Comprehensive Technology Center of Jiangxi Entry-Exit Inspection and Quarantine Bureau and Jiangxi Province Engineering Research Center of Infrared Spectroscopy Application, South Gan River Avenue 2666, Nanchang, Jiangxi Province, P. R. China 330038
yipingdu@ecust.edu.cn


Introduction
Nanfeng mandarin, native to Nanfeng county, Jiangxi province, is one of the famous and precious citrus varieties in China and a distinctive geographical-indication product with a very long cultivation history. It is an extremely popular product and an important internationally traded commodity of Nanfeng county, valued for its thin skin, soft and succulent pulp, sweet and sour taste, intense aroma, unique flavor and seedlessness. The Chinese government has therefore established a corresponding national standard on the protection of this citrus variety (see footnote a). Furthermore, owing to its excellent quality, it has been widely naturalized all over China, e.g., in Fujian, Guangxi and Hunan provinces. Nanfeng mandarin introduced to other places in China grows well and shows a similar appearance; however, there are differences in taste to some extent. Meanwhile, with the development of international trade and the improvement of people's quality of life, the requirements on product quality are higher than before. Therefore, with the aim of guaranteeing authenticity and protecting the consumer from fraudulent labeling of mandarin, a means of differentiating mandarins from different geographical locations must be devised.
In recent years, near infrared spectroscopy (NIRS), being fast, simple, cheap and nondestructive, has attracted considerable attention for qualitative and quantitative analyses in the food and agriculture industries. 1,2 For the purpose of quality control, NIRS data have been effectively combined with multivariate techniques such as cluster analysis (CA), 3 principal component analysis (PCA), 4 and artificial neural networks (ANN), 5 to identify the authenticity, producing area and similar varieties of the analyzed samples. For instance, many domestic and international scholars have used NIRS to identify different varieties of coffee, 6,7 juicy peach, 8 tea, 9 tobacco, 10 etc. Likewise, many researchers have reported the authentication and geographical classification of honey, 11 olive oils, 12,13 red wines, 14 cheese, 15 apple juice, 16 etc. In our group, 17 the feasibility of identifying Nanfeng mandarins from varied regions by NIRS and PCA has been studied, and good classification results were reported. However, the very small sample size and the absence of an independent prediction set in that study limited the applicability and robustness of the classification models. Moreover, the samples from different villages/towns in Nanfeng county, Jiangxi province, were not well separated when all samples from other provinces were taken into account.
The aim of the present study is to propose a strategy for developing an improved and reliable classification model for accurate geographical identification of Nanfeng mandarin, which is essential for assessing mandarin quality, based on NIR spectra. PCA was selected as the class-modeling method for this study, since it is a powerful data mining technique in the multivariate analysis of spectra and provides both numerical and graphical results. The changeable size moving window partial least squares (CSMWPLS) 18 algorithm was modified and coupled with PCA to construct a variable selection method called changeable size moving window principal component analysis (CSMWPCA), which was applied to these spectra in order to improve the performance of the classification models and reduce the size of the datasets in the calibration and validation process. To evaluate the effect of the wavelength selection technique on sample classification, the results obtained before and after feature selection were analyzed and compared.

Sample preparation
A total of 583 mandarin samples were harvested from seven different geographical origins: Fujian, Guangxi and Hunan provinces, and four different villages/towns in Nanfeng county, Jiangxi province, as shown in Table 1. To guarantee the representativeness of the mandarin samples and extend the applicability of the classification models, the samples of each category were collected from different orchards of the same area; within every orchard, samples at different heights and sunlight conditions were also considered.
a GB19051-2003.

Recording of NIR spectra
All NIR spectra of Nanfeng mandarin samples between 1000 and 1800 nm were obtained in diffuse reflectance mode using a SupNIR-1000 portable near infrared spectrometer (Focused Photonics Inc., Hangzhou, China) equipped with a tungsten halogen lamp light source and an InGaAs detector.
A fiber-optic diffuse reflectance probe was placed on the surface of the intact mandarin sample to collect spectra at ambient temperature (ca. 298 K) with 10 scans and a resolution of 1 nm. To reduce operating error, the probe was covered with tin foil to keep a beam diameter of 1 cm. Each mandarin was measured three times at three equally spaced equatorial positions, and the average spectrum of the three parallel measurements was used. Spectra were recorded in random order, and a reference spectrum was measured at 15 min intervals during the measurements.

Pretreatment of measured NIR spectra
Measured NIR spectra always comprise substantial information derived from sample attributes, as well as environmental and instrumental information that strongly affects the performance of the analysis system. In order to remove the scattering effect created by diffuse reflectance and to reduce baseline shifts, overlapping peaks and detrimental effects on the signal-to-noise ratio, multiplicative scatter correction (MSC), 19 standard normal variate (SNV), 20,21 Savitzky-Golay derivative 22 and their combinations were applied as spectral pretreatments before PCA.
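The three pretreatments named above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the authors' MATLAB programs: the function names, the use of the mean spectrum as the MSC reference, and the window width and polynomial order of the Savitzky-Golay filter are all assumptions, since the paper does not report these settings. Spectra are assumed to be stored one per row.

```python
from math import factorial

import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    # Assumption: the mean spectrum is the reference unless one is supplied.
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for k, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, 1)  # row ~ intercept + slope * ref
        corrected[k] = (row - intercept) / slope
    return corrected


def savgol_derivative(spectra, window=11, polyorder=2, deriv=2, delta=1.0):
    """Savitzky-Golay derivative via a local polynomial fit (odd window).

    Edge points without a full window are dropped, so each output row is
    shorter than the input by window - 1 points.
    """
    spectra = np.asarray(spectra, dtype=float)
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, polyorder + 1, increasing=True)  # columns 1, x, x^2, ...
    # Row `deriv` of the pseudo-inverse yields the fitted coefficient of x**deriv.
    weights = np.linalg.pinv(design)[deriv] * factorial(deriv) / delta ** deriv
    return np.array([np.correlate(row, weights, mode="valid") for row in spectra])
```

Combinations such as SNV followed by a derivative are obtained by chaining the calls, for example `savgol_derivative(snv(spectra))`.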

Principal component analysis
The classification of Nanfeng mandarin samples by geographical area requires a method that yields a positive identification, i.e., a sample should be classified as belonging to a class only if it is similar enough to that class. 23 In fact, the most commonly used methods for classification in chemometrics are visual dimension-reduction methods based on latent projections. PCA, 4,24 a widely used method of this kind that provides unsupervised visual classification, was adopted in this study. It converts each NIR spectrum vector into a single point in principal component (PC) space without losing the features of the data structure. Then, if the captured variance is relevant to chemical variations and sample classification, the scores of similar samples should cluster together on a graphical scores plot of PC 1 versus PC 2 or even PC 3.
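The projection of each spectrum onto a point in PC space can be illustrated with a short sketch. It assumes mean-centred PCA computed by singular value decomposition with NumPy; the function names `pca_fit` and `pca_scores` are hypothetical, and the paper does not state whether any further scaling was applied before PCA.

```python
import numpy as np


def pca_fit(X, n_components=3):
    """Fit PCA on mean-centred spectra (rows = samples, columns = wavelengths)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = Vt[:n_components]  # one row per principal component
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return mean, loadings, explained[:n_components]


def pca_scores(X, mean, loadings):
    """Project spectra onto the fitted components: one point per spectrum."""
    return (np.asarray(X, dtype=float) - mean) @ loadings.T
```

The first three columns of the returned scores are the coordinates that would be drawn in a three-dimensional scores plot, and new (validation) spectra can be projected with the mean and loadings fitted on the calibration set.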

Changeable size moving window partial least squares
NIR spectra often contain variables that are irrelevant for classification, which may lessen both the accuracy and robustness of the models. Variable selection can discard signals that are not useful for classification while retaining signals that carry information correlated with the sample groups, making the model simpler and giving better interpretation and lower measurement system costs. 25,26 Many variable selection methods have been reported. 27-32 CSMWPLS, 18 as one of them, is a strategy that searches for an optimized subregion of the spectrum that produces better results. The advantages of this method are that the window size is changeable and that the window moves through the whole spectral region with a fixed step. As shown in Fig. 1, the algorithm proceeds as follows: firstly, a spectral window that starts at the $i$th spectral channel and ends at the $(i + w - 1)$th spectral channel is constructed, where $w$ is the window size. Then the window is moved through the whole region with a step of 1. Thus, there are $(n - w + 1)$ windows over the whole spectrum, where $n$ is the total number of spectral channels, and for each window a calibration model is built with the corresponding subset of the spectral matrix X. Afterward, the window size is changed by an adjustable increment and the aforementioned procedure is repeated with the new window size.
After calculations for all the subsets, the region with the best prediction results is chosen as the informative region.
Fig. 1. The specific process of the CSMWPLS algorithm.
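The window bookkeeping described above can be sketched independently of the model that is built inside each window. The following is a minimal illustration in plain Python; the function names and the `score` callback are hypothetical, and indices are 0-based with an exclusive stop, whereas the text counts channels from 1.

```python
def moving_windows(n_channels, window_size, step=1):
    """Yield (start, stop) index pairs for one window size; stop is exclusive."""
    for start in range(0, n_channels - window_size + 1, step):
        yield start, start + window_size


def changeable_size_windows(n_channels, min_size, max_size, size_step):
    """Yield every window for every window size, smallest size first."""
    for size in range(min_size, min(max_size, n_channels) + 1, size_step):
        yield from moving_windows(n_channels, size)


def best_window(n_channels, min_size, max_size, size_step, score):
    """Return the (start, stop) pair with the highest score(start, stop)."""
    return max(changeable_size_windows(n_channels, min_size, max_size, size_step),
               key=lambda win: score(*win))
```

With a step of 1, `moving_windows(n, w)` yields exactly n - w + 1 windows, matching the count given in the text. The `score` callback stands in for whatever figure of merit the calibration model in each window produces.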

Data processing (CSMWPCA)
Following the basic idea of CSMWPLS, the changeable size moving window (CSMW) strategy was retained and PLS was replaced with PCA to construct CSMWPCA. In this study, the window size varied from 20 to 800 with an interval of 5. The dataset of 583 samples was split as follows: for each category, the samples were divided into calibration and validation sets at a ratio of ca. 9:1 (as shown in Table 1). Thus, a total of 524 samples were taken as the calibration set, and the remaining 59 samples were used for external validation.
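A compact sketch of how a CSMWPCA search could be organised is given below. It rests on assumptions that the paper does not spell out at this point: each window is scored by the fraction of calibration samples whose nearest class centre in the space of the first three PCs is their own class, and the first window reaching the highest score is kept. The function names are hypothetical and the code is Python with NumPy rather than the MATLAB programs used in the study.

```python
import numpy as np


def pca_scores(X_cal, X_new, n_components=3):
    """Project X_new onto the leading PCs of the mean-centred calibration block."""
    mean = X_cal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    return (X_new - mean) @ Vt[:n_components].T


def classification_rate(X_cal, y_cal, n_components=3):
    """Fraction of calibration samples whose nearest class centre is their own class."""
    y_cal = np.asarray(y_cal)
    scores = pca_scores(X_cal, X_cal, n_components)
    classes = np.unique(y_cal)
    centres = np.array([scores[y_cal == c].mean(axis=0) for c in classes])
    # Euclidean distance of every sample to every class centre in PC space.
    dists = np.linalg.norm(scores[:, None, :] - centres[None, :, :], axis=2)
    predicted = classes[dists.argmin(axis=1)]
    return float(np.mean(predicted == y_cal))


def csmwpca_search(X_cal, y_cal, sizes, n_components=3):
    """Return (best_rate, start, stop) over every window of every size in `sizes`."""
    X_cal = np.asarray(X_cal, dtype=float)
    n_channels = X_cal.shape[1]
    best = (-1.0, 0, n_channels)
    for w in sizes:
        for start in range(n_channels - w + 1):  # step of 1, n - w + 1 windows
            rate = classification_rate(X_cal[:, start:start + w], y_cal, n_components)
            if rate > best[0]:  # ties keep the first window found
                best = (rate, start, start + w)
    return best
```

For the settings reported here, `sizes` would be `range(20, 801, 5)`. Because one PCA is computed per window, the search is exhaustive and its cost grows with both the number of window sizes and the number of channels.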
After different pretreatment methods were applied to remove information not related to classification and the CSMWPCA technique was used to select an optimized sub-region of the spectrum, the quality of the PCA classification models was compared according to the following evaluation parameters.
Total classification (prediction) rate:
$$\text{Total rate} = \frac{\sum_{i} m_{ci}}{N} \times 100\%$$
Category rate:
$$\text{Category rate}_{i} = \frac{m_{ci}}{N_{ci}} \times 100\%$$
These two equations were applied to both the calibration and external validation sets, where $m_{ci}$ and $N_{ci}$ are the number of correctly classified (or predicted) samples and the total number of samples of category $i$, respectively, and $N$ is the total number of classified or predicted samples over all categories. Meanwhile, graphical scores plots were also used to illustrate the goodness of the models. Data pretreatments, variable selection and PCA in this study were carried out with in-house programs written in MATLAB (Ver. 7.1; The MathWorks, USA).
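The two evaluation parameters reduce to simple counting. The sketch below is a plain-Python illustration with a hypothetical function name; it follows the definitions in the text, with the number of correct assignments per category divided by the category size, and the sum of correct assignments divided by the total number of samples.

```python
from collections import Counter


def classification_rates(true_labels, predicted_labels):
    """Return (total rate, {category: category rate}), both in percent."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    totals = Counter(true_labels)  # number of samples per category
    # Number of correctly assigned samples per category.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    total_rate = 100.0 * sum(correct.values()) / len(true_labels)
    category_rates = {c: 100.0 * correct[c] / totals[c] for c in totals}
    return total_rate, category_rates
```

As a check on scale, 57 correct predictions out of 59 validation samples give 96.61%, which is consistent with the total prediction rate reported for the final model.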

Near infrared spectra
The original NIR spectra of all Nanfeng mandarin samples from the seven different geographical areas are displayed in Fig. 2. All the acquired spectral data were averaged. No obvious differences among the seven categories could be detected by visual inspection of the spectra over the whole spectral range. All samples show two significant absorption bands around 1190 and 1450 nm, which are generally assigned to water, because of their high water content. Therefore, multivariate calibration techniques must be used for modeling based on the near infrared spectra. In this study, chemometric data reduction (CSMWPCA) and pattern recognition methods are a natural choice for the analysis of such complex, inter-related NIR spectral data.

PCA applied to raw data
PCA aims to map the spectroscopic signals onto a low-dimensional space with the largest variability. When PCA was applied to the raw NIR spectral data of the 524 mandarin samples in the calibration set, the scores plot shown in Fig. 3 was obtained. In this three-dimensional scores plot, the "coordinates", i.e., the scores on the first three principal components, provide a measure of the distance of each sample to all categories. In this way, each class model is defined by a class space delimited by the distance: each model accepts samples whose distance to the corresponding central point is lower than that to the other classes. As can be seen, the class models of all categories studied here overlap seriously, showing very poor separation. These graphical results are confirmed numerically in Table 2: the total classification and prediction rates are 34.35% and 37.29%, respectively, and the category rates range from 6.74% to 74.07%. The fairly low correct rates reveal the problems that would arise in classifying real samples in the future, and motivate the aim pursued in this study, i.e., improving the final classification model by wavelength selection to make a more accurate practical application possible.

PCA after pretreatment and CSMWPCA variable selection
In view of the results of the class models developed on the original NIR spectra, we preprocessed these spectra with different pretreatment methods and selected informative wavelengths in an attempt to improve classification performance. Table 2 summarizes the classification and prediction rates of the class models developed on the raw NIR spectra and on the spectra after diverse spectral pretreatment and variable selection methods. Comparing the results obtained with all the pretreatment methods in Table 2, the second derivative was relatively successful in yielding a satisfactory classification model, providing total correct rates of 91.79% and 88.14% in classification and prediction, respectively.
Once the second derivative was selected as the pretreatment method for correcting the NIR spectra, CSMWPCA was applied to select useful wavelengths. Table 2 shows that, after the wavelength selection, the resulting class model had excellent discriminant power, with total classification and prediction rates of 97.52% and 96.61%, respectively, and 100% category rates for C2, C4, C5, C6 and C7, which indicates good clustering of mandarin samples from the various producing areas and describes the sample diversity qualitatively. These numerical results are visually confirmed by the scores plot (see Fig. 4), which shows a clear separation among classes, a considerable improvement achieved with low model complexity (290 variables). All the mandarin samples are closely clustered in their respective regions of the PC space, and the differences among most samples are pronounced. However, for samples from Fujian and Hunan provinces, i.e., C1 and C3, the difference between them was not as notable as for the others, and some samples overlap slightly. Nevertheless, PC1, PC2 and PC3 provide a good picture of the geographical classification of Nanfeng mandarins.

Conclusion
This study has illustrated the feasibility of applying NIRS combined with PCA to classify Nanfeng mandarin samples by geographical region. The success of this strategy depends largely on the wavelength selection method, which not only significantly enhances the accuracy of the classification model but also makes the model simpler. The classification model built on the second-derivative pretreated spectra and the CSMWPCA-selected wavelength region gave a substantial improvement over the model developed on the original spectra. The improved model showed a total classification rate of 97.52% and good prediction ability for the 59 samples of an independent test set, with a total prediction rate of 96.61%. The promising results reported in this study support the feasibility of a straightforward, fast and objective authentication of Nanfeng mandarin and determination of the producing areas of unknown samples.

Fig. 3. Three-dimensional scores plot from PCA of raw NIR spectra for calibration samples.

Table 1. The origins of the samples in the research.

Table 2. Percentages of correctly classified samples.