Fingerprinting black tea: when spectroscopy meets machine learning a novel workflow for geographical origin identification

,


Introduction
The 21st century has encompassed the Fourth Industrial Revolution and Information Age, an era marked with evolutionary technologies and massive quantities of big data (David et al., 2022).These recent advances in data science, particularly the development of artificial intelligence and more widespread use of machine learning, have reshaped many aspects of society.Simultaneously the health and wellbeing of our planet is being severely threatened by many challenges relating to climate change and sustainability.We live in a world where humans, animals, and the environment are connected under the "One Health" (FAO, 2023) in an unprecedented fashion.The significant threats to the global environment and economy create ideal socio-economic conditions for food fraud to thrive, as supply chains are challenges by more frequent extreme weather events, climate change and geopolitical tensions, whilst consumer income is under extreme pressure due to soaring inflation.Food fraud can be defined as "any deliberate action of businesses or individuals to deceive others in regard to the integrity of food to gain undue advantage".Food fraud is a direct challenge to food integrity, and as such, it is directly connected with a series of sustainability challenges, risking human, animal and environmental health, and leading to social impacts against the economy, justice, worker welfare and consumer trust.Conventional types of food fraud can include but are not limited to adulteration, substitution, dilution, tampering, simulation, counterfeiting, and misrepresentation and are perpetrated by a range of local, national, and global food chain actors as well as organised crime gangs (Soon & Wahab, 2022).The increase in globalisation and growing complexity of international supply chains has resulted in the emergence of new challenges, and the vulnerabilities which plague our food systems have been exposed through the exposure of public food fraud scandals (Lawrence et al., 2022).Thus, there is an urgent need to deliver rapid and cost-effective testing systems based on low cost analytical techniques coupled to data sciences using state-ofthe-art, multidisciplinary approaches, in order to detect and protect against food fraud.
Tea is the second most commonly consumed non-alcoholic beverage in the world and is produced from the new leaves and buds of the plant Camellia sinensis (Li et al., 2021).Amongst all tea categories, black tea currently dominates the global market and represents around 75 % of total worldwide consumption.Black tea production is forecast to increase annually by 2.1 % up to 2030 (FAO, 2022).The commercial price of black tea products is largely based on its distinct aroma, taste, and quality.These traits are mainly attributed by the selective expression of genes at the epigenetic level against the environmental stress from its habitat (Xia, Tong et al., 2020).Environmental factors such as soil fertility and soil elements, elevation, temperature, and precipitation can lead to substantial variation in the composition of chemical features within the plant.Tea can be processed in small batches using traditional crafts, alternatively it can be processed in much larger batches using modern technologies.These differences have further distinguished those artisanal teas from conventional products and has brought about up to a hundred-fold price variance amongst the different tea products, especially for those with a known geographical indication (GI).The protected designations of origin (PDO) and the protected geographical indication (PGI) systems, were first proposed by the European Union (European Commission, 2022) and later adopted by many other countries to protect these unique products by the implementation of regulations.Due to the similarity in physical and visual appearance of tea, discrimination through direct human sensory methods is subjective.Tea fraud occurs most commonly in ground tea powder which has been adulterated with hazardous chemicals, baking soda powder, non-food grade pigments, and spent tea ( European Commission, 2021).Additionally, Darjeeling tea, registered under the European Union PDO and PGI systems, has been reported as a prime example of fraud by geographical mislabelling, since the volume of Darjeeling tea sold worldwide far exceeds the reported production volumes (Kennedy et al., 2021).As one example, Indian police seized 4370 fake tea products with counterfeit labels from renowned brands ( European Commission, 2021) confirming fraudulent tea products to be a major area of concern.Such fraudulent activities damage the reputation of tea cultivation regions, erode consumer trust, and result in significant economic losses for producers.
According to tea industry stakeholders, tea authenticity testing is rarely performed, and the entire tea industry is purely 'based on trust', which makes the detection of such fraudulent activities extremely difficult as well as making the industry highly attractive to fraudsters.Meanwhile, the absence of robust regulations, comprehensive policy strategies, globally harmonized standards, and robust detection methods poses substantial challenges in terms of identifying, mitigating, and preventing tea fraud.The global black tea industry is under even greater threat due to the lack of transparency resulting in very high levels of vulnerability.Experts may be able to discern the geographical origin of a particular tea by visual perception and sensory evaluation.However, this analysis is highly subjective and lacks reproducibility due to individual and sample variability (Lim et al., 2021).
Targeted and non-targeted analytical methods have been applied for tea testing for several decades.For targeted analysis, several physicochemical indicators such as caffeine content, water extracts total polyphenols, catechins, free amino acids, as well as specific indicative compounds such as the pigments thearubigin and theabrownin (Fang, Huang, et al., 2019), have been quantified and used as testing parameters to evaluate the quality grades and tea categories.While targeted analysis is convenient in terms of strategy development, policy making, regulatory enforcement and inspections, it is insufficient to test sophisticated tea fraud issues, since the modus operandi of food fraudsters is to avoid indicators from standardised tests.In contrast, non-targeted analysis (NTA) has the advantage of obtaining comprehensive and unbiased profiles of the sample, generating high throughput data, and quickly extracting useful information through data mining (Cavanna et al., 2018).Several NTA techniques currently employed for discriminating the geographical origins of black tea are presented in Table S1 and include UV-Vis spectroscopy (Diniz et al., 2016), Fourier Transform Infrared (FTIR) spectroscopy (Arifah et al., 2022), Near Infrared (NIR) spectroscopy (Firmani et al., 2019), Nuclear Magnetic Resonance (NMR) spectroscopy (Cui et al., 2023), Inductively Coupled Plasma Mass Spectrometry (ICP-MS) (Ren et al., 2022), Liquid Chromatography MS (LC-MS) (Li et al., 2021), and Gas Chromatography MS (GC-MS) (Fang, Ning et al., 2019).However, MS-based instruments are sophisticated, expensive, time consuming, and have high demands for routine operation, which are only fit for laboratory confirmation (Black et al., 2016) and less suitable for rapid, on-site screening required by industry.In contrast, spectroscopy-based analytical tools can detect hundreds of molecular features simultaneously and provide a rapid, high throughput, unbiased, and non-destructive test, which is fit for on-site testing and effective management of fast-paced global food networks (McGrath et al., 2018).FTIR and NIR have the advantages of being rapid, non-destructive, cost efficient, environmental-friendly, and have shown successful applications in the real-time screening of food fraud in herbs and spices including garlic, black pepper, and oregano (McGrath et al., 2021).
When reviewing the scope of applications of non-targeted spectroscopic methods for traceability and authentication of black tea (Table S1), most studies have primarily focused on narrow-geographic origins, such as several provinces or core and non-core regions in China.Additionally, the number of samples analysed are usually limited to much less than 200 samples, making the assessment of real-world capabilities of these techniques challenging or indeed highly flawed.Moreover, many of these studies have primarily paid attention to discrimination only and have not extensively explored variable selection using machine learning.Conventional chemometrics such as partial least squares discriminant analysis (PLS-DA) rely on the assumption of a linear relationship between observed variables and response variables (Kamal & Karoui, 2015) which could lead to inferior prediction performance in real-life scenarios due to the characteristics of large sample numbers and factors relating to numeric influences.Recent advances in data sciences, especially with machine learning, have shown great potential in improving discrimination performance, minimizing model overfitting, and decreasing irrelevant features (Hong et al., 2023).
To the best of our knowledge, this study is the first to combine machine learning algorithms with FTIR and NIR spectroscopic analysis for the identification of geographical regions using a large number of black tea samples (n = 360).These samples were directly sourced from nine different GI regions across seven major tea cultivation and production countries around the world.The aim of this study is to develop a reliable and affordable analytical method to discriminate primary black tea cultivation regions worldwide using non-targeted spectroscopic techniques.This study focuses on the implementation of a real-time and sustainable spectroscopic screening test of black tea, facilitating the establishment of a comprehensive and transparent traceability system.Additionally, this study lays the foundation for the prevention of PGI, PDO, and geographical origin trademark infringements and the development of tea integrity risk management strategies against tea fraud.

Sample collection
A total of 360 black tea samples were collected through contacts in the tea industry directly from certified tea estates, plantations, factories, and smallholders.The collected samples originated from 4 cultivated African countries: Kenya (80 samples), Ethiopia (40 samples), Burundi (40 samples) and Malawi (40 samples); and 3 Asian countries: India (80 samples), Sri Lanka (40 samples, Ceylon black tea), and China (40 samples, Keemun black tea).Kenyan samples were further classified into 2 GI: Kenya Region 1 black tea (40 samples) and Kenya Region 2 black tea (40 samples).Indian samples were further classified into 2 GI: Darjeeling black tea (40 samples) and Assam black tea (40 samples).Detailed sample information is included in an Excel spreadsheet in the Supplementary Material.Tea samples were were properly marked, packed in sealed Polythene bags with a resealable strip to prevent from moisture and contamination, stored in black plastic recycled storage boxes at 22 • C ± 2 in a cool and dry environment.

Sample preparation
To minimize the influence of particle size, approximately 10 g from each black tea sample was added into a grinding jar and milled into a homogenous powder using a planetary ball mill (Retsch PM 100, Haan, Germany) for 3 mins at 500 RPM.All samples were treated identically to ensure the sample preparation was kept consistent prior to spectroscopic analysis.

Data acquisition
FTIR and NIR measurements were carried out using a Nicolet iS50 instrument (Thermo Scientific Inc., Dublin, Ireland) equipped with attenuated total reflectance (ATR) accessory (potassium bromide (KBr) diamond crystal).Following a background scan, the FTIR spectral data was collected within the range of 4000-400 cm − 1 with 32 scans at a resolution of 4 cm − 1 for each analysis.The NIR spectral data was collected within the range of 12000-4000 cm − 1 with 32 scans at a resolution of 4 cm − 1 for each test.All samples were scanned in triplicate and later averaged to mitigate the baseline shift caused by temperature and environmental changes in the lab.

Data pre-processing
During this work, the FTIR and NIR spectra of 360 authentic black tea samples were collected and later exported from OMNIC software (Thermo Scientific Inc., Dublin, Ireland).Replicates were averaged prior to the chemometric analysis.Full spectral range of NIR (12000-4000 cm − 1 ) and interval spectral range of FTIR (4000-2800 cm − 1 and 1800-550 cm − 1 ) were used for further analysis.The FTIR spectral region of 2800-1800 cm − 1 and 550-400 cm − 1 were removed from the analyses, considering its signal-to-noise ratio, and the absence of IR absorption peaks of interest within these intervals (Galvin-King et al., 2021).Light scattering is influenced by physical factors including particle size, shape and distribution.To remove undesired effects from the raw spectra and focus on only the data of interest, nine spectral pre-processing methods including Savitsky-Golay (SG), standard normal variate (SNV), multiplicative scatter correction (MSC), first-order derivation ( 1 DER) and their combination were explored in this study (Table S2).SG was applied to smooth the spectrum and reduce spectral noise, SNV and MSC were employed to reduce the multiplicative effects on the spectrum caused by light scattering and remove outliers, 1 DER was applied as a baseline correction technique to remove background influenced by particle size and to highlight characteristic peaks (Cruz-Tirado et al., 2023).The 1 DER was calculated according to the SG approach with 5 window points and 2nd polynomial.Unit variance scaling (mean centred, where the standard deviation is used as the scaling factor of each variable) was chosen for data normalization.

Multivariate statistical analysis
Initially, Principal Component Analysis (PCA) was used for unsupervised exploration of the data generated.PCA compresses original variables by projecting data in a new space called Principal Components (PCs) in order to extract the major sources of variance.PCA is regarded as the most common analysis for feature extraction (dimensionality reduction) and group clustering visualization (Cruz-Tirado et al., 2023).Afterwards, supervised analysis was conducted for discriminant analysis.Traditional chemometric modelling, namely PLS-DA, was first explored to separate groups and assess classification performance.PLS-DA is a supervised linear model based on the PLS algorithm used to address issues with many variables and covariance.Next, machine learning algorithms including Linear Discriminant Analysis (LDA), k-Nearest Neighbours (KNN), Support Vector Machine (SVM), and Random Forest (RF) were employed to improve overall classification accuracy.The LDA model is also a supervised linear model which firstly separates the group through constructed linear discriminant subspace, then assigns the unknown samples to a group through discriminant functions.In LDA, the variance of intra-group is taken to a minimum and the variance of inter-group are kept at maximum level apart, which can achieve good linear classification performance (Hong et al., 2019).The principle of the KNN classification algorithm involves determining attributes based on the category of the nearest k points when predicting new values (Yun et al., 2021).KNN algorithm is well-suited for parallel operation and resistant to noise in the analysed data.However, it is important to note that the hyperparameter 'k value' plays a crucial role in decision making, as each of the nearest 'k' neighbours holds equal significance in the KNN model.The SVM model functions by mapping nonlinear data from a low-dimensional feature space into a highdimensional space and constructing an optimal classification hyperplane to separate groups.It finds maximal margin hyperplanes with respect to a subset of the support vectors between different classes (Mustafa Abdullah & Mohsin Abdulazeez, 2021).SVM represents a convex optimization problem, which allows it to find the global minimum of the objective function instead of settling for a local optimal solution, thereby achieving better classification results.The RF model separates samples by randomly selecting variables and sample subsets to generate a forest of decision tree classifiers (Genuer et al., 2010).Each tree gives a classification results, and the majority vote within the forest is used to determine the final class (Santana et al., 2019).Therefore, the prediction accuracy improved, and the issue of model overfitting was controlled.However, it is essential to tune the hyperparameters of "decision tree numbers" and "selected variable numbers per tree" before evaluating the RF model performance.

Model validation and evaluation
The unsupervised PCA and supervised PLS-DA models were generated using SIMCA 14.1 chemometric software.To avoid model overfitting, PLS-DA models were validated by 7-fold cross-validation (including 309 samples in calibration dataset and 51 samples in test dataset per run) and permutation test (numbers of permutations = 200) using SIMCA software.The R 2 X , R 2 Y , and Q 2 values were applied to assess the PLS-DA model quality.R 2 X and R 2 Y measure if the model fits to the original data.R 2 X and R 2 Y represent the fraction of variance of the X and Y matrix, respectively.X matrix refers to FTIR or NIR spectral data, while Y matrix refers to corresponding classes.Q 2 represents the prediction accuracy of the model for test dataset.R 2 X , R 2 Y , and Q 2 values equal to indicate an effective model.
The machine learning classification models including LDA, KNN, SVM, and RF were conducted using R Studio software (version 4.0.5) with "MASS" package, "kknn" package, "svm" package, and "random-Forest" package, respectively.To avoid model overfitting, internal and external cross-validation was conducted.The entire workflow can be outlined as follows.
(a) The samples were split into a calibration set (75 % of dataset) and test set (25 % of dataset) through stratified splitting based on random sampling without replacement by R Studio software (version 4.0.5) with the "caret" package.Stratified splitting ensures that the class frequencies in the calibration set and test set match those of the overall dataset.The sampling was performed using random sampling without replacement to guarantee that test samples do not overlap with the calibration set, preventing any risk of model leakage and thus avoiding unreal model evaluation performance.In this study, all 360 samples obtained from nine GI regions were divided into a calibration dataset (270 samples, 30 samples per region) and test dataset (90 samples, samples per region).
Y. Li et al.(b) The calibration set was used to tune hyper-parameters and modelling.Internal validation was then applied using a 5-fold cross-validation to evaluate model prediction and accuracy of calibration models.(c) The samples in the test set acted as blind samples with no labels and were used to conduct external validation to evaluate model reliability and prediction performance.
Machine learning models were statistically validated by assessing the sensitivity (SEN) and accuracy (ACC) calculated using Equation 1 (Eq.1) and Equation 2(Eq.2),respectively.SEN(%) = TP/(TP + FN) × 100 (1) where TP = true positive, TN = true negative, FP = false positive and FN = false negative.The sensitivity (SEN) of each group refers to the ratio between the number of correctly predicted samples and the actual sample numbers within each group.The optimal prediction performance would have no misclassification within the confusion matrix, and a value of 100 % for both the sensitivity (SEN) of each group and whole accuracy (ACC).

Tea profiling using FTIR and NIR spectroscopy
The averaged raw spectra for black tea (comprising Burundi, Keemun, Ethiopia, Assam, Malawi, Ceylon, Darjeeling, Kenya 1, and Kenya 2) analysed using FTIR is shown in Fig. 1 (A).The pattern of absorbance values across each spectral band remained consistent, and it could be seen that the significant variations were found within the diagnostic region (3800-2800 cm − 1 ) and the fingerprint region (1800-550 cm − 1 ).The broad absorption band at 3310 cm − 1 was linked to the hydroxyl (OH) functional groups of alcohols and phenolic compounds (Brza et al., 2020).The band in the region of 1460-1420 cm − 1 was characteristic for the presence of C-H symmetric bending vibration of methylene groups and the C -H bending vibration of alkane (Lin & Sun, 2020) ( Brza et al., 2020).The averaged raw NIR spectra is shown in Fig. 1 (B).The bands at around 4711-4585 cm − 1 may relate to the N -H bending vibration of amino acid (Lin et al., 2020).The band at 5852-5733 cm − 1 indicated C -H stretching vibration (Diniz et al., 2014).Overall, the spectral differences between tea samples of different origins are negligible, thus making it impossible to directly distinguish GI through visual spectral inspection alone.Therefore, as a further step, spectral preprocessing and multivariate statistical analysis were used to determine nine different GI regions around the world.

Spectral pre-processing
Spectral pre-processing aims to minimise consistent baseline offsets and biases in the spectra due to differences in the scattering profile of solid samples.However, there is no sufficient evidence to suggest that a single mathematical pre-processing method is universally suitable for all types of data and purposes.Therefore, nine spectral pre-processing methods (highlighted in Table S2) were tested using PLS-DA modelling through SIMCA software to evaluate the optimal choice of preprocessing methods for this study.A summary of the PLS-DA model results for FTIR and NIR data is shown in Table 1.For the original raw data without any spectral pre-processing (M1-M2), the R 2 X for FTIR and NIR data were 0.999 and 1.000, respectively, which indicated the developed model has a high quality of fit.The Q 2 for FTIR and NIR data were 0.625 and 0.597, which indicated that the developed PLS-DA model based on FTIR/NIR data without spectral pre-processing has relatively poor predictive capabilities, but with the potential for sizeable improvement.
As one of the most commonly used denoising techniques, the SG method was used to suppress errors superimposed on raw spectral signals with the goal of improving the signal-to-noise ratio (SNR).After the addition of SG pre-processing (M3-M4), the Q 2 value for FTIR and NIR were 0.625 and 0.596, respectively, which suggests that SG smoothing did not significantly improve the SNR for classification.This phenomenon might be attributed to the high spectral resolution (4 cm − 1 ) utilized for FTIR and NIR data acquisition, obviating the need for additional smoothing procedures.Whereas after conducting SNV preprocessing (M5-M6), the Q 2 value for FTIR and NIR increased to 0.687 and 0.631, and the R 2 X value remained stable (R 2 X = 0.987 and 0.999 for FTIR and NIR, respectively), which indicated that SNV could improve the classification accuracy for independent datasets.MSC pre-processing showed the same trends (M7-M8).This might be explained by the fact that SNV and MSC could mitigate multiplicative effects and minimize the impact of light scattering phenomena.This finding was in accordance with Si-min (Yan et al., 2014) who reported that SNV has the ability to enhance the sensitivity and specificity of the PLS-DA model built to distinguish the Anxi and non-Anxi varieties of Oolong tea from China.After 1 DER pre-processing, the Q 2 value for FTIR (M9) increased to 0.710, and the R 2 X value was close to that of the Q 2 value, which suggested 1 DER individually had a good level of fit and high level of predictability.The same results were obtained using NIR data (M10).This phenomenon might be attributed to the 1 DER pre-processing have the advantages of removing constant background signals for baseline correction, enhancing the visual resolution, resolving overlapping peaks.
Afterwards, different combinations of pre-processing methods were applied to assess the impact on data fit and accuracy including SNV, MSC, and 1 DER and the order of occurrence on the prediction performance were taken into consideration (see Table 1 for different combinations applied).It is interesting that the Q 2 value of 1 DER + SNV and SNV + 1 DER were different for both FTIR (M11, M13) and NIR (M15, M17), which indicated that the prediction performance is influenced by the orders of different combinations of spectral pre-processing methods.In addition, the results showed that the most suitable pre-processing for FTIR data was SNV plus 1 DER (M13) since it led to the highest prediction performance (Q 2 = 0.754).Whereas for NIR data, the multi-class PLS-DA model with 1 DER plus SNV pre-processing (M15) gave the highest prediction Q 2 value of 0.731.
In summary, spectral pre-processing has shown the ability to increase prediction ability, although the improvement degrees are diverse.SG showed very little enhancement whether for FTIR or NIR data, while 1 DER improved the values significantly.Furthermore, the performance of combinations of several pre-processing methods were better than one individual method.Finally, SNV plus 1 DER (Fig. 1 C) was determined as the optimal spectral pre-processing method for multi-class GI discrimination models using FTIR spectra, while 1 DER plus SNV (Fig. 1 D) was identified to be most suitable for models using NIR spectra.The preprocessed spectral data was used for further chemometrics modelling and machine learning modelling.

Unsupervised PCA exploration
Initially, as one of the unsupervised models, PCA was constructed to observe how the samples were clustering and separating by themselves with no supervision.Multi-class unsupervised PCA score plots of the FTIR and NIR data are highlighted in Fig. 2. For FTIR, the first 2 PCs cumulatively accounted for 43.3 % of the total variation with PC1 explaining 24.8 % and PC2 explaining 18.5 % (Fig. 2 A).The R 2 X value and Q 2 value were 0.840 and 0.813, respectively.Differences among groups in the PCA score plot are generally caused by major variance contributors, for PCs in the PCA model corresponding to the directions of highest variance.The groups of African countries including Kenya, Burundi, and Malawi were clustered with negative scores in PC1.In contrast, clusters of Asian regions, such as Keemun, Darjeeling, and Ceylon, were observed with positive scores in the same PC, which suggested that PC1 may be associated with the variances between Asian and African countries detectable using FTIR.In PC2, samples from Keemun were distributed with positive scores whereas other Asian samples presented negative scores.
For NIR, the first two PCs cumulatively accounted for 56.4 % of the total variation with PC1 explaining 33.8 % and PC2 explaining 22.5 %.The R 2 X value and Q 2 value were 0.772 and 0.742, respectively.From the NIR-PCA score plot (Fig. 2 B), it was observed that the intra-group variances of Asian countries, especially for Keemun, Darjeeling, and Assam, were larger than African countries like Kenya, Burundi, and Ethiopia.This phenomenon indicated that the sample collected from Asian regions possess greater diversity and representativeness, which might be a result of the substantial amount of market value with GI such as Keenum (China), Darjeeling (India), and Assam (India) black tea.The tea samples collected from African countries of Malawi, Ethiopia, Kenya, and Burundi, were predominantly situated in the centre of the PC1 and PC2 axes.At the same time, the separation amongst groups was more difficult to observe, as samples from one region were overlapping with those from other regions, thereby making it challenging to directly discriminate nine GI regions using PCA analysis.In summary, the unsupervised PCA analysis showed natural intra-group variance and group separation amongst samples from different GI regions, which demonstrated the robustness and reliability of the data for further chemometric modelling using supervised techniques.

Traditional PLS-DA analysis
Next, supervised PLS-DA modelling was carried out through 7-fold cross validation to further explore the differences of black tea amongst the nine geographical regions.For FTIR data, the PLS-DA model showed a more pronounced separation and good data prediction ability, mainly reflected in the separation of Darjeeling and Keemun from other regions (Fig. 2 C).The first two components cumulatively accounted for 34.6 % of the total variation with predictive component 1 (P1) highlighting the variation between African and Asian tea samples, and the P2 highlighting the variation amongst samples from China, Sri Lanka, and India (12.2 %) (Fig. 2 C).After 7-fold cross validation, the accumulated variance contribution rate was 84.3 % (R 2 X = 0.843), indicating that the model was a good fit for the dataset.Furthermore, the Q 2 value was 0.812, which demonstrates that the model had acceptable predictability against external datasets.To ensure the model was not overfitted, permutation tests of the PLS-DA model was carried out through replacement trials.The order of labels was randomly permuted times, and separate models were fitted to all the permuted labels.The permutation plot (shown in Fig. 2 E) displayed the correlation coefficient between the actual sample labels and the permuted labels on the x- axis versus the cumulative R 2 and Q 2 on the Y-axis.As a measure of model overfitting, the intercept of R 2 and Q 2 were 0.148 and − 0.409, respectively, suggesting that the model was not overfitted and was reliable.The PLS-DA score plot of NIR data was shown in Fig. 2 D. The first two components cumulatively accounted for 56.3 % of the total variation with P1 explaining 33.8 % and P2 explaining 22.5 % of the total variation.The groups of Darjeeling clustered with negative scores in PC2, whereas the Assam group clustered with positive scores in PC2.After 7-fold cross validation, the R 2 X value was 0.763 and Q 2 value was 0.768, indicating a variance contribution rate and predictability of 76.3 % and 76.8 %, respectively.After permutation test of 200 replacement trials, the intercept of R 2 and Q 2 were 0.518 and − 0.52 (Fig. 2 F), suggesting that the model was not overfitted.However, it was considered possible that the prediction accuracy could be further improved upon by the application of machine learning.

Machine learning analysis
To improve the discrimination of black tea geographical origin, 4 types of machine learning classification models were conducted: LDA, KNN, SVM, and RF.

Machine learning modelling
Firstly, LDA modelling was conducted.In this study, orthogonal PCs extracted from PCA with variances > 95 % were regarded as input variables into LDA linear classifiers.The accumulative explained variance of PCs for FTIR and NIR data are shown in Fig. S1.For FTIR, the first 44 PCs with 95.1 % variance were used to construct the LDA model, and the corresponding plot was shown in Fig. 3 A. For NIR data, 139 PCs with 95 % explained variance were used to construct LDA model, and the corresponding plot was shown in Fig. 3 B. Overall, LDA score plots highlight the obvious separation amongst different geographical regions using both FTIR and NIR.The samples of intra-group clustered tightly with 95 % confidence ellipse which confirms that samples within the ellipse can be classified to one group with 95 % confidence.Furthermore, in FTIR data, by conducting a 5-fold cross validation, only one sample from Ceylon was misclassified as Assam (Fig. 3 C).The sensitivity of Ethiopia, Burundi, Malawi, Keemun, Assam, Darjeeling, Kenya, and Kenya region 2 reached 100 % and the accuracy was 99.6 %, suggesting good separation and robustness using LDA.For NIR data, after 5fold cross validation, only one sample from Kenya region 2 was misclassified as Kenya (Fig. 3 D).This might be explained by the adjacent lands in Kenya that share similar plantation climate and black tea production procedures.
Secondly, KNN modelling was conducted.In this study, the "K-value" and "Distance" hyperparameters were tuned using the calibration dataset through grid search (Table S3-S4).Finally, the "K-value" and "Distance" were set as 5 and 1, respectively to construct the KNN model with a rectangular kernel; the prediction results for the KNN model are highlighted in Table 2.After 5-fold cross validation, the overall accuracy of KNN based on FTIR data was 100 % indicating very good discrimination of geographical origins for black tea samples, with no samples misclassified by the confusion matrix (Table S5).However, for NIR data, only a 97.4 % correct classification rate was obtained with two samples from Assam misclassified by the confusion matrix as Burundi and Ceylon (Table S6).In summary, the prediction performance of the KNN model whether for NIR or FTIR was improved compared to the PLS-DA model.These results were consistent with Xu (Xu et al., 2012), who reported that the KNN method could achieve discrimination for Chinese green tea production seasons with 94.8 % accuracy.
Thirdly, SVM modelling with a linear kernel was carried out in this study.The hyperparameters of SVM mode was linear SVM classifiers with cost value = 0.05, as it achieved the best identification accuracies for nine regions using both NIR and FTIR in a trial experiment of hyperparameter tune.Fig. 3 E illustrates the confusion matrix for the SVM model obtained through a 5-fold cross validation using FTIR data.
The SVM model based on FTIR data achieved the 100 % accuracy and 100 % sensitivity for each group (Table 2).Good separation and prediction ability were also confirmed using NIR data with an accuracy value of 99.3 % (Table 2).Among the 270 samples tested, only one sample from Ceylon was misclassified as Assam and one sample from Darjeeling was misclassified as Kenya group, as confirmed by the confusion matrix (Fig. 3 F).These results were in accordance with Cardoso (Cardoso & Poppi, 2021), who found that SVM provided accuracy increases of 11 % when compared to PLS-DA model for identifying green tea adulterations using handheld and benchtop NIR spectrometers.
Finally, RF modelling based on decision tree and bagging strategy was conducted.After trial calculation of hyperparameter tune, 217 decision trees with 15 randomly selected variables per tree (Fig. S2-S5) were deemed as optimal RF models for FTIR data.While 600 decision trees with 80 randomly selected variables per tree (Fig. S6-S8) were regarded as optimal for NIR data.Fig. 4 A-B highlights the confusion matrix for RF model obtained through 5-fold cross validation for FTIR and NIR data, respectively.The accuracy of the FTIR calibration set was calculated as 99.6 % (Table 2), indicating good separation ability.Only one sample from Darjeeling was misclassified as Assam (Fig. 4 A).The prediction performance of RF model for NIR data was also good and the accuracy achieved 99.3 % (Table 2).It has been reported that RF models are capable of handling missing data, correlated predictor variables and nonlinearity.In addition, they are insensitive to noise and can handle very large numbers of input variables with high dimensionality (Deng et al., 2020).

External validation
The external test dataset made up of an independent 90 samples (with 10 samples per geographical region) was tested on the already established supervised models and the results for five models are shown in Table 2.The best classification results in the test set were obtained using SVM, KNN, and RF model (for FTIR data) and LDA model (for NIR data), giving 100 % sensitivity and 100 % accuracy for nine different GI origins (Table 2), indicating reliability and good generalization of the model data.Although for FTIR data, the accuracy of LDA model was 98.89 %, it still showed good predictability for unknown samples with only one sample from Kenya being misclassified by the confusion matrix as Burundi for the LDA model (Table S7).On the contrary, the LDA model based on the data obtained using NIR achieved 100 % accuracy for 9 geographical origins (Table 2), suggesting that the most suitable machine learning models are different for various analytical techniques.For NIR data, the accuracy of all other machine learning models was higher than 97.7 %, demonstrating good prediction ability for new samples (Table 2).Amongst them, samples from Darjeeling and Assam are commonly misclassified.

Key spectral bands identification
To identify important spectral bands for the discrimination of nine GI regions, the variable importance (VI) was evaluated through the Gini importance index and permutation importance index of the RF model.This evaluation was validated by rebuilding the RF model for both internal and external validation.Gini importance index is defined as the average of its importance values across all trees in the forest.Permutation importance index is derived through randomly shuffling the values of one analysed variable in the out of bag (OOB) samples and comparing the classification accuracy between the intact OOB samples and the OOB samples with the particular feature permutated (Chen et al., 2011).Through variable selection, 438 features were selected out of the original 1271 variables using FTIR spectra, and 481 features were selected out of the original 4150 variables using NIR spectra (shown as grey lines in Fig. 4   epicatechin gallate (Xia, Wang et al., 2020).In the NIR spectra, absorption at around 4344 cm − 1 corresponded to C-H stretching vibration and C -C stretching vibration and the band at 4250-4265 cm − 1 related to C-H symmetric stretching vibration and C-H bending vibration (Lin et al., 2020).To validate that these variables are indeed fundamental for discriminating GI regions, the RF model was rebuilt using the selected ones, and the predicted results of internal validation are shown in Fig. 4 E-F.In external validation, the rebuilt RF model based on selected variables achieved higher than 98 % accuracy (Table S8).These suggested that the variable importance evaluation based on RF model was validated and effective at identifying key wavenumbers that are associated with the black tea authenticity of nine GI regions.

Summary
In summary, to improve the classification performance of black tea, four machine learning multi-classification models including, LDA, KNN, SVM, and RF were built based on the data obtained using FTIR and NIR spectroscopy.The results proved that machine learning models could improve the prediction performance over conventional PLS-DA modelling.The machine learning models were validated using an internal 5fold cross-validation and external independent validation.After validation, the best classification results between the two individual spectroscopic techniques were obtained by FTIR spectroscopy with KNN and SVM model, giving the classification accuracy of 100 % for internal and external validation.In addition, when comparing the robustness of the machine learning method used to identify nine GI regions of black tea for both NIR and FTIR, it was determined that SVM and LDA models were superior to both RF and KNN, as the classification accuracy of the models were higher than 98 % for both internal and external validation.Furthermore, variable importance evaluation based on the RF model was introduced to discover the most important spectral bands for interpretation.Non-targeted FTIR and NIR fingerprinting analysis technologies are fast, convenient, and effective, enabling the rapid geographical traceability screening of black tea within the global supply chain.As a result, suspect tea samples related to their geographical origin could be screened on-site, with only samples requiring further confirmatory analysis being sent to a laboratory in cases of suspicion or inconclusive results.
Future work should focus on validating model transferability across portable and handheld devices for on-site and real-time screening.In addition, more representative samples could be collected year by year to update the black tea GI regions database and enhance the robustness of the models.

Conclusion
The proposed non-targeted spectroscopic fingerprinting workflow using FTIR and NIR techniques combined with machine learning algorithms, offers an efficient, robust, and rapid method for discriminating GI regions of black tea.In this study, a total of 360 black tea samples sourced from prominent tea cultivation regions across the world, including China, Darjeeling (India), Assam (India), Sri Lanka, Kenya, Ethiopia, Burundi, and Malawi were analysed.The results indicated that the combination of SNV and 1 DER spectral pre-processing methods can increase the robustness of the prediction model.Additionally, machine learning models including LDA, KNN, SVM, and RF have demonstrated superior prediction performance compared to traditional PLS-DA modelling.The best classification results from the two spectroscopic techniques were obtained using FTIR spectroscopy combined with KNN and SVM modelling, giving the classification accuracy of 100 % through internal 5-fold cross-validation and independent external validation.Furthermore, a set of significant wavenumber regions in FTIR and NIR spectra for discriminating black tea GI regions were identified and validated.Overall, the developed workflow is a novel, rapid, easy to operate, cost-efficient, and non-destructive method, and it can be regarded as a "green analytical technique" since no solvents and reagents are used during the process.This work has shown excellent potential to be extended towards an on-site, real-time solution for industrial applications and lays the foundation for GI inspections throughout the entire supply chain, particularly within developing countries where tea cultivation is prominent.However, further work is required, such as enhancing the size and diversity of the database and focusing on validating model transferability across portable and handheld devices.Y. Li et al.

Fig. 2 .
Fig. 2. PCA and PLS-DA analysis of geographical origin multi-classification. (A) PCA score plot of FTIR spectra; (B) PCA score plot of NIR spectra; (C) PLS-DA score plot of FTIR spectra; (D) PLS-DA score plot of NIR spectra; (E) Permutation test plot of FTIR spectra; (F) Permutation test plot of NIR spectra.
C-D).The variables at around 1220-1240 cm − 1 gave evidence of C -O stretching vibration of epigallocatechin and scissor bending vibration of L-theanine and the band in the region of 1137-1156 cm − 1 corresponds to the C -O -C stretching vibration of gallocatechin, Y.Li et al.

Fig. 3 .
Fig. 3. Machine learning analysis of geographical origin multi-classification validated by internal validation.(A) LDA score plot of FTIR spectra; (B) LDA score plot of NIR spectra; (C) Confusion matrix of LDA model based on FTIR spectra; (D) Confusion matrix of LDA model based on NIR spectra; (E) Confusion matrix of SVM model based on FTIR spectra; (F) Confusion matrix of SVM model based on NIR spectra.

Fig. 4 .
Fig. 4. Random Forest analysis of geographical origin multi-classification validated by internal validation.(A) Confusion matrix of RF model based on FTIR spectra; (B) Confusion matrix of RF model based on NIR spectra; (C) Averaged FTIR spectra with variable important features based on RF model; (D) Averaged NIR spectra with variable important features based on RF model; (E) Confusion matrix of RF model based on variable important features of FTIR spectra; (F) Confusion matrix of RF model based on variable important features of NIR spectra.

Table 1
Summary of multi-classification results for spectral pre-processing methods based on the PLS-DA model.
Y.Li et al.