The cluster analysis of traditional Chinese medicine authenticity identification technique assisted by chemometrics

This study explore the authenticity identification technique of traditional Chinese medicine (TCM) using chemometrics in conjunction with cluster analysis. A clustering Gaussian mixture model was constructed and applied for the data clustering analysis of four types of TCM. Chemical measurements combined with discrete wavelet transform (DWT), Fourier transform infrared spectroscopy (FTIR), and Fourier self-deconvolution (FSD) were utilized for the detailed differentiation of Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var. parvifolium. Differences in the attenuated total reflection-FTIR (ATR-FTIR) spectra among the four TCMs were observed. Utilizing clustering algorithms, the one-dimensional DWT of the infrared spectra of samples was employed for the authentication of Chinese herbal medicines. The model demonstrates optimal performance throughout 2000 rounds of network training. The accuracy (88.6 %), sensitivity (86.5 %), and specificity (82.7 %) of the model constructed in this study significantly surpassed those of the CNN model: accuracy (67.7 %), sensitivity (70.4 %), and specificity (68.5 %) (P < 0.05). By setting the cluster size K = 5 and the number of Gaussian mixture model components to 5, the model effectively fits the actual number of categories within the dataset. Infrared spectroscopy analysis revealed distinct carbon-oxygen stretching vibration absorption peaks between 1025 and 1200 cm−1 for Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var. parvifolium, indicating strong absorption peaks of carbohydrates. A comprehensive structural information analysis revealed a similarity of above 0.982 among the four types of TCM. Combined with chemometrics and intelligent algorithm-based cluster analysis, successful and accurate authentication of TCM authenticity was achieved, providing an effective methodology for quality control in TCM.


Research background and motivations
Traditional Chinese medicine (TCM) plays a significant role in maintaining public health.Authenticity identification of TCM has been a crucial issue within the field of TCM, acting as a pivotal aspect in quality control of TCM products [1].With the widespread

Research objectives
Traditional methodologies for authenticating Chinese herbal medicine are prone to subjective judgments, which can lead to potential interference from individual biases.However, integrating chemometrics with cluster analysis techniques mitigates subjective judgment errors by employing data-driven analysis, thereby enhancing the objectivity of authentication.The application of chemometrics allows for comprehensive and systematic determination of multiple components within Chinese medicine, which may improve the accuracy and credibility of authentication.By combining chemometrics and cluster analysis, multiple indicators can be considered simultaneously, facilitating a holistic evaluation of the authenticity of Chinese medicine and aiding in uncovering underlying patterns within complex systems.This study explores a novel approach to authenticating Chinese herbal medicine by integrating advanced chemometrics and cluster analysis techniques to analyze molecular characteristics.This integration offers a new and efficient solution for authenticating Chinese herbal medicine, thereby advancing quality control and supporting industrial development.Additionally, it aims to foster innovative approaches in the field of herbal authentication, driving the overall advancement of this domain.The flowchart of the current study is depicted in Fig. 1.

Chemometrics
The core of chemometrics involves translating complex chemical information from samples into mathematical models.In the analysis of authenticity in TCM, chemometrics employs two key methodologies: Partial Least Squares (PLS) regression and Principal Component Analysis (PCA).These methodologies are used for classification, clustering, and variance analysis of samples by processing mathematical models.

The following code implements PLS regression
PLS regression is a multivariate statistical method designed for building predictive models.It functions by establishing a linear regression model that maximizes covariance between the dependent and independent variables, extracting latent variables to predict the response variable.PLS is particularly well-suited for datasets with high correlation and multicollinearity among independent variables.In this study, PLS is used to analyze the relationship between infrared spectral data and the chemical composition of Chinese medicinal materials.By employing the PLS model, complex spectral data are simplified into a few latent variables, which are then used to predict sample categories, enabling the identification of authenticity and variety in Chinese medicinal materials.

The principles of the two algorithms
PCA is a dimensionality reduction technique that transforms high-dimensional data into lower-dimensional data while retaining as much of the original data variance as possible through linear transformations.By identifying the principal components, which are directions of maximum variance in the data, PCA projects the data onto these components, achieving dimensionality reduction and feature extraction.In this study, PCA is used to reduce the dimensionality of infrared spectral data.Through PCA, high-dimensional spectral data are transformed into several principal components that explain a significant portion of the data variance.The results of PCA enhance our understanding of the differences between various Chinese herbal samples and provide a foundation for further cluster analysis.
Fig. 1.Flowchart of the herbal processing procedure in this study.

Cluster analysis 2.2.1. Algorithm and model
The clustering algorithm primarily comprises three components.
i. Establishing a matrix A to represent the sample set; ii.Computing K eigenvalues and eigenvectors within A; Initially, the original sample data are mapped to a one-dimensional space (k = 1) within a multi-dimensional space.This space, denoted as B', is generated from K orthogonal vectors in a k-dimensional space.
iii.Utilizing the rows of the k-dimensional subspace A′ as a new data representation for the samples, followed by clustering of the samples.
The objective function operates within a one-dimensional space to optimally partition the data based on the principle of the objective function.It iteratively partitions the delineated subgraphs.Subsequently, the clustering algorithm functions within the kdimensional space, establishing a broad framework for the algorithm's execution.
The objective function equation is A = D, see equation ( 1): in the equation, D is a diagonal matrix, see equation ( 2): W ij represents the weights of connecting vertices i and j, and the Laplacian matrix of the graph is L = D-W.
Structural clustering utilizes a Gaussian mixture model, a probabilistic framework based on multiple Gaussian probability density functions.The corresponding equation ( 3) is as follows: The number of Gaussian components K is determined by minimizing the Bayesian information criterion, which assesses the fit of the Gaussian mixture model to the observed data.The training data is directly obtained from simulated raw output, categorizing some irregular aggregates as "other" classes.For irregular local results, the reconstruction loss should exceed that of aggregates with regular structures.The Gaussian model partitions the entire dataset by testing different components, calculating the maximum cluster loss HK when the component number is K, which represents the loss of irregular aggregates.equation ( 4) is as follows: H p,K and H w,K represent the reconstruction loss and weighted reconstruction loss of cluster K, which can be computed for different K values to obtain the optimal number of components, with the corresponding K obtained at the inflection point.
Utilizing the Gaussian model for structural clustering and employing autoencoders for dimensionality reduction, Fig. 2 illustrates the workflow for local structure recognition.Initially, the point cloud undergoes representation, followed by structural encoding and bottleneck dimensionality.Subsequently, structural clustering is performed, concluding with prototype visualization and visual fillings.The bottleneck dimension, denoted as C, is a critical factor in feature learning and network training; a dimension that is too small may lead to underfitting, while an excessively large dimension could result in overfitting.In this study, an online data augmentation network that randomly rotates point clouds at selected stages is employed.P' represents the reconstructed point cloud, for which the loss function is computed for each output in relation to the corresponding input, followed by the backpropagation process.

Loss function
The loss function quantifies the discrepancy between the predicted values and the ground truth, aiming to minimize this error through network training.equation ( 5) below represents the loss function employed in this study: in the equation, Wp represents the reconstruction loss; Wr represents the center-weighted reconstruction loss; Wn represents the overlap loss; Wc represents the cross-rotation-consistent loss; and We is the regularization loss.This work adopts the Chamfer distance as the reconstruction loss function, defined for two sets of points, P 1 and P 2 , of the same size.The Chamfer distance equation ( 6) is as follows: where, |N| represents the size of set N.
The center-weighted reconstruction loss function equally treats the mappings from input to output and from output to input.The aggregates in this study are defined based on the proximity to the central particles.After readjusting the reconstruction loss function, the asymmetric form of the Chamfer distance is represented by Equation (3).
The center weighted reconstruction loss Wr equation ( 7) is as follows: Qr, λ CW , and D CW indicate weight coefficients.
W n represents overlap loss, see equation ( 8) below: is the overlap function of particles i and j, Q n is the number of particle pairs, and M n is the weight coefficient of the loss function.W c represents cross rotation consensus loss.W e is the regularization loss, see equation ( 9) below: Rr is the weight coefficient of the loss function, and φ is the training parameter of the network.

Experimental design and performance evaluation
(1) Discrete wavelet transform (DWT) DWT is a signal processing technique that analyzes the frequency-domain characteristics of a signal by decomposing it into wavelet basis functions at different scales.Widely applied in signal processing, image processing, and data compression, DWT provides localized information in both time and frequency domains, enabling signal analysis across various scales.
(2) Fourier transform infrared spectroscopy (FTIR) FTIR provides comprehensive chemical composition data of herbal samples, aiding in the analysis of various organic constituents present in traditional medicines.FTIR spectra help identify typical functional groups within herbal components and their molecular structures.By comparing spectral features among different herbal medicines, FTIR assists in authenticating and detecting moisture content in herbal products, ensuring compliance with storage and usage standards.
(3) Fourier self-deconvolution (FSD) FSD enhances the resolution of spectra.Due to the complexity of herbal materials, employing FSD effectively facilitates the observation of overlapping absorption peaks in samples.Coupling this technique with DWT for image analysis and processing aids in reducing and optimizing data, thereby enabling accurate classification of authenticity in herbal medicines.

Experimental materials
Bupleurum scorzonerifolium refers to the dried roots of the umbrella plant Bupleurum scorzonerifolium Willd; Bupleurum yinchowense refers to the dried roots of the umbrella plant Bupleurum yinchowense Shan et Y. Li.Bupleurum marginatum represents the dried whole herb of Bupleurum marginatum Wall.exDC, a member of the Umbelliferae family.Bupleurum smithii Wolff var.parvifolium pertains to the dried roots of the Umbelliferae plant Bupleurum smithii Wolff var.parvifolium Shan et Y. Li.After collection, all samples are air-dried indoors and further dehydrated in an oven at 60 • C for 72 h.Ten randomly selected root samples are pulverized using a grinder into fine powder, weighed at 10.0 mg, and set aside for further analysis.

Measurement methodologies
The attenuated total reflection-FTIR (ATR-FTIR) spectroscopic technique was employed to analyze surface composition and structural information of materials.In this study, the diamond ATR accessory was placed horizontally in the sample compartment of the FTIR instrument as per instrument requirements, ensuring a consistent contact area between the material powder and the device.Mechanical calibration pressure was kept constant, and a background scan was conducted before each sample measurement.Each measurement was repeated three times.The obtained infrared spectra underwent baseline correction.The ATR-FTIR spectra were processed using the OMNIC E.S.P5.1 software to acquire FSD infrared spectra.Subsequently, a clustering algorithm was applied to analyze the one-dimensional continuous wavelet transforms of the sample's infrared spectra, aiming to observe differences in the ATR-FTIR spectra and thereby authenticate the authenticity of TCM.

Performance evaluation metrics
A reasonable assessment of performance metrics effectively evaluates the performance of algorithms.Ultrasound image crosssections were evaluated as a binary classification problem, with model-predicted categories divided into true negatives (TN), false positives (FP), true positives (TP), and false negatives (FN).The accuracy calculation is shown in Equation (10), where higher classification accuracy indicates better algorithm performance.Precision measures the proportion of true positive samples among all samples predicted as positive, while specificity, also known as recall, represents the proportion of true positive samples among all positive samples.The harmonic mean of recall and precision is higher when both recall and precision are high; if one is low, the harmonic mean decreases, approaching the lower value, as illustrated in equations 10-13.

Experimental environment
The network was trained using the LJSpeech publicly available dataset, which comprises 131,000 recorded data entries.The experimental hardware setup included 128 GB of RAM, two NVIDIA 1080 Ti GPUs, and an Intel Xeon® Silver 4110 CPU @ 2.10 GHz with 32 cores, running on the Ubuntu 16.04 operating system.PyTorch 1.2.2 served as the deep learning framework, utilizing Python Y. Bai and H. Zhang Heliyon 10 (2024) e37479 3.6 as the development language.

Parameter setting 3.3.1. Training network results
The Adam optimization methodology [34] was employed for training the network over 2000 epochs.The prediction mechanism was initiated after the first 500 epochs, during which the learning rate gradually increased from 0 to 0.001.Subsequently, a learning rate exponential decay commenced at 1500 epochs (Fig. 3).

Dataset parameter selection results
The dataset comprising LJ particles was selected to validate the algorithm.Initially, raw data were preprocessed to fit the algorithm model in this study, with the particle radius set at 0.5.The weights of the damage functions were specified as follows: γp = 0.1, γ0 = 0.001, and γc = 0.001, with a truncation radius of 1 in the central weighted damage.Fig. 4A illustrates the curve of bottleneck dimensionality C, showing a decreasing trend in model loss.A turning point is observed at C = 5, indicating that the network model has acquired sufficient information.Fig. 4B presents the loss curves for different cluster sizes (K), where the Y-axis represents the losses obtained for various cluster sizes.The curves reveal that loss gradually decreases with an increase in cluster size.Notably, the minimum loss value is observed at K = 5, indicating optimal model performance at this cluster size.Therefore, K = 5 is selected as the optimal cluster size for the dataset.

Performance metric results
In Fig. 5A, B, 5C and 5D, the model constructed in this study are compared with the classical CNN.The accuracy rates are 67.7 % and 88.6 % respectively, sensitivity rates are 70.4 % and 86.5 % respectively, and specificity rates are 68.5 % and 82.7 % respectively.The accuracy, sensitivity, and specificity of the model in this study are significantly higher than those of the CNN model (P < 0.05).

Dataset training curve
Fig. 6 illustrates the training history of the training set within the model.Initially, a declining trend in total loss is observed, followed by a stabilization phase.The cross-rotation consistency loss and overlap loss occupy a relatively smaller portion within the total loss, hence appearing towards the bottom of the network training learning curve.

Performance evaluation 3.4.1. FTIR analysis of four TCM samples
From Fig. 7A, B, 7C and 7D, they are evident that Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var.parvifolium exhibit distinct absorption peaks associated with carbon-oxygen stretching vibrations within the range of 1025~1200 cm − 1 .These peaks, characteristic of strong carbohydrate absorption, indicate a significant presence of cellulose and other carbohydrates in these four TCM samples, which belong to the same botanical family and genus.Additionally, absorption peaks at 3400 cm⁻ 1 were attributed to hydroxyl stretching vibrations.Although the FTIR spectra reveal some variations, they are not substantial.The absorption peak at 3750 cm⁻ 1 is ascribed to the stretching vibration of hydroxyl groups, indicating the presence of hydroxyl functional groups in these herbal samples.Such groups are commonly associated with glucose, carbohydrates, and other oxygen-containing organic compounds, suggesting the presence of polysaccharides.While specific absorption peak characteristics may vary among samples, the consistent presence of the hydroxyl stretching vibration absorption peak at 3750 cm⁻ 1 supports the presence of polysaccharide compounds in these samples.To further validate the authenticity of these TCMs, a model employing clustering algorithms and discrete wavelet transform for classification and authentication is used.In signal processing, DWT decomposes the signal into components of different frequencies, facilitating clustering analysis of these components to identify specific signal patterns.

FSD analysis results
Further analysis using FSD reveals the presence of both saturated and unsaturated C-H bonds within the absorption peak range of 2000-4000 cm⁻ 1 among the four types of TCM.Additionally, OH and NH2 groups are identified within this range.Fourier selfdeconvolution infrared spectra (FSD-IR) analysis is utilized to determine the phylogenetic relationships among the samples based on the FSD-IR results (Fig. 8A, B, 8C and 8D).

DWT results of four TCM samples
The FSD-IR method effectively distinguishes Bupleurum scorzonerifolium from Bupleurum smithii Wolff var.parvifolium, but encounters difficulties in differentiating Bupleurum yinchowense from Bupleurum marginatum.Further analysis involves employing the spectral characteristics of the DWT signals in conjunction with clustering algorithms to assess data, evaluating the efficacy of signal decomposition at various resolutions.Optimal wavelet bases are selected based on characteristic peaks from the original spectra to achieve improved signal smoothing.Five layers of ATR-FTIR spectra decomposition are obtained, excluding the first layer due to excessive noise and signal interference, with identification analysis focused on the remaining layers.The DWT data undergo five compression stages, reducing data volume by half after each stage, enhancing resolution, and revealing previously concealed infrared absorptions in the spectra.In layers 2, 3, and 4, the signals from the four samples exhibit similar shapes.However, significant differences are observed in the infrared spectra between Bupleurum yinchowense and Bupleurum marginatum within the 1000 cm − 1 ~1800 cm − 1 infrared data band range (Fig. 9A, B, 9C and 9D).This demonstrates that the FTIR method, based on DWT combined with clustering algorithms, accurately differentiates Bupleurum scorzonerifolium within the same botanical family and genus, showcasing promising advancements in the precise authentication of TCM.

Cluster analysis
There were observed differences among the ATR-FTIR spectra of the four types of Chinese herbal medicines.By utilizing a clustering Gaussian mixture model, the wavenumber and absorbance data were extracted for statistical clustering analysis.The results of the clustering analysis were then compared with the morphological classification results.A total of twelve ATR-FTIR spectra data points were selected, representing four different types of herbal samples, with three repetitions for each sample.The comparison involved typical absorption peak values and absorbance at different wavenumber ranges.Intelligent algorithms were employed to   analyze the data, and the absorption peak values obtained are presented in Table 1.

Cluster analysis tree spectrum
After undergoing analysis using the PAST statistical software package, the ATR-FTIR spectral data were processed through a series of intelligent algorithms, resulting in a dendrogram that displayed the phylogenetic relationships among Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var.parvifoliumv (see Fig. 10).The graph demonstrated the structural similarities among the four types of TCM, with a similarity exceeding 0.982.The post-grinding treatment maintained consistency in sample properties aligned with their taxonomic classification for these four medicinal herbs.This highlights how the combination of FTIR with clustering analysis accurately and scientifically differentiated Bupleurum scorzonerifolium from its counterparts, including Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var.parvifolium.

Table 1
The absorbance of the main absorption peaks of four TCM samples.

Discussion
This discussion provides further insights into the biological and chemical significance of the spectral differences observed among the four types of Bupleurum, facilitating a deeper understanding of their distinctions.
A more detailed analysis of the FTIR spectra and a comprehensive evaluation of how the observed absorption peaks affect the authenticity of Chinese herbal medicine would enrich the research.
The complex chemical composition and diverse pharmacological effects of Chinese herbal medicine present significant challenges in authentication and quality control.Modern analytical techniques have offered new perspectives on this issue.Among these, FTIR is widely used in herbal medicine analysis due to its operational simplicity, rapid analysis, and rich information content [35].Various methodologies have been employed by researchers to authenticate TCM.For instance, Tan et al. (2015) [36] developed a gas chromatography/mass spectrometry (GC/MS) method based on headspace solvent-free microextraction (HS-SFME) for detecting Angelica sinensis.Zhang et al. (2019) [37] used metabolomics to reveal quality differences between authentic and non-authentic medicinal materials through ultra-high-performance liquid chromatography-quadrupole time-of-flight mass spectrometry (UHPLC-Q-TOF-MS/MS).Liu et al. (2020) [38] applied electronic eye technology to authenticate the quality of Fritillaria cirrhosa.Yang et al. (2021) [39] used liquid chromatography combined with chemometrics to distinguish between Fritillaria unibracteata and Fritillaria ussuriensis, demonstrating its effectiveness in differentiating Lilium purpurea.These studies underscore the effectiveness of integrating scientific techniques with traditional authentication methods in identifying TCM.Clustering analysis, involving computer-assisted processing of drug data, results in various clustering patterns that aid in the categorization and identification of drugs.Many studies have demonstrated the efficacy of clustering analysis in authenticating TCM.Liu et al. (2023) [40] utilized principal component analysis and clustering analysis to differentiate between Fritillaria chuanensis and Siberian fritillary bulb, showing strong linear correlations within this method.Zhang et al. (2021) [41] employed principal component analysis within clustering analysis for quality control of traditional Chinese medicinal materials, achieving 97.5 % accuracy in the established function model.This study uses intelligent algorithm-based clustering analysis, with results indicating a similarity above 0.982 among the four types of TCM, confirming the efficacy of clustering analysis in authentication.This work integrates chemometrics with clustering analysis for TCM authentication, constructing a Gaussian mixture model for algorithm development and applying it to cluster data for the four types of TCM.After optimized network training, the network performance showed favorable outcomes.In terms of dataset parameter selection, the curve of the bottleneck dimension C indicates that the network model acquired sufficient information at C = 5.
The four types of Bupleurum may contain varying types or levels of chemical constituents, such as flavonoids, triterpene saponins, alkaloids, and others, which may influence the pharmacological effects of the herb.This warrants further investigation and discussion.
FTIR spectroscopy offers valuable structural information on complex TCM systems.Typically, the identification of medicinal herbs involves comparing the types and quantities of functional groups.Infrared spectroscopy facilitates the inference of specific structures corresponding to chemical constituents by analyzing the presentation of hydrogen in various environments.This method aids in distinguishing chemical samples in TCM, including oils and fats.
FTIR technology is a widely utilized analytical method that provides insights into the molecular structure and chemical composition of samples.In the authentication of Chinese herbal medicine, a thorough interpretation of FTIR spectra enhances the understanding of sample differences and assesses their impact on the medicine's authenticity.Detailed analysis of absorption peaks observed in FTIR spectra is crucial, as each peak represents the vibration or stretching of different chemical bonds or functional groups, thus offering information about the sample's composition and structure.By comparing the positions, intensities, and shapes of these absorption peaks across different samples, distinctions between them can be revealed, facilitating authenticity assessment.The chemical composition of the sample can be inferred from the characteristics of the absorption peaks.For instance, absorption peaks within specific wavenumber ranges may indicate the presence of particular chemical constituents such as hydroxyl groups, carbonyl groups, and aromatic compounds.Comparative analysis of absorption peaks helps in assessing the similarity and dissimilarity between samples, thereby aiding in authenticity determination.Factors influencing FTIR spectra, such as sample preparation methods, storage conditions, and sampling positions, must be considered as they may affect the spectra and lead to changes in observed absorption peaks.Therefore, comprehensive consideration of these factors is necessary when interpreting FTIR spectra, with appropriate adjustments and corrections made as needed.in this study, infrared spectroscopy analysis was performed on Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var.parvifolium.Combined with DWT and clustering algorithms, successful discrimination among these four herbal medicines was achieved.Within the 1025-1200 cm^-1 region, different peaks corresponding to carbon-oxygen bond stretching vibrations were observed, indicating the presence of carbohydrates in these samples.The intensity and position of these absorption peaks may relate to the chemical composition and structure of the samples.In the 2000-4000 cm^-1 range, absorption peaks associated with saturated and unsaturated C-H bonds, as well as functional groups such as OH and NH2, were observed, providing information about the samples' composition.Comparative analysis of infrared spectra revealed significant differences between the spectra of Bupleurum yinchowense and Bupleurum marginatum and those of other samples.These differences may reflect variations in chemical composition and biological activity among these herbal medicines.For example, different varieties may contain varying types or amounts of active ingredients or exhibit different metabolic pathways and growth environments.Sugars in medicinal herbs are particularly amenable to identification using infrared spectroscopy.This study employed DWT, FTIR, and FSD for detailed identification of Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var.parvifolium.Infrared spectroscopy analysis revealed distinct absorption peaks within specific energy bands for these TCM types, notably in the stretching vibrations of carbon-oxygen bonds and hydroxyl groups, indicating significant differences.The combination of DWT and clustering algorithms achieved successful differentiation of the four Bupleurum species, demonstrating high accuracy in herbal medicine identification.The comprehensive structural information analysis Y. Bai and H. Zhang Heliyon 10 (2024) e37479 showed a high similarity exceeding 0.982 among the four herbal medicines, highlighting the effectiveness of integrating chemometrics with intelligent clustering analysis for precise Traditional Chinese Medicine identification.In summary, the combination of infrared spectroscopy analysis, DWT, and clustering algorithms exhibits high accuracy in distinguishing Bupleurum scorzonerifolium, Bupleurum yinchowense, Bupleurum marginatum, and Bupleurum smithii Wolff var.parvifolium.These differences may be related to the chemical composition and structure of the samples, and further biological or chemical research can elucidate the specific reasons behind these differences, providing a deeper understanding and basis for the authenticity identification of Chinese herbal medicine.

Conclusions
In this study, the authentication of TCM was conducted using chemometrics combined with clustering analysis.An effective model was achieved through network training by optimizing cluster sizes and varying learning rates.Infrared spectroscopy analysis revealed that four types of TCM exhibited distinct absorption peaks within specific frequency ranges, and these types were accurately distinguished by the clustering algorithm.The integration of chemometrics with intelligent algorithms in clustering analysis provides a precise method for TCM authentication, enhancing the reliability and accuracy of authentication and quality control.This approach broadens the analytical techniques available for TCM authentication.

Research contribution
This study contributes methodologically by advancing TCM authentication techniques through a comprehensive analytical approach.By combining chemometrics with clustering analysis, this research offers an integrated examination of TCM component characteristics, providing a more detailed information set for authenticity determination.It introduces a novel approach to TCM authentication, demonstrating significant advantages over traditional methods due to its multi-index and comprehensive nature.This innovation promises to advance TCM authentication techniques, guiding the field toward a more scientific and systematic direction and offering the TCM industry more reliable quality assurance methods.

Future works and research limitations
Future research should focus on optimizing methodological details.Further refinement of the techniques integrating chemometrics with clustering analysis is necessary to improve the precision and sensitivity of the analyses.Expanding the scope of research samples will validate the applicability of this method across various TCM types and sources, ensuring its generalizability.Further exploration of the interrelationships among TCM components and their impact on authenticity identification will provide theoretical support for refining the method.
Limitations of this study include potential constraints within the sample set, which may not fully represent the complexity of constituent compositions in different TCM types.Expanding the sample range is essential.Additionally, the reliability and accuracy of data acquisition, crucial for the integration of chemometrics with clustering analysis, could impact research outcomes due to data quality issues.Given the relative novelty of this methodology, its stability and feasibility in practical applications require further validation through extensive practical testing.In conclusion, while this study introduces innovative perspectives on TCM authentication, further work is needed to optimize methodologies, expand sample sizes, and explore the underlying mechanisms involved.

Fig. 2 .
Fig. 2. Flowchart depicting the process of local structure identification and training strategy for the model.

Fig. 4 .
Fig. 4. Results of dataset parameter selection.Note: (A) Curve of bottleneck dimension C; (B) Curve of cluster size K.

Fig. 6 .
Fig. 6.Training curve of the dataset.(Note: Wp represents reconstruction loss; Wr represents weighted center reconstruction loss; Wn represents overlap loss; Wc represents crossrotation consistency loss; We represents regularization loss.).