NB-TCM-CHM: Image dataset of the Chinese herbal medicine fruits and its application in classification through deep learning

Chinese herbal medicine (CHM) is integral to a traditional Chinese medicine (TCM) system. Accurately identifying Chinese herbal medicine is crucial for quality control and prescription compounding verification. However, with many Chinese herbal medicines and some with similar appearances but different therapeutic effects, achieving precise identification is a challenging task. Traditional manual identification methods have certain limitations, including labor-intensive, inefficient. Deep learning techniques for Chinese herbal medicine identification can enhance accuracy, improve efficiency and lower coats. However, few high-quality Chinese herbal medicine datasets are currently available for deep learning applications. To alleviate this problem, this study constructed a dataset (Dataset 1) containing 3,384 images of 20 common Chinese herbal medicine fruits through web crawling. All images are annotated by TCM experts, making them suitable for training and testing Chinese herbal medicine identification methods. Furthermore, this study establishes another dataset (Dataset 2) of 400 images by taking pictures using smartphones to provide materials for the practical efficacy evaluation of Chinese herbal medicine identification methods. The two datasets form a Ningbo Traditional Chinese Medicine Chinese Herb Medicine (NB-TCM-CHM) Dataset. In Dataset 1 and Dataset 2, each type of Chinese medicine herb is stored in a separate folder, with the folder named after its name. The dataset can be used to develop Chinese herbal medicine identification algorithms based on deep learning and evaluate the performance of Chinese herbal medicine identification methods.

a b s t r a c t Chinese herbal medicine (CHM) is integral to a traditional Chinese medicine (TCM) system.Accurately identifying Chinese herbal medicine is crucial for quality control and prescription compounding verification.However, with many Chinese herbal medicines and some with similar appearances but different therapeutic effects, achieving precise identification is a challenging task.Traditional manual identification methods have certain limitations, including laborintensive, inefficient.Deep learning techniques for Chinese herbal medicine identification can enhance accuracy, improve efficiency and lower coats.However, few high-quality Chinese herbal medicine datasets are currently available for deep learning applications.To alleviate this problem, this study constructed a dataset (Dataset 1) containing 3,384 images of 20 common Chinese herbal medicine fruits through web crawling.All images are annotated by TCM experts, making them suitable for training and testing Chinese herbal medicine identification methods.Furthermore, this study establishes another dataset (Dataset 2) of 400 images by taking pictures using smartphones to provide materials for the practical efficacy evaluation of Chinese herbal medicine identification methods.The two datasets form a Ningbo Tradi-tional Chinese Medicine Chinese Herb Medicine (NB-TCM-CHM) Dataset.In Dataset

Background
Chinese herbal medicine is an essential part of TCM, and the correct use of Chinese herbal medicine, including identification and categorization, is crucial to the life safety of patients [1] .In recent years, researchers have gradually used deep learning to recognize Chinese herbal medicines due to significant breakthroughs in image recognition and analysis.Wang et al. [2] constructed a deep convolutional neural network combining the channel and spatial attention modules for the recognition of Chinese herbal medicine.Gang et al. [3] designed a lightweight convolutional neural network for the identification of Chinese herbal medicines and achieved good classification accuracy.Xu et al. [4] used multiple attentional pyramid networks to recognize Chinese herbs.However, although some studies have used deep learning methods to recognize Chinese herbal medicines, the datasets used in these researches are self-built, and the relevant datasets are private.Therefore, comparing the performance differences between these methods on the same dataset is difficult.Meanwhile, the public dataset is an essential factor that promotes the development of related research, so establishing a public dataset of Chinese herbal medicine is significant.

Data Description
This study has constructed two datasets containing 20 types of Chinese herbal medicine fruits for research on the automatical identification of Chinese herbal medicine.The first dataset (Dataset 1) was obtained through Google search and consisted of 3384 images.The second dataset (Dataset 2) was acquired through taking pictures of Chinese herbal medicine using a smart phone, totaling 400 images.
In addition, images of the above 20 types of Chinese herbal medicine were captured using smartphones, totaling 400 images.Each type of herb has 20 images, forming Dataset 2. After taking all the images, we screened them and removed unclear images.Table 1 shows examples and quantities of each class of data.
The two datasets form a Ningbo Traditional Chinese Medicine Chinese Herb Medicine (NB-TCM-CHM) Dataset.Within both Dataset 1 and Dataset 2, there are 20 folders each.Each type of Chinese herbal medicine is stored in a separate folder, and the folder is named after the herb's name.

Experimental Design, Materials and Methods
Accurate identification and utilization of Chinese herbal medicine are crucial for ensuring the efficacy and safety of traditional Chinese medicine treatments [ 2 , 5 , 6 ].In recent years, deep learning has achieved outstanding results in image processing.Using deep learning methods can automate the identification of Chinese herbal medicine, thereby improving the accuracy of recognizing Chinese herbal medicine.However, publicly available data are currently scarce.Therefore, we have established a Chinese herbal medicine fruits dataset for training and evaluating deep learning models.
In Fig. 1 , we show the processing and usage methods of the dataset, which consists of four parts: image acquisition, data preprocessing, model training using Dataset 1, and model test.The image data is obtained through Google search and camera capture.Specifically, firstly, Dataset 1 was obtained through a Google search, and the "Usage rights'' search tool was used to collect these images during the search.These images are publicly accessible and can be collected.The process was automated using web crawler technology, and after the initial collection, the images were screened by a professional Chinese medicine practitioner to remove the wrong images.Finally, a total of 3384 images of herbal medicines were obtained.A smartphone was used to   photograph the herbs when collecting the raw data of Dataset 2. White A4 paper was chosen as the background to minimize interference and enhance the visibility of the subject (Chinese herbal medicines), and the phone was placed on a stand with LED lights for fill light, keeping the light intensity constant during the shooting process.After the shooting, the images were screened, and the unclear images were removed.Finally, a total of 400 images of Chinese herbs were obtained.After obtaining the images, preprocessing operations are performed on them.Specifically, firstly, the images were screened to remove unclear images.Subsequently, redundant parts of the images were cropped out to remove the interference of background information.Then, the data images obtained through Google search were set as Dataset 1, and the images taken using a smartphone were designated as Dataset 2. Finally, Dataset 1 was divided into training, validation, and test sets.The test set consists of 20 images randomly selected from each category of images.Dataset 2 is all used to test the model performance.
Data enhancement is an essential data preprocessing method.Data augmentation aims to expand the size of the training dataset by making minor changes to existing data [7] .This is an essential step in the deep learning model training process, which can alleviate the problem of insufficient training data, thus reducing the overfitting phenomenon in deep learning model training and improving the neural network's performance.Three data enhancement methods were used in this study: random flipping, random cropping, and rotation.First, the images were uniformly resized to 128 × 128.Then, the images were flipped horizontally on the X and Y axes with a probability of 0.5, respectively.Finally, the image was randomly rotated using the center of the image as the rotation axis.In this experiment, a dynamic data enhancement technique is used.The images are randomly transformed before they are fed into the network for training.
Convolutional Neural Networks (CNN) are among the most common methods in deep learning and have achieved excellent results in image classification.To examine the potential of the dataset in training deep learning models, we used eight classic deep learning models to classify these 20 types of Chinese herbal medicine fruits.These eight deep learning models are VGG16 [8] , ResNet18 [9] , InceptionV3 [10] , ShuffleNetV2 [11] , SqueezeNet [12] , MobileNetV2 [13] , MobileNetV3 [14] , and MnasNet [15] .We used four indicators to evaluate the classification performance of the models: AUC, F1-Score, Recall, and Accuracy.Table 2 shows their classification results.
This study used the PyTorch 2.1 deep learning framework to implement these eight deep learning models.A computer with an Intel i9-10900X CPU Central Processing Unit (CPU) and NVIDIA RTX3090 Graphics Processing Unit (GPU) configuration was used to complete the training and testing of all models.For all classification models, the input image size was set to (128 ×128 ×3).During the training process, the initial learning rate was set to 0.0 0 01, the batch size was 64, the epoch was 200, and the optimizer for model training used Adam's algorithm.In addition, Pytorch's ReduceLROnPlateau method was used to adjust the learning rate during training.After completing the training of the models, the model weights that performed best on the validation set were saved for testing.The above settings were kept constant while training all models.
As can be seen from Table 2 , among all the models, MobileNetV3 achieved the best classification performance with an AUC of 0.9973, F1 of 0.9658, Recall of 0.9664, and Accuracy of 0.9710.The recognition rate of InceptionV3 was the lowest, with an Accuracy of 0.9324.In addition, we also used Dataset 2 as a test set to evaluate the above-trained deep learning models, and Dataset 1 was used as a training set.Table 3 displays the results of this test.As can be seen from Table 3 , MobileNetV2 achieved the best classification performance with an Accuracy of 0.8975.At the same time, all the models achieved satisfactory results, with the accuracy of all models exceeding 80 %, except for SqueezeNet.There have been some studies focusing on Chinese herbal medicine datasets.For example, Huang et al. [16] constructed a dataset of floral images of Chinese herbal medicines, which can be used to train deep learning models.The dataset contains 12 common floral Chinese herbal medicines, with 1716 original images.Fang et al. [17] built a high-throughput experimental and reference-guided database for Chinese herbal medicines, which can promote the scientific research and drug development of Chinese medicines.Xue et al. [18] constructed a comprehensive Chinese herbal medicines database to connect TCM with modern Western medicine, especially at the molecular level, to understand the mechanism of action of TCM.Although some studies have been conducted to establish databases of Chinese herbal medicines, image datasets of TCM that can be used for deep learning are still scarce.In this study, image data of 20 common fruitbased TCMs were collected, with a total of 3784 images (Dataset 1: 3384, Dataset 2: 400).It was experimentally confirmed that this data could be used for deep learning model training as well as testing.This data is significant for the automatic identification of Chinese herbal medicine.

Limitations
We have constructed a dataset of Chinese herbal medicine fruits, comprising 20 common types of herbs, suitable for the training and evaluation of deep learning models.However, the dataset still has some limitations.Firstly, the number of images for each type of herb in the dataset is inconsistent, which might pose challenges to the model's training.Secondly, while there is a wide variety of Chinese herbal medicine, we only selected 20 common Chinese herbal medicine fruits.We will continue to establish Chinese herbal medicine datasets in the future, gradually increasing the variety and volume of herbal data.

Ethics Statement
This study did not conduct experiments involving humans and animals.To collect Google data, we used the "Usage rights" search tool to collect these images.These images are all publicly accessible and can be collected.Furthermore, we applied edits such as cropping to research needs for every image we acquired.After these modifications, the images now differ significantly from their original forms.The authors confirm that the provided data set and presented work strictly meet the ethics requirements for publication in Data in Brief as mentioned in https://www.elsevier.com/authors/journal-authors/policies-and-ethics .
1 and Dataset 2, each type of Chinese medicine herb is stored in a separate folder, with the folder named after its name.The dataset can be used to develop Chinese herbal medicine identification algorithms based on deep learning and evaluate the performance of Chinese herbal medicine identification methods.
First, images of each type of Chinese herbal medicine are searched and downloaded via Google.Then, they are cropped to remove redundant information and unclear images are deleted.Finally, all the images are screened and annotated by professional TCM pharmacists.In addition, we also captured images using smartphones and filtered them, removing any that were unclear.

Table 1
Examples of 20 types of Chinese herb medicine fruits.

Table 1 (
continued ) ( continued on next page )

Table 2
The classification results of different deep learning models on Dataset 1.

Table 3
Test results on Dataset 2 using deep learning models trained on Dataset 1.