Image dataset on the Chinese medicinal blossoms for classification through convolutional neural network

Tree blossoms have been widely used in the prevention and treatment of a variety of diseases in traditional Chinese medicine for thousands of years [1,2]. Flowers are grown not only for their ornamental value but also for their nutritional, medicinal, culinary, cosmetic, and aromatic properties. They are a rich source of many compounds that play an important role in various metabolic processes of the human body [3]. Edible flowers can help meet the global demand for more attractive and delicious food and can improve the nutritional content of gourmet food [4]. Flowers are reported to have anti-anxiety, anti-cancer, anti-inflammatory, antioxidant, diuretic, and immunomodulatory effects, among others. It is very important to identify edible flowers correctly, because only a few are edible [5]. The shapes or colors of different flowers may be very similar. Visual evaluation is one classification method, but it is error-prone and time-consuming [6]. Flowers are divided into those from herbaceous plants (flowers) and those from flowering trees (blossoms). A public dataset of herbaceous flowers exists [7], but no dataset is available for Chinese medicinal blossoms. This article presents a dataset of the twelve most commonly used and economically valuable blossoms in traditional Chinese medicine. The dataset provides a collection of blossom images of traditional Chinese herbs to help Chinese pharmacists classify the categories of Chinese herbs. In addition, the dataset can serve as a resource for researchers who apply machine learning or deep learning algorithms to image segmentation and image classification.


© 2021 The Author(s

Value of the Data
• The dataset provides a collection of blossom images of traditional Chinese herbs to help Chinese pharmacists classify the categories of Chinese herbs.
• This dataset can be used not only as a botanical atlas but also as training material for Chinese medicine courses.
• This dataset contributes to the expansion of available blossom images of traditional Chinese herbs.
• Blossom image data help researchers understand the performance of new algorithms for object detection and image segmentation.

Image acquisition
Of the 57 types of flowers used as Chinese herbal medicines, there are 12 trees, 9 shrubs, 8 small trees, and 29 herbs. This study establishes a dataset of the twelve most commonly used and economically valuable tree blossoms in traditional Chinese medicine. Blossom images were collected from sources such as public datasets, personal blogs, and government websites.

Image preprocessing
We preprocessed the blossom images by cropping out letters and frames, deleting images that contained handwriting or were blurred, centering the blossoms, and adjusting the image length and width. The number of images in each category is outlined as follows: (1)

Image partition
We collected a total of 1716 original images in twelve categories. Within each category, the images were randomly divided into training, validation, and test subsets at an 80:10:10 ratio. For example, the numbers of training, validation, and test images for Syringa are 153, 19, and 19, respectively. Across all categories, the training, validation, and test subsets contained 1376, 170, and 170 original images, respectively.
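The per-category partition described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_category`, the file names, the seed, and the rounding rule (round the training and validation counts, assign the remainder to the test subset) are all assumptions chosen so that a category of 191 images yields the 153/19/19 split reported for Syringa.

```python
import random

def split_category(paths, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle one category's image paths and split them into train/val/test."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)  # reproducible random order
    n = len(shuffled)
    n_train = round(n * train_frac)  # assumed rounding rule (Python's built-in round)
    n_val = round(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test subset
    return train, val, test

# Hypothetical file names standing in for the 191 Syringa images.
syringa = [f"syringa_{i:03d}.jpg" for i in range(191)]
train, val, test = split_category(syringa)
print(len(train), len(val), len(test))  # 153 19 19
```

Applying the same function to each of the twelve categories separately keeps the 80:10:10 ratio within every class rather than only across the pooled dataset.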

Image augmentation
Data augmentation creates image diversity to enhance the performance of classification models. There are many augmentation methods [9], and their benefits may vary with the augmentation method and the characteristics of the data. We selected Gaussian filtering, image brightness augmentation, image brightness reduction, mirror rotation, noise increase, 90° rotation, and 180° rotation methods; eight methods in total. Data augmentation was applied to the training and validation datasets, which increased the number of images eightfold. Fig. 3 shows an example of an original image and the images obtained after data augmentation. Table 1 presents the number of training, validation, and test images before and after data augmentation. Fig. 4 shows the architecture of the dataset.
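The named operations can be sketched with NumPy as shown below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the brightness factors (1.3 and 0.7), the Gaussian kernel width, the noise standard deviation, and the reading of "mirror rotation" as a horizontal flip are all assumed, and the helper names `gaussian_filter` and `augment` are hypothetical. Counting the unmodified image together with the seven named operations gives eight images per original, which matches the eightfold increase.

```python
import numpy as np

def gaussian_filter(img, sigma=1.0, radius=2):
    """Separable Gaussian blur of an H x W x C float image (edge-padded)."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()  # normalise so overall brightness is preserved
    h, w = img.shape[:2]
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Blur along the height axis, then along the width axis.
    tmp = sum(k[i] * padded[i:i + h, :, :] for i in range(2 * radius + 1))
    return sum(k[j] * tmp[:, j:j + w, :] for j in range(2 * radius + 1))

def augment(img, rng):
    """Return the original plus seven augmented variants of a uint8 H x W x C image."""
    f = img.astype(np.float64)
    variants = {
        "original": f,
        "gaussian_filter": gaussian_filter(f),
        "brighter": f * 1.3,                             # assumed brightness factor
        "darker": f * 0.7,                               # assumed brightness factor
        "mirror": f[:, ::-1, :],                         # assumed: horizontal flip
        "noise": f + rng.normal(0.0, 10.0, size=f.shape),  # assumed noise level
        "rot90": np.rot90(f, k=1, axes=(0, 1)),
        "rot180": np.rot90(f, k=2, axes=(0, 1)),
    }
    # Clip back to the valid pixel range and convert to uint8.
    return {name: np.clip(v, 0, 255).round().astype(np.uint8)
            for name, v in variants.items()}

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)  # synthetic stand-in image
augmented = augment(image, rng)
print(len(augmented), augmented["rot90"].shape)
```

Because augmentation is applied only to the training and validation subsets, the test images remain unmodified originals, so the reported test metrics are not influenced by augmented copies.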

Image classification
Convolutional neural network (CNN) models are the most commonly used models for image classification. We selected the AlexNet and InceptionV3 models to identify the categories of twelve traditional Chinese medicinal blossoms. Krizhevsky et al. [10] proposed the AlexNet model in 2012. The AlexNet architecture has eight layers: the first five are convolutional layers and the last three are fully connected layers. To improve computational efficiency, InceptionV3 uses techniques such as factorized convolutions, regularization, dimension reduction, and parallelized computations. Tables 2 and 3 show the results of these two classification models for the datasets before and after data augmentation. Before data augmentation, the accuracy, precision, recall, F1-score, and training time of AlexNet were 93.57%, 92.98%, 94.52%, 93.62%, and 0 h 1 min 17 s, respectively; those of InceptionV3 were 89.18%, 88.21%, 90.06%, 88.79%, and 0 h 8 min 14 s, respectively. After data augmentation, the accuracy, precision, recall, F1-score, and training time of AlexNet were 98.53%, 98.41%, 98.50%, 98.45%, and 0 h 9 min 26 s, respectively; those of InceptionV3 were 98.61%, 98.61%, 98.55%, 98.58%, and 1 h 5 min 51 s, respectively. Fig. 5 shows the training curves of the two models for the dataset before and after data augmentation.
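For readers who want to reproduce metrics of this kind, the sketch below computes accuracy, precision, recall, and F1-score from a confusion matrix. The text does not state how the per-class values were averaged across the twelve categories, so macro-averaging (the unweighted mean over classes) is an assumption here, and the small three-class matrix is made-up illustration data, not a result from this dataset.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of images of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Made-up three-class confusion matrix, for illustration only.
toy_confusion = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 0, 10],
]
metrics = macro_metrics(toy_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With macro-averaging, the F1-score is the mean of the per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall.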

Ethics Statement
This study did not involve experiments on humans or animals.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.