EfficientMaize: A Lightweight Dataset for Maize Classification on Resource-Constrained Devices

Hyperspectral imaging, combined with deep learning techniques, has been employed to classify maize. However, these automated methods often require substantial processing and computing resources, which makes deployment on embedded devices challenging because of high GPU power consumption. Data on local Ghanaian maize for such classification tasks is also extremely difficult to obtain. To address these challenges, this research creates a simple dataset comprising three distinct types of local maize seeds in Ghana. The goal is to facilitate the development of an efficient maize classification tool that minimizes computational costs and reduces human involvement in grading seeds for marketing and production. The dataset is presented in two parts. The raw part consists of 4,846 images categorized as bad or good: 2,211 images belong to the bad class and 2,635 to the good class. The augmented part consists of 28,910 images, with 13,250 in the bad class and 15,660 in the good class. All images have been validated by experts from Heritage Seeds Ghana and are freely available for use within the research community.


Subject: Applied Machine Learning
Specific subject area: Seed Classification
Data format: Raw, Augmented
Type of data: Maize seed images
Data collection: The dataset consists of three types of maize seeds, namely Wang Dataa, Sanzal Sima, and Bihilifa, all obtained from Heritage Seeds Ghana. These maize varieties are commonly grown in the northern region of Ghana. At the collection point, Heritage Seeds Ghana manually sorted and labelled the images as either good or bad for each of the three varieties. The images were captured using a 12-megapixel phone camera, and the original JPEG images had varying dimensions. To ensure uniformity and clarity, a blue background was used during the image capture process. Finally, the images were organized into their respective classes of good and bad.

Value of the Data
• The dataset comprises a total of 28,910 augmented images and 4,846 raw images, encompassing three varieties of maize, each categorized into two classes: good and bad.
• It includes maize seeds suitable for production and those unsuitable for crop yield or planting.
• This dataset proves valuable for developing applications focused on maize classification, detection, and recognition.
• It is particularly beneficial for training, testing, and validating high-quality maize seeds to enhance crop yield and for constructing classification and identification models.
• The dataset serves as a valuable resource for creating a maize classification tool optimized for resource-constrained devices.
• It will facilitate the development of an application dedicated to classifying, identifying, and detecting maize seeds. This application can be utilized by farmers, agricultural extension officers, the Ministry of Food and Agriculture (MoFA), and other relevant agencies.

Data Description
The bedrock of human existence, and indeed the very survival of humanity, is agriculture. With over 7 billion people already inhabiting the planet, the next few decades will see a further 2.5 billion individuals added to the global population, most of whom will live in cities in Asia and Africa [2]. Agriculture plays a decisive role in the continued existence of humankind. The effective and precise management of agricultural processes and resources will determine the welfare and prosperity of people across the world, securing a sustainable future for generations to come [3]. Together with air and water, food is a fundamental prerequisite for human survival. Nonetheless, the risk of food scarcity looms as agricultural land is lost to urbanization and the global population grows [3]. According to the Food and Agriculture Organization (FAO) of the United Nations, smart agriculture, also known as precision agriculture, is the use of advanced technologies such as sensors, drones, robotics, and artificial intelligence (AI) to optimize agricultural practices [4]. Maximizing crop yield, reducing costs, and maintaining healthy ecosystems are among the primary objectives of agricultural production.

Maize (Zea mays L.) is a crucial crop that is produced and consumed by most farming households in Ghana, according to the Ministry of Food and Agriculture [5]. According to Heritage Seeds Ghana, accurate classification of maize seed is essential for cultivation and marketing purposes. Hyperspectral imaging with deep learning has been employed for maize classification, supporting quality production, seed quality control, and impurity identification. Zhou et al., for instance, utilized LesNet-5 to improve maize classification [6]. Sang et al. [7] introduced a lightweight Convolutional Neural Network (CNN) architecture based on the MobileNetV2 accelerator for real-time seed sorting. The MobileNetV2 [8] model showed high accuracy, reaching 97.91% for red kidney bean classification and 96.50% for maize seed classification, demonstrating its effectiveness for accurate and efficient real-time seed sorting. Xu et al. [9] presented an enhanced CNN architecture that combined deep learning and machine vision techniques for classifying maize seeds of different varieties. The improved architecture achieved an average classification accuracy of 99.70% on a dataset of 8,080 maize seeds from five varieties. Wang and Su [10] presented a detailed exploration of CNN models integrated with computer vision (CV) techniques for detecting phenotypes in grain crops. Qiu et al. [11] investigated hyperspectral imaging with a CNN for rice seed variety identification and compared the CNN's performance against K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) models on the same task. G. Qiu et al. (2019) explored Fourier Transform Near-Infrared (FT-NIR) spectroscopy combined with discriminant analyses as a rapid and nondestructive method for classifying sweet corn seed cultivars [12]. Bai et al. [13] studied the classification of eight maize seed varieties, including common and silage maize, using SVM and Radial Basis Function Neural Network (RBFNN) models. The models achieved accuracies of over 86% for the direct classification of all eight varieties and over 88% for distinguishing between common and silage maize seeds. The classification accuracy surpassed 98% for silage maize seeds and exceeded 97% for common maize seeds.

Despite these recent improvements, maize seeds are still manually graded and sorted in Ghana. Manual grading and sorting is a labor-intensive process that requires skilled workers who can differentiate between good and bad seeds [14]. The data source, number of variables, variable types, and a sample of raw images of the various datasets used in the works reviewed are summarized in Table 1.
The dataset associated with this work contains raw (4,846 images) and augmented (28,910 images) color images with a blue background and two classes. The raw images vary in size, for example 76 × 76, 73 × 73, 102 × 102, 75 × 75, 104 × 104, and 64 × 64. The augmented images are resized to 128 × 128. The parameters used to augment the raw data are outlined in Table 3. This paper provides a dataset for deep learning classification, detection, and recognition tasks for single and multiple models. High-resolution images captured under ideal lighting can improve model performance but also increase computational cost. Yet an appropriate choice of image resolution and model definition can improve neural network performance for various image processing tasks. The raw and augmented images are each presented in two folders, good and bad. The raw images can be downloaded as a 6.72 MB file, EfficientMaize. The augmented dataset can be downloaded as a 51.1 MB file, Augmented EfficientMaize. The bad folder contains images of bad maize seeds, while the good folder contains images of quality maize seeds for production and feeding. Figure 1 depicts the images captured before cropping and labeling.
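The two-folder layout described above lends itself to a quick sanity check after download. The following is a minimal sketch, assuming the archive has been extracted to a local directory containing `good` and `bad` subfolders; the path `EfficientMaize`, the helper name `count_images_per_class`, and the set of file extensions are illustrative assumptions, not part of the published dataset description.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the extracted files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root, classes=("good", "bad")):
    """Return {class_name: number of image files} for class folders under root."""
    root = Path(root)
    counts = {}
    for name in classes:
        folder = root / name
        if not folder.is_dir():
            # A missing class folder is reported as zero images.
            counts[name] = 0
            continue
        counts[name] = sum(
            1
            for path in folder.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    # "EfficientMaize" is a hypothetical local path to the extracted raw archive.
    print(count_images_per_class("EfficientMaize"))
```

For the raw archive, the counts returned should match the class totals reported in this article (2,211 bad and 2,635 good) if the download is complete and the assumed extensions cover all files.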
Table 2 shows the specification of the camera used to capture the dataset. To train a deep learning model that identifies images under resource-constrained conditions, low-quality images are used as input. To achieve this, a 12-megapixel camera was employed to capture the images. To prepare the images for further processing, they were resized to a standard size of 640 × 480 pixels using a Python script. We used the Image Extractor tool to crop the grouped images into their respective classes. The resulting cropped images were deemed satisfactory in terms of quality and suitability for model building. The next step involved labeling the images based on their quality, distinguishing between good and bad seeds. This labeling process was also carried out with the Image Extractor tool. The labeled images were then organized into separate datasets for the good and bad seed categories. Before image capture, the seeds had already been grouped as good or bad by experts from Heritage Seeds Ghana. These experts also verified and labeled the images over 2 weeks to ensure their validity and accuracy. Table 4 provides details on the images, such as the total number of images in each class, image size, and background.
Based on the distributions, we can observe that in both the raw and the augmented data, about 46% of the images belong to the bad class, while the remaining 54% belong to the good class. This distribution indicates that the dataset is relatively balanced between the two classes, with a slightly higher representation of good maize seeds than bad seeds. The directory structure of both the raw and augmented data is shown in Figure 5.
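The class shares quoted above follow directly from the image counts reported in this article. As a small worked check, the sketch below recomputes them; the dictionary and function names are illustrative, and only the four counts come from the text.

```python
# Image counts per class as reported in this article.
RAW = {"bad": 2211, "good": 2635}
AUGMENTED = {"bad": 13250, "good": 15660}


def class_shares(counts):
    """Return each class's share of the total as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


for label, counts in (("raw", RAW), ("augmented", AUGMENTED)):
    shares = class_shares(counts)
    summary = ", ".join(f"{name}: {share:.1f}%" for name, share in shares.items())
    print(f"{label} (n={sum(counts.values())}): {summary}")
```

To one decimal place the bad class accounts for 45.6% of the raw images and 45.8% of the augmented images, which rounds to the 46% and 54% split stated above.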

Experimental Design, Materials and Methods
Figure 7 shows the image data acquisition process. The image processing procedure begins with the collection of raw data, which includes three types of maize seeds commonly planted in the northern region of Ghana: Wang Dataa, Sanzal Sima, and Bihilifa, categorized as either good or bad. The raw images were captured using an iPhone 11 Pro Max with the setup shown in Figure 6. These images were taken in groups on an A4 sheet with a blue background, as indicated in Figure 6 and Figure 1, to ensure consistency and clarity. Capture took place during the daytime, without particular attention to lighting conditions. Subsequently, a Python script resized the images to a standard size of 640 by 480 pixels, which was required by the Image Extractor tool for further cropping and labeling. The Image Extractor tool was employed to crop individual seed images from the grouped ones, and the resulting cropped images were deemed suitable for model building in terms of quality. Next, the images were labeled based on their quality, distinguishing between good and bad seeds. This labeling process used the Image Extractor tool, with the assistance of a specialist from Heritage Seeds Ghana. The labeled images were then organized into separate folders for the good and bad seed categories. The good category represented high-quality maize seeds in terms of yield, while the bad category comprised damaged, infected, or low-quality seeds collected from the headquarters of Heritage Seeds Ghana. The resulting cropped images had varying sizes. Preprocessing steps included image cropping to focus on the regions of interest (ROI). The dataset was acquired during the maize harvest season in Ghana, from December 2022 to January 2023. Captures were made daily during daylight hours within this period. Table 5 provides the timeline details of the dataset acquisition process. The folder structure of the images is shown in Figure 5. The dataset consists of only good and bad maize seeds.
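The final organization step, placing labeled crops into good and bad folders, can be sketched as follows. This is an illustrative sketch only and not the authors' Image Extractor tool: the function name `organize_by_label`, the label mapping, and the directory names are hypothetical, and it assumes the labels are available as a filename-to-label mapping.

```python
import shutil
from pathlib import Path

VALID_LABELS = {"good", "bad"}


def organize_by_label(labels, source_dir, target_dir):
    """Copy each labeled file into target_dir/<label>/ and return the number copied.

    labels maps a filename in source_dir to "good" or "bad" (hypothetical format).
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = 0
    for filename, label in labels.items():
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label {label!r} for {filename}")
        destination = target_dir / label
        destination.mkdir(parents=True, exist_ok=True)
        # copy2 keeps file metadata such as the modification time.
        shutil.copy2(source_dir / filename, destination / filename)
        copied += 1
    return copied
```

A call such as `organize_by_label({"seed_001.jpg": "good"}, "crops", "dataset")` would copy the hypothetical file `crops/seed_001.jpg` to `dataset/good/seed_001.jpg`. Copying rather than moving keeps the unlabeled crops intact in case a label needs to be revised after expert review.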

Limitations
Using different backgrounds would be ideal for testing all scenarios, since the aim is to classify images with low processing and computing power. However, other backgrounds blurred the images and obscured the differences between good and bad seeds, making them difficult to identify in the captured images. Cropping also left a few images with a non-uniform background.

Figure 1. Grouped capture of maize seeds for the three varieties.

Figure 2 (a) and (b) depict the images labeled and grouped into their respective classes as good and bad. The total distribution of the raw and augmented data is depicted in Figures 3 and 4.

Figure 2. Individual seeds after cropping and labelling as good and bad.

Figure 3. Class distribution for the raw dataset.

Figure 4. Class distribution for the augmented dataset.

Table 1
Comparison of previous works.

Table 2
Camera specification table.

Table 3
Data Augmentation Parameters.

Table 4
Details of lightweight maize Dataset.