CocoaMFDB: A dataset of cocoa pod maturity and families in an uncontrolled environment in Côte d'Ivoire

Cocoa cultivation is the basis for chocolate production; cocoa has a unique aroma that makes it useful in the production of snacks and in cooking or baking. The main cocoa harvest normally occurs once or twice a year and is spread over several months, depending on the country. Determining the best harvesting period for cocoa pods plays a major role in the export process and in pod quality. The degree of ripeness of the pods affects the quality of the resulting beans. Unripe pods do not have enough sugar, which may prevent proper bean fermentation. Overripe pods are usually dry, and their beans may germinate inside the pods or develop a fungal disease that makes them unusable. Computer-based determination of cocoa pod ripeness through image analysis could facilitate large-scale ripeness detection. Recent advances in computing power, communication systems, and machine learning techniques give agricultural engineers and computer scientists opportunities to meet the demands of manual assessment. Diverse and representative sets of pod images are essential for developing and testing automatic cocoa pod maturity detection systems. To this end, we collected images of cocoa pods to build a database of cocoa pods from Côte d'Ivoire named CocoaMFDB. Because lighting was not controlled during acquisition, we applied a pre-processing step based on the CLAHE algorithm to improve the quality of the images. CocoaMFDB supports the characterization of cocoa pods by maturity level and provides the pod family for each image. The dataset comprises three large families, namely Amelonado, Angoleta, and Guiana, grouped into two maturity categories: ripe and unripe pods. It is therefore well suited to developing and evaluating image analysis algorithms in future research.



Value of the Data
• The cocoa pod images can be used to identify cocoa types and varieties.
• The data are valuable for research that studies the link between a cocoa pod and its beans.
• The data provide a baseline for machine learning and visualization work that extracts features for identifying and recognizing cocoa pod families.
• Classifying the maturity state of cocoa pods makes harvest optimization practicable.
• Segmenting cocoa pods would allow pods to be located automatically in a given environment.

Objective
Côte d'Ivoire is the world's leading cocoa producer, and cocoa farming represents 14% of its GDP. To obtain better-quality cocoa beans, the harvesting period must be chosen according to the ripeness of the pods. Unripe pods do not have enough sugar for proper bean fermentation. In overripe pods, dryness and germination generally lead to fungal infection of the beans and make them unusable. This dataset would allow automatic systems for harvest optimization to be set up.

Data Description
Cocoa pods were photographed with Nikon D500 and Infinix cameras in a cocoa plantation located in the village of Yakassé 1, a few kilometers from Grand Bassam in Côte d'Ivoire. The images were taken at different times of the day, in the morning between 9:30 a.m. and 11 a.m. and in the afternoon between 2 p.m. and 4 p.m., from different angles in an uncontrolled environment. The images of the pods were classified by category and family. The categories are ripe cocoa pods and unripe cocoa pods. Each category contains three main families: Amelonado, Angoleta, and Guiana.

• Amelonado
Amelonado comes from the Spanish word for melon-shaped. The fruits look like elongated melons, usually with a slight bottleneck. They have thick, usually smooth shells, rarely with some warts. The fruit has shallow grooves and a rounded tip. Amelonado is a type of Theobroma cacao [1].

• Angoleta
Angoleta cocoa is a derivative of Trinitario cocoa, which results from a cross between Criollo and Forastero. Its shape is almost identical to that of a Criollo: the surface is very rough, there is no bottleneck, and the fruit is large, with round seeds and grooves in the light purple endosperm. It is of superior quality [2]. Fig. 2 shows images of Angoleta cocoa pods arranged in two rows. The first row contains mature Angoleta pods and the second row contains unmature Angoleta pods.

• Guiana
Guiana is a little-known cocoa variety endemic to French Guiana. Guiana cocoa differs from other cocoa trees in characteristics such as marked furrows and the verrucosity of its surface [3].

• Data directory structure
The dataset directory is structured as follows. The parent directory is named CocoaMFDB. It contains two sub-directories named mature (ripe) and unmature (unripe). Each of these folders contains the Amelonado, Angoleta, and Guiana directories. Fig. 4 shows the directory structure of the dataset.

• Data summary
Table 1 presents a summary of the cocoa pod images according to their type and maturity status.
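The directory layout described above lends itself to a simple per-class image count, which can be checked against Table 1. The following is a minimal sketch, assuming the images sit two levels below the parent directory (maturity folder, then family folder); the function name `count_images` and the set of file extensions are illustrative assumptions, not part of the published dataset.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Tuple

# Assumed image extensions; the source does not state the file format.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(root: Path) -> Dict[Tuple[str, str], int]:
    """Count images per (maturity folder, family folder) under `root`.

    Expects the layout root/<maturity>/<family>/<image files>.
    """
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        parts = path.relative_to(root).parts
        if len(parts) >= 3:  # maturity / family / file name
            counts[(parts[0], parts[1])] += 1
    return dict(counts)
```

Calling `count_images(Path("CocoaMFDB"))` on a local copy would return a dictionary keyed by whatever folder names the download actually uses, so the sketch does not depend on the exact spelling of the maturity folders.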

• Cocoa pods collection
Image collection took place in Yakassé 1, Grand Bassam, on May 26, 2022, and August 18, 2022. The images were captured from different angles in an uncontrolled environment. Acquisition conditions included varying weather (wind, rain, fog, sun) and changing lighting (brightness, color, and direction of light), with shadows and reflections present. The images contain stems, leaves, and cocoa pods, some of which are occluded.
• Data pre-processing
Contrast Limited Adaptive Histogram Equalization (CLAHE) is a variant of Adaptive Histogram Equalization (AHE) that enhances the contrast of digital images while limiting contrast amplification. CLAHE operates on small regions of the image, called tiles, rather than on the entire image. Neighboring tiles are combined using bilinear interpolation to remove artificial boundaries [4]. Fig. 6 presents the general methodology of the quality improvement process applied to the dataset. In this step, the RGB color images are first converted to Lab color images. The Lab color space separates lightness (L, from black to white) from two chromatic components (a, red-green, and b, yellow-blue). It is derived from the CIEXYZ color space. L takes values between 0 and 100, representing black to white, and the red-green and yellow-blue components each take values between -128 and 128 [5].
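The two ingredients of this step, conversion to Lab and contrast-limited equalization of the L channel, can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes sRGB input with a D65 reference white (the source does not name the illuminant), and `equalize_tile` covers only the clipped equalization of a single tile, leaving out the tiling of the image and the bilinear interpolation between tiles. The names `srgb_to_lab`, `equalize_tile`, and `clip_limit` are illustrative.

```python
from typing import List, Sequence, Tuple

# D65 reference white; an assumption, the source does not name the illuminant.
WHITE_X, WHITE_Y, WHITE_Z = 0.95047, 1.0, 1.08883


def _linearize(channel_8bit: int) -> float:
    """Undo the sRGB gamma curve for one 8-bit channel."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t: float) -> float:
    """Cube-root compression used by CIE Lab, with its linear part near zero."""
    delta = 6.0 / 29.0
    if t > delta ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * delta ** 2) + 4.0 / 29.0


def srgb_to_lab(r: int, g: int, b: int) -> Tuple[float, float, float]:
    """Convert one sRGB pixel (0-255 per channel) to CIE Lab via CIEXYZ."""
    rl, gl, bl = _linearize(r), _linearize(g), _linearize(b)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx, fy, fz = _lab_f(x / WHITE_X), _lab_f(y / WHITE_Y), _lab_f(z / WHITE_Z)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def equalize_tile(l_values: Sequence[float], clip_limit: int) -> List[float]:
    """Clip-limited histogram equalization of one tile of L values in [0, 100]."""
    if not l_values:
        return []
    bins = 101  # one bin per integer lightness level
    indices = [min(bins - 1, max(0, round(v))) for v in l_values]
    hist = [0.0] * bins
    for i in indices:
        hist[i] += 1.0
    # Clip each bin, then spread the clipped excess evenly over all bins.
    excess = sum(max(0.0, h - clip_limit) for h in hist)
    hist = [min(h, clip_limit) + excess / bins for h in hist]
    # The cumulative histogram, scaled to [0, 100], is the tile's mapping.
    mapping: List[float] = []
    running, total = 0.0, float(len(indices))
    for h in hist:
        running += h
        mapping.append(100.0 * running / total)
    return [mapping[i] for i in indices]
```

Lowering `clip_limit` flattens the mapping, which is what keeps noise in nearly uniform regions from being over-amplified. In practice an image library would normally be used for both operations; OpenCV, for example, provides `cv2.cvtColor` and `cv2.createCLAHE`, although the source does not say which implementation was used.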
The conversion from RGB color space to Lab was done using Eq. (3). Fig. 7 shows the conversion of the images from RGB color space to Lab.

- Step 4: We then applied the CLAHE algorithm, which follows these five steps:
• Divide the image into small regions.
• Determine the mapping functions of the local histogram.
• Determine the clipping point of the histogram.
• Apply the function to each region.
• Finally, reduce the noise using the background subtraction method.

Fig. 9 shows the result obtained after applying the CLAHE algorithm to the L channel of the image.

• Labeling
Data labeling is the process of adding tags or labels to raw data such as images, videos, text, and audio. These tags represent the class of objects to which the data belong and help a machine learning model learn to identify that class when it is encountered in unlabeled data. The Labeling tool is used to annotate the positions of cocoa pods in each image. Annotations follow the format of the PASCAL VOC 2007 dataset, which consists of labeled images with accompanying annotation files. The tool automatically generates the corresponding XML annotation files [6, 7]. Fig. 12 shows an image and the generated file.
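Annotation files in the PASCAL VOC format can be read with Python's standard XML parser. The sketch below assumes the usual VOC layout, in which each `object` element holds a `name` and a `bndbox` with `xmin`, `ymin`, `xmax`, and `ymax`. The sample annotation, its file name, and the label `cocoa_pod` are hypothetical and are not taken from CocoaMFDB.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Box(NamedTuple):
    """One labeled bounding box, in pixel coordinates."""
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def parse_voc(xml_text: str) -> List[Box]:
    """Extract every labeled bounding box from a PASCAL VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes: List[Box] = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # skip objects that carry no box
        coords = [int(float(bndbox.findtext(tag)))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(Box(obj.findtext("name", default=""), *coords))
    return boxes


# Hypothetical annotation, for illustration only.
SAMPLE = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>cocoa_pod</name>
    <bndbox>
      <xmin>120</xmin><ymin>85</ymin><xmax>340</xmax><ymax>410</ymax>
    </bndbox>
  </object>
</annotation>
"""
```

Here `parse_voc(SAMPLE)` yields a single `Box` with the label and the four corner coordinates, which is the form most detection pipelines expect as input.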

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that have, or could be perceived to have, influenced the work reported in this article.

Ethics Statements
The study does not involve experiments on humans or animals.