Automated detection of microfossil fish teeth from slide images using combined deep learning models

ABSTRACT

The accumulation of rare-earth elements and yttrium (REY) in pelagic clay was caused by changes in bioproductivity and ocean circulation, which reflect changes in the Earth's climate system (Ohta et al., 2020). This indicates that examining environmental changes recorded in pelagic clay is essential for understanding the genesis and distribution of industrially critical metal resources, emphasizing the increasing importance of analyzing pelagic clay.
Depositional age is key information for understanding the depositional environments of seafloor sediment, because these environments have been affected by secular changes in global climate (Westerhold et al., 2020; Zachos et al., 2008) and by plate motion over geologic timescales (Müller et al., 2018). However, the calcareous and siliceous microfossils that have commonly been used to constrain the depositional ages of seafloor sediment are not found in pelagic clay, owing to their dissolution under the carbonate- and silica-undersaturated conditions of the deep-sea environment. This has hampered examination of the depositional environment and exploration of the origin and controls on the distribution of deep-sea resources.
In contrast, fish teeth and denticles, known as ichthyoliths, are well preserved in almost all kinds of seafloor sediment because they are composed of calcium phosphate, which is not easily dissolved (Sibert et al., 2014). Therefore, ichthyoliths have been used as a key for constraining the depositional age of pelagic clay (Doyle et al., 1974; Doyle and Riedel, 1979, 1985; Ohta et al., 2020). In addition, ichthyoliths have recently been regarded as indicators of depositional environments. The productivity of pelagic fish has been estimated from the accumulation rate of ichthyoliths (Sibert et al., 2014, 2016, 2020; Sibert and Rubin, 2021), the evolution of pelagic ecosystems has been explored through variations in morphotypes (Sibert et al., 2018; Sibert and Rubin, 2021), and the distribution of pelagic fish has been studied through variation in the length of fish teeth (Britten and Sibert, 2020). Hence, establishing an effective method for ichthyolith observation will enable access to records of the evolution of the pelagic realm, which has long been a black box in Earth science.
Traditionally, ichthyolith analysis first involves extracting coarse-grained particles from the target sediment. By observing these grains under a stereomicroscope, ichthyoliths are manually picked and transferred onto a slide using a fine-pointed brush. This process, called 'handpicking', remains a common technique in both stratigraphic and environmental research (Ohta et al., 2020; Sibert et al., 2017) and is one of the most time-consuming steps in ichthyolith analysis. Slides bearing the ichthyoliths are then observed under a microscope for detailed description and identification. Observers describe a range of features, including outer shape, inner structures, and size (Britten and Sibert, 2020; Doyle and Riedel, 1979; Sibert et al., 2018), which also requires considerable time and effort from experienced experts.
In comparison with these manual techniques, recent developments in computer vision have achieved promising results in various fields, including medicine, neuroscience, and robotics (Jo et al., 2017; Kim et al., 2018; Sakai et al., 2018; Shoji et al., 2018; Suleymanova et al., 2018). Computer vision techniques have also been applied in microfossil research for the tasks of classification and detection. The classification of microfossils was first attempted by extracting key morphological parameters from microfossil images (Marmo et al., 2006; Yu et al., 1996), with support vector machines (SVMs) classifying specimens according to the acquired values (Apostol et al., 2016; Bi et al., 2015; Hu and Davis, 2005; Solano et al., 2018; Xu et al., 2020). With the development of convolutional neural networks (CNNs), deep-learning-based classification models have successfully been used to determine the taxa of various microfossils, including foraminifera and radiolarians (Carvalho et al., 2020; Hsiang et al., 2019; Itaki et al., 2020; Keçeli et al., 2017; Marchant et al., 2020; Mitra et al., 2019; Pires de Lima et al., 2020; Xu et al., 2020; Tetard et al., 2020). Although some of these classification models achieve an accuracy of > 85% (Hsiang et al., 2019; Itaki et al., 2020; Marchant et al., 2020; Tetard et al., 2020), large training datasets are often required, which creates the challenge of generating a large number of images for each microfossil species. To address this problem, previous studies (Beaufort and Dollfus, 2004; Hsiang et al., 2018; Itaki et al., 2020; Tetard et al., 2020) have proposed methods that capture the entire area of a slide. In these studies, individual particles were extracted from the image by thresholding, which may reduce the efficiency of ichthyolith observation for two reasons. First, particles have to be positioned on the imaged slides without overlap, which can be practically difficult when using glass slides. Second, ichthyoliths are translucent when observed under a polarized light microscope, which makes determining an appropriate threshold challenging.
Here, as a first step toward using deep learning for ichthyolith observation, we describe a deep-learning-based system that can detect microfossil fish teeth in glass slide images and predict their lengths. The system is composed of open-source libraries, so it can be readily applied to a range of detection problems within the geosciences.

System overview
Our system is divided into two parts: (1) the detection of fossil fish teeth from slide images by a single-class object detection model and (2) the precise classification of the detected particles by a two-class classification model (Fig. 1), each described in the following sections.
The system is designed to require little manual work. The initial detection results from the Mask R-CNN model are exported to an Excel sheet. Then, for classification by EfficientNet-V2, images of individual particles are automatically generated from the slide images and the Excel sheet. This means that there is no need to export all of the Mask R-CNN detection results as images (which would consume considerable disk space) or to manually move files between folders.
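The hand-off between the two models can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the bounding-box convention, and the use of a nested list in place of a real image array are all our assumptions.

```python
# Sketch of the hand-off between the two models: each detection row records
# a bounding box on a slide tile, and the particle image is cropped from the
# tile only when the classifier needs it (hypothetical helper, not the
# paper's implementation).

def crop_particle(tile, bbox, pad=8):
    """Crop one detected particle from a tile (a 2-D list of pixel values).

    bbox = (y1, x1, y2, x2) in pixel coordinates; `pad` pixels of context
    are added on each side, clamped to the tile boundary.
    """
    h, w = len(tile), len(tile[0])
    y1, x1, y2, x2 = bbox
    y1, x1 = max(0, y1 - pad), max(0, x1 - pad)
    y2, x2 = min(h, y2 + pad), min(w, x2 + pad)
    return [row[x1:x2] for row in tile[y1:y2]]

# A tiny 6 x 6 "tile" stands in for a 1200 x 1200 slide image.
tile = [[10 * r + c for c in range(6)] for r in range(6)]
crop = crop_particle(tile, (2, 2, 4, 4), pad=1)
```

Generating crops on demand from the coordinates in the sheet is what avoids exporting every detection as a separate image file.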

Detection using Mask R-CNN
The slide images are processed using the object detection model "Mask R-CNN". Mask R-CNN is an open-source model capable of instance segmentation, with a deep-learning-based algorithm that predicts a label for every pixel of an image (He et al., 2017). ResNet-101 (He et al., 2016) was used as the backbone of the model. Training was performed to minimize the loss function defined for the Mask R-CNN model, which consists of the sum of the losses for classification, bounding-box prediction, and masks (He et al., 2017). Stochastic gradient descent with momentum (SGDM), with a momentum of 0.9, was used as the optimizer, and the initial learning rate was set to 0.001. The input image size was set to 640 × 640 pixels.
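The optimizer named above can be made concrete with a minimal sketch of the SGDM update rule (momentum = 0.9, learning rate = 0.001). This shows only the parameter update, not the Mask R-CNN training loop itself.

```python
# Minimal sketch of SGDM, the optimizer described above:
# v <- momentum * v - lr * g;  p <- p + v

def sgdm_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """One SGDM update applied in place to a flat list of parameters."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] - lr * g
        params[i] += velocity[i]
    return params, velocity

params = [1.0, -2.0]
velocity = [0.0, 0.0]
grads = [0.5, -1.0]
params, velocity = sgdm_step(params, grads, velocity)
# On the first step velocity is simply -lr * g, so the parameters
# move to [0.9995, -1.999]; later steps accumulate momentum.
```

The momentum term smooths the gradient direction across mini-batches, which is why SGDM is a common default for training detection backbones such as ResNet-101.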

Re-classification using EfficientNet-V2
Although the fully trained Mask R-CNN model can predict the classes of the detected objects, we found that it was unable to learn the features of fish teeth from our dataset. Therefore, we combined it with another open-source deep learning model, 'EfficientNet-V2' (Tan and Le, 2021), which discriminates the classes of the particles detected by the Mask R-CNN model. Images of the detected particles were resized to 224 × 224 pixels without changing their aspect ratio and were then classified into the 'tooth' or 'noise' class by the trained EfficientNet-V2 model. Training of the EfficientNet-V2 model was performed to minimize the categorical cross-entropy implemented in the Python library Keras (Ketkar, 2017). An SGDM with a momentum of 0.9 was used as the optimizer and the initial learning rate was set to 0.005. The class determined by the image-classification model was taken as the final class predicted by the system. In other words, even if a particle was predicted as a "tooth" by the Mask R-CNN model, it was considered "noise" if it was classified as such by the EfficientNet-V2 model.
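The resize-to-224 step and the override rule can be sketched as below. The function names are ours, not from the paper's code; a real pipeline would also pad the resized crop to a square before feeding it to the classifier.

```python
# Sketch of (a) the aspect-preserving resize toward the 224 x 224
# classifier input and (b) the rule that the classifier's verdict is final
# (hypothetical helpers, not the paper's implementation).

def fit_to_square(h, w, target=224):
    """Scale (h, w) so the longer side equals `target`, keeping the aspect
    ratio; return the new size plus the padding needed on each axis."""
    scale = target / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    return nh, nw, target - nh, target - nw

def final_class(detector_class, classifier_class):
    """The EfficientNet-V2 label wins: a Mask R-CNN 'tooth' is kept only
    if the classifier also labels it 'tooth'."""
    return classifier_class

nh, nw, pad_h, pad_w = fit_to_square(300, 150)
# A 300 x 150 crop becomes 224 x 112, leaving 112 pixels of horizontal
# padding to reach the square input.
```

Keeping the aspect ratio matters here because tooth elongation is itself a diagnostic feature; distorting it would discard information the classifier needs.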

Preparation of slide images
Glass slides were prepared from pelagic clay samples collected at Ocean Drilling Program (ODP) Site 1179 and piston core site MR15-E01 PC11 in the western North Pacific Ocean, and at Integrated Ocean Drilling Program (IODP) Sites U1366 and U1370 in the South Pacific Ocean. The locations and water depths of these sites are summarized in Table S1. The method for preparing the slides followed previous studies on the determination of depositional ages (Doyle and Riedel, 1985; Ohta et al., 2020), with some modifications as described by Sibert et al. (2017). Approximately 5 g of the wet sediment sample was first well mixed with deionized water in a plastic bottle, and then sieved through a 62 μm mesh to collect the larger particles. Heavy liquid separation was then used to concentrate biogenic calcium phosphate grains. The particles were well mixed with a solution of sodium polytungstate (SPT; specific gravity = 2.80–2.85 g/cm³) and centrifuged at 1000–1500 rpm. The collected particles were washed with deionized water, placed on glass slides using a pipette, dried at 40 °C, and then sealed with a cover glass using a light-curing adhesive. Microscopic images of the entire area of the prepared slides were automatically captured using an RX-100 digital microscope (Hirox Co., Ltd.). This microscope has a motorized stage that moves gradually to divide the observation area into small squares, which can be imaged continuously. The magnification of the microscope was 200× (each pixel = 0.96 × 0.96 μm), and approximately 1000 images of 1200 × 1200 pixels were generated from a single slide.
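The imaging numbers above can be checked with simple arithmetic: at 200× each pixel is 0.96 μm, so one 1200 × 1200 tile covers about 1.15 × 1.15 mm. The tile count of 1000 is the approximate figure from the text, and the area estimate ignores any tile overlap.

```python
# Back-of-the-envelope check of the imaging geometry described above.

PIXEL_UM = 0.96      # um per pixel at 200x magnification
TILE_PX = 1200       # tile edge length in pixels

def tile_edge_mm(pixels=TILE_PX, pixel_um=PIXEL_UM):
    """Physical edge length of one tile, in millimetres."""
    return pixels * pixel_um / 1000.0

def slide_area_mm2(n_tiles=1000):
    """Approximate imaged area for ~1000 tiles per slide (overlap ignored)."""
    return n_tiles * tile_edge_mm() ** 2

edge = tile_edge_mm()   # 1.152 mm per tile edge
```

This kind of sanity check is also how pixel measurements are later converted back to physical tooth lengths.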

Training of the object detection model
To train the Mask R-CNN model, slide images containing at least one ichthyolith each were prepared, and ichthyolith contour and class information was annotated using the VGG Image Annotator (Dutta and Zisserman, 2019). The training dataset comprised 958 images with annotation data for 1625 teeth, and the validation dataset comprised 92 images with annotation data for 165 teeth.
The Mask R-CNN model training was conducted using the online cloud service Paperspace (https://www.paperspace.com/). To augment the dataset, the images were randomly flipped upside down and/or left-to-right during training. The initial learning rate was set to 0.001, and the model was trained for 80 epochs. The progress of learning was monitored by calculating the losses implemented in the Mask R-CNN library for both the training and validation datasets.
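The flip augmentation described above can be sketched as follows, with a 2-D list standing in for an image; in the actual training this is applied on the fly by the data loader.

```python
# Sketch of the augmentation used during training: random up-down and/or
# left-right flips of each image (illustrative, not the paper's code).

import random

def random_flip(img, rng):
    """Flip a 2-D image (list of rows) vertically and/or horizontally,
    each with probability 0.5."""
    if rng.random() < 0.5:
        img = img[::-1]                    # flip upside down
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]   # flip left-to-right
    return img

rng = random.Random(0)
img = [[1, 2], [3, 4]]
aug = random_flip(img, rng)
```

Flips are a safe augmentation here because a tooth remains a valid tooth under mirroring, unlike, say, rotation-sensitive text detection.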

Training the image classification model
Individual particles were cropped from the slide images to build the dataset for the classification model. These particles were manually labeled into 'tooth' and 'noise' classes. Examples of 'noise' particles are fish bones, opaque grains that are possibly micro-ferromanganese (Fe-Mn) oxides (Yasukawa et al., 2020), and the edges of light-curing adhesives (see Fig. 1b). The EfficientNet-V2 model was trained using the Google Colaboratory cloud service (Carneiro et al., 2018). During training, the images were randomly flipped upside down and/or left-to-right to prevent overfitting. The learning rate was set to 0.005, and the model was trained for 20 epochs. The progress of learning was monitored by calculating the losses and accuracies for both the training and validation datasets.

Tests for the practical use of the system
In addition to the validation of each model, we conducted a practical test to verify the performance of the entire system. A total of 5177 slide images from six glass slides were generated from a single sample (ODP Site 1179, Core 24, Section 5, 75–77 cm interval). This sample was not used in any of the training or validation datasets.
Annotation data for the locations of the 431 teeth within the images were prepared. The images were first subjected to detection using the trained Mask R-CNN model. By comparing the annotated data with the model predictions, the numbers of true positives (TPs), false positives (FPs), and false negatives (FNs) were determined.
TPs represent teeth that were correctly predicted as teeth by the model, FPs represent non-tooth particles that were incorrectly predicted as teeth, and FNs represent teeth that were not detected by the model. Using these values, the evaluation parameters were calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

Precision reflects the extent to which the model misclassified particles as teeth, recall reflects the extent to which the model failed to detect teeth, and the F1 score is the harmonic mean of precision and recall, indicating the overall balance of the model. After evaluation of the Mask R-CNN detection results, all of the detected particles were re-classified using the EfficientNet-V2 model, and the precision, recall, and F1 scores were recalculated.
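The metrics above can be computed directly from the TP/FP/FN counts. The counts used below are illustrative values chosen to be consistent with the reported percentages; they are not taken from the paper's tables.

```python
# Precision, recall, and F1 from TP/FP/FN counts, as defined above.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

# Illustrative counts consistent with the reported ~89.0% precision and
# ~78.6% recall on 431 annotated teeth (339 + 92 = 431).
tp, fp, fn = 339, 42, 92
scores = (precision(tp, fp), recall(tp, fn), f1(tp, fp, fn))
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two metrics, which is why it is a useful single-number summary of the precision/recall balance.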

Measurement of ichthyolith length
The dimensions of ichthyoliths are key to their accurate classification. Here, we defined the length of a tooth as the perpendicular distance from the apex of the outline to its lowest level (Fig. 2a), based on the traditional ichthyolith description system (Doyle and Riedel, 1979). Given that variation in tooth length can be used as an indicator of variation in the body sizes of pelagic fish (Britten and Sibert, 2020), we attempted to predict tooth lengths automatically by approximating the detected contour of each tooth with a rectangle and measuring the length of its longest side (Fig. 2b). This approach was based on the assumption that most teeth have an elongated shape (Britten and Sibert, 2020). To evaluate the accuracy of the acquired lengths, tooth lengths were manually measured in the images using the following three steps: (1) the start and end points of the measurement were determined manually based on the definition in traditional ichthyolith biostratigraphy; (2) the distance between the start and end points was measured in pixels using a PC application named "PhotoRuler" (http://inocybe.info/_userdata/ruler/help-eng.html); and (3) the distance in pixels was converted to μm using the resolution of the image (1 pixel corresponds to 0.96 μm).

The highest F1 score was obtained at a threshold confidence score of 0.45 (Table S2). At this threshold score, the practical test resulted in 78.6% recall, 89.0% precision, and an F1 score of 83.5%.
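The rectangle-based length estimate can be sketched as below. We use an axis-aligned bounding rectangle for simplicity (the paper does not specify whether the fitted rectangle is rotated), and the contour is just a list of (x, y) points standing in for the Mask R-CNN output.

```python
# Sketch of the length estimate described above: fit a bounding rectangle
# to the detected contour, take its longest side, and convert pixels to
# micrometres at 0.96 um per pixel (illustrative, not the paper's code).

PIXEL_UM = 0.96

def predicted_length_um(contour):
    """Longest side of the axis-aligned bounding rectangle, in um."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    longest_side_px = max(max(xs) - min(xs), max(ys) - min(ys))
    return longest_side_px * PIXEL_UM

# An elongated "tooth" contour 250 px tall and 60 px wide:
contour = [(10, 5), (70, 5), (40, 255), (10, 200)]
length = predicted_length_um(contour)
```

The elongation assumption is what lets the longest rectangle side stand in for the apex-to-base length; for a nearly equant particle the two sides would be similar and the estimate less meaningful.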

Detection of fish teeth
Detection by the Mask R-CNN model alone resulted in high recall but very low precision, which may be due to loose criteria for judging particles as teeth. Using this model, almost all of the ichthyoliths were correctly detected, but many non-tooth particles were incorrectly classified as teeth. Therefore, the Mask R-CNN model alone does not represent a time-saving approach, because manual intervention is still needed to correctly identify ichthyoliths among the large number of detected particles.
In contrast, the combined system showed markedly higher precision and only slightly lower recall (Fig. 3). This indicates that the EfficientNet-V2 model is effective at identifying fish teeth among the large number of particles detected by the Mask R-CNN model. The F1 score was 83.5%, approximately eight times higher than that of the Mask R-CNN model used alone.
For the application of this system in stratigraphic research, it is important to detect clear and distinct ichthyoliths with few false positives, even if small and obscure ichthyoliths are missed. In this case, a threshold score of 0.45 should be used to obtain the highest F1 score. In environmental research, the total number of ichthyoliths within a sample is an important proxy, so a threshold score of 0.1 combined with manual checking of the detection results can minimize the occurrence of false negatives. Although this approach requires some manual labor, it is much more time-efficient than the previous handpicking process.
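The threshold trade-off described above can be sketched by filtering detections on the classifier's confidence score. The scores and labels below are made-up values for illustration only.

```python
# Sketch of the threshold trade-off: a low threshold recovers more teeth
# (fewer FNs) at the cost of more false positives; a high threshold does
# the reverse (illustrative, not the paper's evaluation code).

def counts_at_threshold(detections, n_annotated, threshold):
    """detections: list of (score, is_true_tooth). Returns (TP, FP, FN)."""
    kept = [is_tooth for score, is_tooth in detections if score >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = n_annotated - tp
    return tp, fp, fn

dets = [(0.95, True), (0.60, True), (0.40, False), (0.20, True), (0.05, False)]
low = counts_at_threshold(dets, n_annotated=3, threshold=0.1)    # env. research
high = counts_at_threshold(dets, n_annotated=3, threshold=0.45)  # stratigraphy
```

Sweeping the threshold over such tuples is how a table like Table S2 is produced, after which the operating point is chosen to suit the research question.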

Measurement of ichthyolith length
The scatter diagram comparing the tooth lengths predicted from the detected contours with the manually measured lengths is shown in Fig. 4. In three cases (out of 341), the predicted length was significantly shorter than the measured length, which occurred when the Mask R-CNN model was unable to determine the full contour of the tooth. Overall, however, the predicted lengths of 90.6% of the detected teeth were within ± 20% of their measured lengths. This indicates that, in addition to detection and classification, our system provides an efficient means of determining the length distribution of fossil fish teeth.

Implications for the wider application of object detection in the geosciences
There are many fields within the geosciences in which images are used to detect and/or count target objects (Ohta et al., 2016, 2020; Takahashi et al., 2009; Usui et al., 2017). Automating these tasks using object detection techniques has the potential to yield a greater number of results and enable more comprehensive investigations than has previously been possible. However, object detection has not yet been widely applied in the geosciences, with the exception of remote sensing (Zhang et al., 2020). This can be attributed to the difficulty of generating the large training datasets required for precise detection. First, image acquisition requires special equipment, such as microscopes (polarizing, stereoscopic, and scanning electron microscopes) and computed tomography (CT) scanners, and the associated cost, time, and manual labor can make acquiring a large number of images impractical. Second, the annotation process for object detection often requires skilled expertise, and, compared with more applied fields of research such as robotics, medicine, and materials science, devoting sufficient resources (both budgetary and personnel) to annotation may be less prioritized in the geosciences.
Our study shows that a relatively small dataset (< 1000 microscopic images containing approximately 1800 teeth) is sufficient to train the Mask R-CNN model to detect the contours of possible teeth, although, when used alone, it was not sufficient to distinguish the teeth precisely. Therefore, the best overall performance was achieved by fully training a model focused on the classification of the predicted regions, which requires much less time and manual labor than preparing a large dataset for the Mask R-CNN model. This indicates that challenging object detection problems can be efficiently addressed by dividing the task into two subtasks, i.e., extracting the contours of candidate objects and then precisely classifying the objects based on the extracted contours. This implies that object detection may be applied in various fields of the geosciences, especially where the acquisition of large training datasets for object detection has proven to be challenging.

Conclusions
We developed and tested a system to detect fossil fish teeth from slide images by combining two open-source deep learning models: the object detection model 'Mask R-CNN' and the image classification model 'EfficientNet-V2'. The system provided results with 89.0% precision, 78.6% recall, and an F1 score of 83.5% in a test that assumed realistic conditions, indicating its potential for practical application. In addition, the system successfully derived the lengths of 90% of the detected teeth to within ± 20% of their manually measured values. As such, the system has potential for constraining both the depositional ages and environments of deep-sea sediments and, more broadly, for contributing to research on the evolution of marine ecosystems. Additional work is now being undertaken to update the EfficientNet-V2 model so that ichthyoliths can be further classified into morphological taxa. This requires a larger dataset of ichthyolith images, which could be compiled with the support of the system.

List of Tables
Table S1. Locations and water depths of the analyzed sites.
Table S2. Results of the practical test with varying threshold confidence scores of the EfficientNet-V2 model.


Figure S1 shows the trend of the loss function for each training epoch.

Fig. 2. Photographs showing the method used to obtain the "predicted length".

Fig. S1. Losses for the training and validation datasets during training of the Mask R-CNN model. The blue line indicates the loss calculated for the training dataset and the orange line indicates the loss for the validation dataset.
