1 Introduction

Plankton are tiny organisms living in the ocean. They are fundamental to the ocean’s food chain, serve as an indirect indicator of water pollution (Suthers and Rissik 2019), and fix approximately 40% of the world’s carbon (Falkowski 1994). When researchers began estimating the carbon cycle, they found the task straightforward for terrestrial areas (Table 1, Riley (1944)) but far more difficult for the ocean. The first approaches gave the ocean eight times the carbon fixation yield of land (155 Gt C yr\(^{-1}\), Table 1.III, Rabinowitch (1945)). Later, this quantity was reduced to slightly less than the terrestrial value (15 Gt C yr\(^{-1}\), Nielsen (1952)). The current estimate lies between these two values (approximately 69 Gt C yr\(^{-1}\), del Giorgio and Duarte (2002)) but still carries considerable uncertainty. Because plankton are an integral and necessary component of the global carbon and nutrient cycles (Blaschko et al. 2005; Sieracki and Webb 1991) and a key element in regulating the planet’s temperature (Blaschko et al. 2005; Hays et al. 2005), many approaches have been proposed over the last few decades for accurately assessing their role in the global ocean.

Estimates of the abundance, size, and taxonomy of plankton organisms are of paramount importance for accounting for the role of living marine components in the flux, export, and sequestration of carbon to the interior of the ocean. Autotrophic organisms take up CO\(_{2}\) for photosynthesis in the euphotic zone, fixing anthropogenic carbon into organic matter in the ocean. This carbon circulates through the trophic web of the ocean, and a small fraction is transported downwards through passive and active fluxes as well as physical mixing (Buesseler et al. 2007). Passive flux is related to the sinking of organisms and particles, while active flux is transport carried out by zooplankton and micronekton migrants, which consume carbon in the shallower layers of the ocean and respire, excrete, egest, and die in the meso- and bathypelagic zones (Hernández-León et al. 2019, 2020). The carbon transported by planktonic organisms towards the deep ocean remains there for decades or centuries, preventing CO\(_{2}\) from accumulating in the atmosphere and slowing the increase in the planet’s temperature. Another important fraction of this carbon is respired in the upper layers of the ocean, returning to the atmosphere. The role of phyto- and zooplankton in this transport system is of central concern in climate change studies. Accounting for these fluxes requires knowledge of the biomass and physiology of the ocean biota (e.g., Garijo and Hernández-León (2015)). Both parameters can be assessed through the abundance, sizes, and taxonomies of organisms. Since directly counting, sizing, and classifying the microscopic plants and animals of the sea is laborious and time-consuming, oceanographers have relied on technology to quantify these organisms more quickly.

The first method employed a microscope to estimate the inorganic and organic suspended matter retained on a molecular filter (Goldberg et al. 1952). Since microscopic work is tedious, Jerlov (1955) employed a Tyndall meter, which related seawater scattering to the total surface area of the particles contained in it. Although the Tyndall meter readings were accurate to within ±10% relative to the microscope, the predominance of small and transparent particles led to measurement problems, leaving microscopy as the superior method.

After succeeding in counting blood cells (Mattern et al. 1957), scientists began to use the impulse Coulter counter (Model A) to count marine organisms (Hastings et al. 1962; Maloney et al. 1962). Organisms placed in an electrically conductive medium pass through a small aperture, producing a voltage drop depending on the organism’s size. These impulses are then amplified, recorded, and visualized. By using different aperture sizes, organisms with diameters from 3 \(\mu\)m to 1 mm can be counted and sized (Sheldon and Parsons 1967b), at densities ranging from 50 to 100,000 cells/ml (Hastings et al. 1962). This was a significant advance in accounting for the abundance of phytoplanktonic cells, in terms of not only accuracy but also counting speed. The subsequent Coulter counter version (Model B, Sheldon and Parsons (1967a)), together with an automatic cell-size distribution plotter (Model J), even allowed researchers to measure aggregate forms (i.e., chain-forming diatoms, Parsons (1965)). Despite its accuracy and speed, this method counts all particles falling in the same size range (Rehnberg et al. 1982) without distinguishing live cells from dead cells, fragmented cells, or debris. Maddux and Kanwisher (1965) developed an in situ particle counter by applying the same Coulter counter principle but installing it at the cod end of a tow-net. Later, other systems were developed, such as the automated plankton counter (Cooke et al. 1970), which could size larger particles than the Coulter counter could handle. It operates as follows: a fluid stream carries an organism while its image is projected onto an array of photosensors; when the path of light to the first photosensor is blocked, an impulse sequence initiates, finishing when the first sensor is again unblocked and thus indicating the organism’s passage. This instrument was practical for counting and sizing organisms, hampered only by sample clogging (Fulton 1972). In 1971, acoustics was incorporated into plankton quantification (Beamish 1971). However, captured samples were still needed to determine what the echo represented, an idea known as ground truthing (McClatchie et al. 2000). This method is helpful for determining spatial distributions but not for identification.

Significantly, none of these techniques found general acceptance in the scientific community, probably because they were not explicitly developed for this specific scenario (Jeffries et al. 1981) and could not discriminate between taxonomic groups (Jeffries et al. 1984). Furthermore, all these methods were developed to process water samples. In contrast, in 1953, Nishizawa et al. (1954) captured in situ underwater photographs of suspended matter and plankton using a so-called undersea observation chamber (Fig. 1). In doing so, the authors observed that large particles were more common than reported in the literature (Goldberg et al. 1952; Jerlov 1953). They considered that the water sampling process could damage or disintegrate these large particles. Other authors agreed (Riley 1963; Mullin 1965), noting how natural aggregations could be broken up during sampling or rendered unidentifiable and biased when using standard preservation techniques (Murphy and Haugen 1985; Ortner et al. 1981; Zarauz and Irigoien 2008), “which probably accounts for the fact that this phenomenon passed largely unnoticed until the advent of direct undersea observation” (Riley 1963). Hence, in situ water photography presented an excellent alternative to traditional net-sampling techniques, demonstrating the advantage of sampling fragile taxa that would otherwise be damaged (Olney and Houde 1993; Tiselius 1998).

Fig. 1
figure 1

Initial approaches for plankton quantification

Since the 1980s, many plankton imaging systems have emerged, from laboratory setups to in situ underwater video systems (Table 1, Lombard et al. (2019)). These systems have produced a massive backlog of unlabelled plankton data, which, in conjunction with the limited availability of human taxonomic expertise, represents a threat to future scientific studies (Simpson et al. 1992; Culverhouse et al. 2006; MacLeod and Association 2007). This review aims to describe and compare the image classification algorithms that have been implemented to date for marine plankton. We hope this work will be helpful for those who want to develop new image classification algorithms for marine plankton in the future.

The rest of the paper is organized as follows. Section 2 briefly defines plankton. Section 3 describes the selection criteria for this review. Section 4 reviews the selected papers, grouped into six categories according to their classification techniques; Table 1 presents a summary of these reviewed papers. In Sect. 5, the conclusions and future research directions are discussed.

Table 1 Summary of the reviewed papers

2 A look at plankton

Plankton drift in the ocean in swarms, a phenomenon known as patchiness in biological oceanography (Folt and Burns 1999). These small organisms range in size from tenths of microns up to centimetres and are categorized into phytoplankton (mainly autotrophs, acting as plants) and zooplankton (mainly heterotrophs, acting as animals). Figure 2 shows snapshots of plankton from four valuable datasets found while producing this review. A vast and diverse image dataset is vital for developing new classification procedures.

Fig. 2
figure 2

Samples from four plankton datasets

Figure 2A shows a phytoplankton sample (dinoflagellate, ceratium) from the dataset released by the Woods Hole Oceanographic Institution (WHOI). The WHOI-Plankton dataset was one of the first public databases collected in situ with the Imaging FlowCytobot (IFCB) at Martha’s Vineyard Coastal Observatory, Massachusetts (Sosik 2015). It contains more than 3.5 million labelled images falling into 103 categories.

Figure 2B shows a zooplankton sample from the PlanktonSet dataset (crustacean, amphipod) released by Oregon State University’s Hatfield Marine Science Center. This dataset was initially offered to the National Data Science Bowl (NDSB) competition, collected with the In Situ Ichthyoplankton Imaging System (ISIIS) in the Straits of Florida (Cowen et al. 2015). It contains 30,000 labelled images falling into 121 categories, plus a test set of unlabelled images.

Figure 2C shows an image containing two phytoplankton from the PMID2019 dataset (diatom, coscinodiscus - rounded one, and dinoflagellata, ceratium fusus - elongated one) released by the Ocean University of China, Qingdao. This was the first available high-resolution phytoplankton colour dataset captured with an Olympus BX53 fluorescence microscope from preserved samples collected in Qingdao Jiaozhou Bay, Shandong (Li et al. 2019). It contains over 10,000 labelled phytoplankton images from 24 categories.

Figure 2D shows a zooplankton sample from the DYB-PlanktonNet dataset (crustacean, shrimp-like) released by the Chinese Academy of Sciences, Shenzhen Institute of Advanced Technology. This dataset was collected with an innovative in situ colour imager installed on a buoy in the South China Sea near Shenzhen city (Li et al. 2021b). It contains over 46,000 labelled plankton images from 90 categories.

3 Article selection criteria

The reviewed papers were selected based on the following exclusion criteria.

Recent papers classifying low-quality captured images were discarded because those images lacked detail and were blurred or out of focus, making the classification process considerably harder and distracting from the aim of this paper.

Classification studies using flow cytometry (FC) data were discarded because they employed fluorescent and light scattering characteristics for classification instead of a specimen’s picture (McKinnon 2018). However, we included papers that classified images acquired with the Imaging FlowCytobot or the FlowCam; these instruments are based on the FC principle but add video technology to produce high-resolution photos.

Papers classifying images with specific features related to their acquisition methods, such as fluorescence microscopy (Blackburn et al. 1998; Rodenacker et al. 2006; Ng et al. 2017), polarized microscopy (Tiwari and Gallager 2003), or luminous events of bioluminescent plankton (Kocak et al. 1999), were discarded. These classification procedures employ features that are different from those extracted from a specimen’s image.

In addition, articles with low correct classification rates (\(<60\%\) accuracy) were excluded, as were articles reporting automatic classification rates above \(99\%\) (e.g., Loke et al. (2004)). Recent articles with procedures similar to those used in the past were not considered (e.g., the work of Zhou et al. (2008) resembled that of Liu et al. (1994) and Thonnat and Gandelin (1988)). Articles with equivalent procedures but meagre accuracy improvements were discarded (e.g., Kramer et al. (2011) improved accuracy by at most 2.1%), and articles categorizing fewer than four classes were also discarded.

In summary, this study reviews novel plankton classification techniques for black-and-white or greyscale images, from their beginnings until today.

4 Image classification chronology

In 1970, the health sciences established the techniques used in the forthcoming automated plankton identification systems: a special-purpose computer extracted a set of parameters describing blood cell features for later identification via pairwise separation (Ingram and Preston 1970). Because the methods discussed in the introduction could not perform automatic plankton identification, Fawell (1976) introduced a similar technique using image analysis equipment (the Quantimet 720, IMANCO (1970)) to extract morphometric features from plankton samples for later use in an image classifier. This strategy paved the way for future plankton identification work until the advent of deep learning, in which feature extraction and classification are intimately tied together.

4.1 Classification via linear discriminant analysis

Linear discriminant analysis (LDA) is a popular supervised classification technique that assumes normally (Gaussian) distributed data points and identical covariance matrices for all classes. It is also known as Fisher’s LDA, as Ronald A. Fisher developed it in the 1930s. LDA aims to find the projection hyperplane that maximizes the distance between the projected means of the given classes while minimizing the variance within the classes; in a multiclass scenario, this can be stated as a generalized eigenvalue problem. If c is the number of training classes, we can find \(c-1\) discriminatory directions (the largest eigenvectors) that separate the c classes as much as possible (Fig. 3).
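To make this procedure concrete, the following minimal sketch (in Python, with scikit-learn and hypothetical morphometric feature vectors) fits an LDA model and projects the data onto the \(c-1\) discriminatory directions:

```python
# A minimal sketch of LDA classification; the feature vectors here are
# hypothetical stand-ins for morphometric measurements (area, perimeter, etc.).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 specimens, 5 morphometric features
y = rng.integers(0, 4, size=100)     # 4 hypothetical plankton classes

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With c = 4 classes, LDA finds at most c - 1 = 3 discriminatory directions.
X_projected = lda.transform(X)       # shape: (100, 3)
predictions = lda.predict(X)         # class label for each specimen
```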

Fig. 3
figure 3

Iterative construction of an optimal hyperplane in a 2-D feature space (Schlimpert et al. 1980)

Uhlmann et al. (1978) designed a pattern recognition system for phytoplankton. For image acquisition purposes, they built a custom camera device (a vidicon) to scan either phytoplankton photographs or species preserved in formalin. After contrast enhancement, the control unit transferred the scanned images to a computer (Robotron KRS 4200), which processed the patterns into a 2-D power frequency spectrum, defining a position in polar space that was independent of translations and rotations. Two years later, Schlimpert et al. (1980) published the details of this work. To construct a linear classifier, they prepared a set of training patterns comprising five classes, which were iteratively presented to the classifier while its parameters were tuned. As this technique produced unsatisfactory results, they merged two interlacing classes, retrained the classifier with four classes, and separated the merged classes in a posterior second stage. They obtained a 93.2% correct classification rate when categorizing the five classes. Even with this remarkable result, they expected that better classification accuracy could be obtained if the scan resolution were improved. The 2-D power frequency spectrum processing step took approximately 10 s per pattern, and the classifier converged in between 3 and 170 s, depending on the class. Nevertheless, the authors were aware that higher speeds could be attained through parallel processing. Given the resolution of the scanned images shown in their paper, this work was a remarkable effort at the time. However, LDA is not the most appropriate method when some input variables are correlated, as shown in this work.

Jeffries et al. (1980) also employed LDA but instead categorized zooplankton. The idea was to classify species into ecologically and taxonomically meaningful groups. To this end, they identified which species were viable for classification according to morphometric relations. They employed two datasets: photocopied plates of species obtained from the published literature and contours of preserved specimens traced by hand on acetate sheets; the first was captured with a vidicon, and the second was obtained with the Bausch and Lomb QMS system. The first dataset was classified into six shape categories using LDA with the jack-knife procedure, yielding a 93.1% correct classification rate. The second dataset was classified into 19 major species using an LDA-based nonpooled covariance procedure, reaching a 97.9% correct classification rate thanks to the combination of four morphometric relations. This work marked the beginning of automatic zooplankton image classification, achieving an 85% time reduction compared to manual processing in routine laboratory work.

Four years later, Jeffries et al. (1984) built a custom vidicon system to capture better plankton images. As a novelty, six satellite microprocessors were employed for feature extraction. These microprocessors worked in parallel under a central computer, obtaining an 89% correct classification rate when categorizing eight groups with the same five features as in their previous work (Jeffries et al. 1980). Katsinis et al. (1984) improved this classification process by employing eight satellite microprocessors to extract nine morphometric features, obtaining a 92% correct classification rate in less than half the time required by Jeffries et al. (1984). These studies employed custom edge detection techniques to erase all zooplankton antennae and swimming legs, which otherwise caused problems during feature extraction. Although the plankton group classification results were promising, species identification was still impossible due to the poor quality of the input images. Additionally, it was necessary to manually check that no organisms overlapped before scanning. These works are noteworthy for the star architecture of their central computer and satellite microprocessors, and they demonstrate how easily an LDA classifier can be implemented. However, as stated before, an LDA classifier may not be the most appropriate choice when similar plankton species exist.

4.2 Classification via hierarchical clustering

Hierarchical clustering (HC) is an unsupervised classification algorithm that organizes a group of unlabelled data points into a tree-shaped structure (called a dendrogram). The dendrogram can be built in a bottom-up manner, in which all data points are initially considered single clusters that are iteratively merged until only one cluster remains, or in a top-down manner, in which all data points initially belong to a single cluster that is recursively split until each cluster contains only a single data point. These approaches are named agglomerative and divisive HC, respectively (Fig. 4). HC gained prominence after the publication of the book "Numerical Taxonomy" (Sneath and Sokal 1973), in which the authors presented the advantages of HC in the field of biology and its application to organism classification.
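The following minimal sketch (SciPy, on hypothetical feature vectors) builds an agglomerative dendrogram bottom-up and cuts it into a fixed number of clusters:

```python
# A minimal sketch of agglomerative HC; the specimens and feature count
# are hypothetical placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))            # 50 unlabelled specimens, 7 features

# Bottom-up (agglomerative) merging with Ward's criterion.
Z = linkage(X, method="ward")

# Cut the dendrogram into a fixed number of clusters (here, 8 groups).
labels = fcluster(Z, t=8, criterion="maxclust")

# dendrogram(Z) would draw the tree-shaped structure for visual inspection.
```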

Fig. 4
figure 4

HC methods: agglomerative (A) and divisive (B)

Chehdi et al. (1986) from ENST Bretagne collaborated with IFREMER in Brest, using an agglomerative HC approach to classify 16 groups of zooplankton (Fig. 4A). A camera installed on a microscope captured the images, and Freeman’s chain code was used to encode the contours. The appendages were removed to focus on the bodies, separating those that seemed to overlap. They tested this system on 320 specimens using ten morphometric features; the algorithm could only find specimens falling into eight biological groups. Although this result was not the best, biologists appreciated how the approach grouped specimens, but improvements in image quality were mandatory for better classification. Six years later, Chehdi and Coquin (1992) optimized this method to classify 14 groups of zooplankton. As a novelty, they preprocessed the input images to resolve problems concerning shadows, lighting, and high-frequency noise. Seven features, five morphometric and two inertial, were extracted to orient the specimens. They successfully classified 280 specimens into 14 groups, obtaining a 78.8% correct classification rate. As a drawback, this method required manual supervision to control the grouping at each aggregation level.

Thonnat and Gandelin (1988) applied a divisive HC method to classify Mediterranean zooplankton and their developmental stages for the first time (Fig. 4B). After photogram digitization and organism detection, they extracted morphological, densitometric, external, and internal parameters. Then, marine biologists defined 59 prototypes containing descriptions for each class and subclass of zooplankton. Finally, they proceeded to classify the plankton with 66 conditional rules, each image taking 122 s to process. They tested this classifier on only 40 zooplankton images and provided no information on accuracy, stating that they were working on constructing a more extensive dataset. This plankton application resulted from extending their previous work (classifying galaxies with HC) to other areas. While applying HC to plankton classification seems interesting, it may not be the most appropriate method because it does not use prior knowledge about the relationships between the different species. However, the tree-like structure provides a valuable visual representation, helping scientists understand how different species are related to one another and enabling them to identify patterns.

4.3 Classification via artificial neural networks

To mimic the architecture of the human brain, artificial neural networks (ANNs) were introduced in the 1940s. ANNs are composed of artificial neurons. Each artificial neuron receives inputs and generates a single output, which can be provided as an input to other neurons. These connections, like the synapses in a biological brain, transmit signals from one neuron to another. A weight value increases or decreases the strength of the signal traversing a connection. A typical ANN organizes its neurons into several layers: one input layer and one output layer, with hidden layers between them. The weight values are usually adjusted by minimizing the difference between the ANN’s output and the target output. This training procedure is the most important aspect of building a reliable model.
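As an illustration, a minimal sketch of such a fully connected network and its backpropagation training loop, written here in PyTorch and mirroring the small 15-3-5 topology discussed below (the data are hypothetical placeholders), might look as follows:

```python
# A minimal sketch of a 15-3-5 fully connected ANN trained via
# backpropagation; inputs and labels are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(15, 3),   # input layer -> hidden layer (3 neurons)
    nn.Sigmoid(),       # early ANNs typically used sigmoid activations
    nn.Linear(3, 5),    # hidden layer -> output layer (5 classes)
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 15)            # hypothetical 15-dimensional feature vectors
y = torch.randint(0, 5, (32,))     # hypothetical class labels
for _ in range(100):               # weight tuning via backpropagation
    optimizer.zero_grad()
    loss = criterion(model(x), y)  # difference between output and target
    loss.backward()
    optimizer.step()
```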

Simpson et al. (1991, 1992) published the first work applying an ANN to plankton identification. Three years later, Culverhouse et al. (1994) employed the same network but increased the number of categories from two to five congeneric species, which were difficult to classify even for taxonomists. The topology of the ANN was a 15-3-5 fully connected network (every neuron in one layer was connected to every neuron in the next layer). Backpropagation was used to tune the weight values of the ANN based on the error rate obtained in the previous iteration. The dataset was the same as that used by Williams et al. (1994) and comprised digitized photomicrographs validated by six experts and split into two subdatasets: one for training and another for testing. After generating Fourier spatial frequencies as features, they ran several trials with a randomized dataset size. The minimum mean test error obtained was 23% on unseen images. Species that experts found difficult to classify were also difficult for the ANN, owing to the poorly discriminative Fourier components of these species, as clearly seen in Fig. 5. This motivated the authors to look for other preprocessing techniques in conjunction with adjustments to the Fourier technique and their ANN. The authors knew that there was room for improvement because Williams et al. (1994) had proven, via a multivariate statistical analysis of the Fourier-transformed data, that this dataset was discriminable with 95% confidence. This work is exemplary for understanding how ANNs can be sensitive to data quality, performing poorly when the input data are noisy or contain outliers.

Fig. 5
figure 5

Sixteen-bin FFT histograms for C. vanhöffeni and C. convallaria. An ANN correctly classified 53% of the C. vanhöffeni samples and 98% of the C. convallaria samples (Culverhouse et al. 1994)

Ellis et al. (1994) also employed the technique used by Simpson et al. (1991, 1992) to classify four plankton species. As a novelty, they combined the Fourier spatial frequencies with the internal textures of the plankton bodies extracted using the Gabor function. They achieved a 73% accuracy rate using these combined features, versus only 69.8% and 43.7% when using the Fourier-transformed spatial frequencies or the texture features alone, respectively. This ground-breaking work was the first in the field of plankton classification to examine the advantage of combining both types of features. However, as shown in their paper, image preprocessing is crucial for attaining high accuracy, making this technique very time-consuming for plankton classification and other applications.

Culverhouse et al. (1996) developed an automatic system for classifying 23 species of dinoflagellates identified as having negative impacts on the aquaculture industry. They first digitized the specimens archived at the Plymouth Marine Laboratory and fresh samples collected by the Spanish Institute of Oceanography. Afterwards, these samples were labelled by experts to create a database. Then, they determined the edges using the Sobel operator and extracted 60 features, including shape and texture features. Later, they compared the classification performance of two ANNs with that of two classic multivariate statistical techniques. The radial basis function (RBF) ANN performed best, with an 83% classification accuracy, compared to the 85% achieved by taxonomists. The authors highlighted these results because the classifiers were trained and tested on field-collected specimens instead of cultured specimens. While this system could not handle overlapping specimens, detritus touching the samples did not appear to affect the accuracy of the classifier because it was much smaller than the plankton. This system evolved slightly, and in 2000, the software was designated dinoflagellate categorization by an ANN (DiCANN) (Culverhouse et al. 2000). Although taxonomists slightly outperformed the ANN, this study argued that ANNs could be superior to other discriminant methods, and even to taxonomists, whose better accuracy can fade away due to psychological factors such as short-term memory limits, fatigue, boredom, recency effects, and positive bias. The potential sample analysis time was reduced from 120 min for human taxonomists to 5 min for the ANN. Additionally, this work confirmed that data preprocessing is important for ANNs; the authors had to develop a specific preprocessing technique to manage debris.

Tang and his colleagues (Tang and Stewart 1996; Tang et al. 1998) produced a pattern recognition system to classify the massive quantity of images collected with the Video Plankton Recorder (VPR) (Davis et al. 1992). First, a video processing system rejected objects that were out of focus. Next, the system extracted features with three methods: moment invariants, Fourier boundary descriptors, and granulometric features; all three techniques captured shape information, while the latter also acquired texture information. The authors then removed redundant information by reducing the feature vectors with the Karhunen-Loeve transform (KLT). Finally, they applied the supervised learning vector quantization (LVQ) ANN classifier to six classes of plankton. Due to the time-consuming nature of the classifier, they used a parallel training strategy. Regarding the results, they focused on which feature or feature combination gave the best accuracy rather than on finding the best network parameters, obtaining 92.2% accuracy on the test set with the combined feature vector. This work demonstrated the benefit of using granulometric features, alone or in combination with other features, as shown in Table 1 of their paper, which helped counteract the noise, occlusion, and variability of the dataset. The innovative parallel strategy used to reduce the training time of the LVQ algorithm is also worth noting.

Ellis et al. (1997) developed an alternative method for classifying six phytoplankton species. It combined the results of different individual classifiers into one result with an arbitration algorithm, based on the idea that a group of experts produces a more accurate classification than an individual expert. They named their technique "classification by committee". First, they described the image textures with a Gabor filter, which enabled the input image to be segmented into background and foreground. Then, five different shape and texture feature vectors were extracted from the foreground and individually employed to train five ANNs, yielding five different classification results. These results were combined via three criteria: the majority of votes, the sum of votes, and the most confident network, where each network had a single vote. Among the three criteria, the most confident network combination offered a maximum accuracy of 84%. Finally, the authors compared these "classification by committee" techniques with another method called "classification by collective machines", which combined the five results with an ANN instead of a criterion. This latter method slightly outperformed "classification by committee" (70.3% versus 68.7% accuracy, respectively). In all cases, this work showed that combining the networks’ results with a criterion or an ANN yielded higher accuracy than picking the best single ANN, but only when the individual ANNs had comparable accuracies. This work tried to increase classification performance from perspectives other than those used in previously reviewed papers. Although these techniques were innovative for plankton classification, the authors were inspired by prior works (Jacobs et al. 1991; Wolpert 1992; Battiti and Colla 1994).
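As an illustration, a minimal sketch of the three arbitration criteria follows, assuming each trained network outputs per-class probability scores (all array names are hypothetical):

```python
# A minimal sketch of committee arbitration over several classifiers'
# probability outputs, each of shape (n_samples, n_classes).
import numpy as np

def majority_vote(prob_outputs):
    # Each network casts one vote: its argmax class per sample.
    votes = np.stack([p.argmax(axis=1) for p in prob_outputs])  # (n_nets, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def sum_of_votes(prob_outputs):
    # Sum all networks' class probabilities, then pick the largest total.
    return np.sum(np.stack(prob_outputs), axis=0).argmax(axis=1)

def most_confident(prob_outputs):
    # Take the prediction of whichever network is most confident per sample.
    probs = np.stack(prob_outputs)                # (n_nets, n_samples, n_classes)
    best_net = probs.max(axis=2).argmax(axis=0)   # most confident net per sample
    return probs[best_net, np.arange(probs.shape[1])].argmax(axis=1)
```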

Two years later, Dollfus and Beaufort (1999) fed position-normalized images directly into an ANN instead of using feature vectors. The authors employed free weights because neurons then specialized in specific regions of the normalized images, increasing the probability of correct classification. Thus, they obtained an average correct classification rate of 86% when categorizing 13 species of calcareous nanoplankton. This classification system, named SYRACO2, proved helpful for natural forms with high intravariability. However, they found that the position normalization algorithms needed to be more robust. Therefore, in 2004, they employed parallel ANNs (Beaufort and Dollfus 2004) in a system capable of categorizing non-position-normalized images. The images were presented directly to the ANN (vision module), and in the case of a weak identification, a second ANN (motor module) performed up to five actions on each image (translation, rotation, dilation, contrast, and symmetry) that helped confirm or reject the initial weak identification. This updated system achieved 91% accuracy in classifying 11 categories, at a speed of 7 s per image. Notably, the authors attained better accuracy by confronting the weak points of ANNs: the complicated preprocessing step and their limited effectiveness with correlated inputs. The direct image input to the network resembles the procedure employed in a convolutional neural network (CNN).

Fig. 6
figure 6

Dual-classification system implemented by Hu and Davis (2006)

Hu and his colleagues (Hu and Davis 2006; Hu 2006) presented a new system employing two classifiers: the first was an ANN using shape-based features (the same as in Tang and Stewart (1996) and Tang et al. (1998)), and the second was a support vector machine (SVM) using texture-based features (the same as in Hu and Davis (2005)). The two classifiers’ outputs were then combined, yielding a final identification if both agreed on the same class and an unknown result otherwise, as shown in Fig. 6. This system achieved 64.7% accuracy compared to the 61% attained by Davis et al. (2004) on the same dataset using an LVQ ANN classifier. The dataset was collected with the VPR and comprised seven major categories. This dual classifier performed better for taxa with low relative abundance than a single classifier fed with a unique feature vector of stacked shape and texture information. Although this work combined classifiers’ results differently from the "classification by committee" technique of Ellis et al. (1997), both documented that running classifiers in parallel and merging their results, with any strategy, covers a broader spectrum of image details and thus yields higher accuracies.
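A minimal sketch of this agreement rule, with hypothetical class-index arrays standing in for the two classifiers’ outputs, follows:

```python
# A minimal sketch of the dual-classification idea: accept a label only
# when both the shape-based and texture-based classifiers agree.
import numpy as np

UNKNOWN = -1  # sentinel class for disagreement

def dual_classify(shape_preds, texture_preds):
    """Both inputs are arrays of predicted class indices for the same images."""
    shape_preds = np.asarray(shape_preds)
    texture_preds = np.asarray(texture_preds)
    agreed = shape_preds == texture_preds
    return np.where(agreed, shape_preds, UNKNOWN)

# e.g. dual_classify([2, 0, 5], [2, 1, 5]) -> array([ 2, -1,  5])
```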

4.4 Classification via support vector machines

Support vector machines (SVMs) were introduced in the 1990s. An SVM aims to find the hyperplane that separates the given data points with the maximum margin (the maximum distance to the nearest data points of each class). Contrary to LDA, an SVM makes no distributional assumptions about the data. Moreover, whereas LDA admits an analytical solution, SVM training is an optimization problem. An SVM does not make use of the entire input dataset: the data points closest to the hyperplane are called support vectors, and the SVM is optimized over these points. The support vectors thus determine how the SVM discriminates between classes.
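As an illustration, the following minimal sketch (scikit-learn, on hypothetical feature vectors) trains a multiclass SVM; the C parameter controls the soft margin that reappears in the works reviewed below:

```python
# A minimal sketch of multiclass SVM classification; data are hypothetical
# stand-ins for 29-dimensional plankton feature vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 29))           # hypothetical feature vectors
y = rng.integers(0, 6, size=200)         # six hypothetical plankton classes

# Smaller C tolerates more misclassification (softer margin), which
# helps on noisy data; "ovo" builds one-versus-one binary SVMs.
clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
clf.fit(X, y)

print(clf.support_vectors_.shape)        # only the support vectors matter
print(clf.predict(X[:5]))
```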

Luo et al. (2003) published the first work applying an SVM to classify the massive quantity of images collected with the Shadow Image Particle Profiling Evaluation Recorder (SIPPER) (Samson et al. (2001)). After noise suppression, they calculated general and specific descriptors to produce 29 features; owing to the binary nature of the images, texture information was unavailable. They classified two datasets comprising six plankton species with a soft-margin SVM. The first was classified with 90% accuracy and the second with 75.1% accuracy; the lower accuracy on the second dataset was due to insufficient image quality and unknown objects that could not even be identified manually. To speed up the feature calculation process without compromising accuracy, they published an expansion of this work (Luo et al. 2004a) describing a selection process that reduced the number of features from 29 to 15. They obtained similar accuracies for both datasets, attaining a processing time of 160 s for 6000 images with a one-versus-one approach. They acknowledged the need to develop new features for the higher-resolution version of SIPPER (25 \(\mu\)m per pixel). Remarkably, the authors wisely chose a soft-margin SVM to cope with the noisy nature of the dataset; a soft-margin SVM tries to maximize the margin while allowing some misclassification.

Hu and Davis (2005) employed co-occurrence matrix (COM) features with an SVM to classify images captured with the VPR. Co-occurrence matrices are textural features that describe the frequency at which one grey tone appears at a specified offset from another grey tone in an image (Haralick et al. 1973). Their system computed eight COMs per image for posterior classification with an SVM. They obtained 72% accuracy compared with the 61% accuracy of Davis et al. (2004); both approaches classified the same dataset comprising seven categories, but the latter used an ANN derived from Tang and Stewart (1996). In another attempt, the authors combined texture features with shape-based features as classifier inputs, obtaining a mere 1% accuracy improvement over COM features alone. After several tests, they concluded that the main accuracy improvement came from the COM features rather than from the nature of the classifier, confirming that texture-based features are ideal for classifying field-collected images. This work highlighted that features should be selected according to the image’s nature (captured in the laboratory or in situ); this step is more critical than selecting or tuning a specific classifier. Moreover, a COM has the advantage of low computational requirements. The COM-SVM technique reduced the identification process from weeks of manual work to several hours.
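For readers who wish to reproduce such features, a minimal sketch with scikit-image (whose recent versions name these functions graycomatrix and graycoprops) might look as follows; the greyscale crop and offset choices are hypothetical:

```python
# A minimal sketch of co-occurrence (COM/GLCM) texture features in the
# spirit of Haralick et al. (1973); the image is a hypothetical placeholder.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # greyscale crop

# Co-occurrence matrices for one pixel distance and four directions.
glcm = graycomatrix(image, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=256, symmetric=True, normed=True)

# Summary statistics of the matrices become the feature vector,
# which would then feed an SVM as in Hu and Davis (2005).
features = np.hstack([graycoprops(glcm, prop).ravel()
                      for prop in ("contrast", "homogeneity",
                                   "energy", "correlation")])
```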

Fig. 7
figure 7

Accuracies achieved by different classification techniques (Lisin et al. 2005)

Lisin et al. (2005) tried a new method to deal with plankton images in which occlusion and clutter were present. First, they computed global features based on shape and texture information and local features based on the scale invariant feature transform (SIFT). The features were then classified independently; the global features were classified by an SVM, and the local features were classified by a nonparametric density (NPD) classifier, attaining 55% and 52% accuracy, respectively. Later, the authors tested two methods for combining the previous classifiers’ outputs: stacking and 2-tiered hierarchical classification. They obtained the highest accuracy (65.5%) with the stacking combination method on a 14-category dataset captured with the VPR (Fig. 7). Stacking consists of concatenating the outputs of different classifiers (in this case, the SVM and NPD outputs) and using the result as an input vector (meta-features) for a meta-classifier (an SVM), which produces the final output. As in previous works (Ellis et al. 1997; Hu 2006), the authors demonstrated that combining different classifier outputs results in higher accuracy. Nevertheless, they noted that stacking had rarely been exploited in artificial vision applications. Although this ensemble technique delivered a low accuracy, it was new to plankton classification.
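A minimal sketch of stacking with scikit-learn follows; here, a K-nearest-neighbours model stands in for the nonparametric density classifier, and in the original work each base model also received a different feature set:

```python
# A minimal sketch of stacking: the base classifiers' probability outputs
# become meta-features for a final SVM meta-classifier.
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[("svm_global", SVC(probability=True)),   # global features
                ("npd_local", KNeighborsClassifier())],   # stand-in for NPD
    final_estimator=SVC(),               # meta-classifier over meta-features
    stack_method="predict_proba",        # concatenate probability outputs
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```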

Wang et al. (2006) also benefited from COM features, but this time they operated at the geometric and pixel levels instead of the texture level. First, they searched for points of interest with a Harris detector for use as local extraction centres. After normalization, they computed a 128-dimensional SIFT descriptor. Later, they generated patterns with an associative rule mining algorithm, keeping the most discriminant features via a Pearson correlation analysis. The authors called this local feature extraction process local co-occurring patterns (LCPs). Finally, they classified five phytoplankton species with an SVM-based one-versus-one approach, obtaining an accuracy of 83%. They claimed that this result was better than the 72% accuracy obtained by Culverhouse et al. (2003a, 2003b); nevertheless, this comparison was inappropriate because the authors here categorized five species instead of the six used in the cited work. It is interesting to note how, during the heyday of SVMs, the authors reviewed thus far tried to improve accuracy by researching new features instead of tuning SVM parameters, which led to novel visual representations such as LCPs.

Unlike previous SVM frameworks, Al-Barazanchi et al. (2015) substituted the time-consuming feature engineering step with a CNN used as a feature extractor. They fed a 10-layer CNN with images resized to 32x32 pixels, comprising seven plankton categories captured with the SIPPER, and trained this CNN until the maximum accuracy was achieved. From the resulting network, they used the activations of each hidden layer as a feature vector for training another classifier. They separately evaluated a random forest and an SVM classifier with two training sets, the original dataset and the same dataset extended via data augmentation (random image rotation), using every layer as a feature vector. The best accuracy, 96.7%, was obtained with the SVM classifier on the augmented training set using the features of hidden layer 1. With this technique, they outperformed other works classifying the same dataset (Tang et al. 2006; Zhao et al. 2005, 2009, 2010; Li et al. 2014). This was the second reviewed paper, after the work of Dollfus and Beaufort (1999), to replace the cumbersome feature extraction step with an original technique, allowing new plankton species to be added quickly.
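A minimal sketch of this idea in PyTorch and scikit-learn follows; the SmallCNN architecture is hypothetical and does not reproduce the authors’ 10-layer network:

```python
# A minimal sketch of using a CNN hidden layer as a feature extractor
# for an SVM; architecture, data, and labels are hypothetical.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class SmallCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, n_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)   # hidden activations = feature vector
        return self.classifier(h), h

model = SmallCNN()                        # would first be trained to convergence
x = torch.randn(100, 1, 32, 32)           # hypothetical 32x32 plankton crops
_, feats = model(x)                       # extract hidden-layer features

# The extracted features then train a separate SVM classifier.
svm = SVC().fit(feats.detach().numpy(), torch.randint(0, 7, (100,)).numpy())
```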

Fig. 8
figure 8

Classification approach proposed by Cheng et al. (2020)

In a dataset, a plankton category contains dozens of images of the same species in multiple postures and positions. Nevertheless, a model requires even more images to attain high classification accuracy. Some works, such as the study described above, employed data augmentation, which consists of artificially increasing the dataset's size by applying transformations such as rotation, translation, and flipping to the already labelled data. Cheng et al. (2020) presented a new method to make a classifier invariant to plankton rotation, thereby eliminating the need for data augmentation. They first described each image with a polar representation, which depicts each pixel in terms of its distance and angle to the centre of the image, converting the rotational problem into a translational one. Classifiers such as CNNs are robust to small translations due to their internal architecture (pooling layers). The combination of polar representations with a CNN is thus a powerful technique for preventing the misclassification of two similar plankton images when one appears rotated relative to the other. With this idea in mind, the authors fed each original image into one CNN and its polar representation into another. They trained each CNN independently and merged the outputs of their fully connected layers as features for a subsequent classifier: an SVM (Fig. 8). The SVM obtained 98% accuracy when categorizing seven plankton species captured with PlanktonScope. This technique even outperformed a DenseNet CNN trained on augmented data of 14 times the original size, which achieved 96.5% accuracy. This work addressed one of the major issues of underwater artificial vision applications: the misclassification of similar objects captured at different angles. While polar representations had been employed for plankton in the past (see the first reviewed paper, Schlimpert et al. (1980)), the authors presented an advancement based on the properties of CNNs. This framework could be valuable if combined with other techniques addressing the problem of unbalanced categories in a dataset.
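A minimal sketch of such a polar transform (NumPy/SciPy; the sampling resolution is a hypothetical choice) follows:

```python
# A minimal sketch of the polar representation: each pixel is re-indexed
# by (radius, angle) about the image centre, so a rotation of the input
# becomes a translation along the angle axis.
import numpy as np
from scipy.ndimage import map_coordinates

def to_polar(image, n_radii=64, n_angles=64):
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radii = np.linspace(0, min(cy, cx), n_radii)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    r, a = np.meshgrid(radii, angles, indexing="ij")
    rows = cy + r * np.sin(a)            # Cartesian coordinates to sample
    cols = cx + r * np.cos(a)
    return map_coordinates(image, [rows, cols], order=1)  # (n_radii, n_angles)

# A rotated copy of 'image' yields a to_polar output that is circularly
# shifted along axis 1, which the pooling layers of a CNN tolerate.
```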

4.5 Semiautomated classification

Semiautomated classification aims to improve the accuracy of a classification model by combining the strengths of machine and human intelligence. A machine performs some tasks while a human performs others, such as the initial grouping of data and the final classification decision; the requirement for human interaction is kept to a minimum. This approach is advantageous when labelled data are scarce or when new images are too complex for a machine to label. Its application dates back to the early 1990s, when researchers needed to perform simple classification tasks, such as sorting images into categories, with some input and guidance from human operators.

Luo et al. (2004b) updated their previous SVM classification approach (Luo et al. 2004a) by adding active learning. First, they built a multiclass SVM based on a training set and an unclassified set. Then, they ran the model, and an expert labelled the images from the unclassified set that were classified with the lowest confidence. These manually labelled images were added to the training set, and the model was retrained. This procedure was iterated several times until the desired accuracy of 86.7% was achieved. The main advantage of this method was that it required four times fewer images to produce the same accuracy as a classifier built on a random training set, as shown in Fig. 1 of their paper. Additionally, the iterative retraining process saved the operators valuable time because it only requested labels for the most challenging images. This method is a reliable way to build a plankton dataset.
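A minimal sketch of one uncertainty-based query round, assuming a scikit-learn SVM with probability estimates and hypothetical array names, follows:

```python
# A minimal sketch of the active learning loop: label the images the
# current SVM is least confident about, then retrain on the enlarged set.
import numpy as np
from sklearn.svm import SVC

def active_learning_round(X_train, y_train, X_pool, n_query=20):
    clf = SVC(probability=True).fit(X_train, y_train)
    conf = clf.predict_proba(X_pool).max(axis=1)   # confidence per pool image
    query = np.argsort(conf)[:n_query]             # least confident first
    return clf, query   # the 'query' indices go to the expert for labelling

# The newly labelled queries are appended to (X_train, y_train) and the
# round repeats until the desired accuracy is reached.
```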

Grosjean et al. (2004) tested different automatic classifiers to categorize 29 groups of zooplankton. The images were acquired with ZooScan (Gorsky et al. 2009), a system built to scan biological samples. Among the 15 tested classifiers (including LDA, HC, and ANNs), they obtained the highest accuracy, 74.6%, with a double-bagging LDA classifier employing 27 textural and shape features. Nevertheless, the authors believed another approach was needed to attain higher accuracy, given species intravariability and the likelihood of training set mislabelling. Thus, they implemented a system to manually retag the items classified with low confidence, improving the performance to 85% at a speed of 10,000 items in less than 5 s. The authors confirmed that human intervention is essential for managing the complexity of categorizing many species.

Bell and Hopcroft (2008) assessed the open-source ZooImage software on Alaskan zooplankton to determine whether it would be reliable for future research on salmon survival rates. ZooImage was based on the R project and the ImageJ software developed by the National Institutes of Health of the U.S. government to process biomedical images (Rasband 1997; Abràmoff et al. 2004). First, they scanned the samples using a conventional flatbed desktop scanner (Epson Perfection 4990). Then, the authors used the default ZooImage configuration, which created vignettes of every particle found in the image (segmentation) for posterior feature extraction. The study included no information about the type of features employed by ZooImage; however, they probably consisted of textural and shape features, as in other works using ZooImage (Table I, Fernandes et al. (2009)). Next, they created a training set and evaluated six classification algorithms: LDA, a recursive partitioning tree, K nearest neighbours, LVQ, an ANN, and a random forest (RF). They obtained the best accuracy, 81.7%, with the RF algorithm when categorizing 53 species on field-collected samples, reducing the time required for quantitative analysis from hours to less than one hour. When they removed the discarded category, which accounted for more than 75% of all particles, the accuracy fell to 63.3%. Although they considered the results valuable, they anticipated that a semiautomated procedure would be required to attain higher accuracies, expecting its inclusion in future software updates. As mentioned throughout this section’s reviewed works, the semiautomated method is regarded as a technique for improving the accuracy of any model.

Gorsky et al. (2010) developed new software for plankton classification: the Plankton Identifier (PkID). They performed a classification sequence composed of ZooScan, ZooProcess, and PkID. First, the images were acquired with ZooScan, as in the previously mentioned work. Next, they employed the ZooProcess software (based on ImageJ) to preprocess, segment, and extract features. Then, they manually created a training set with 30 categories, which was reduced to 20 due to the low accuracy shown by some categories in PkID cross-validation tests; nevertheless, they kept some low-accuracy categories because of their ecological value. This reduction was made iteratively by merging categories with similar ecological values, resulting in higher accuracy (Fernandes et al. 2009). This training set achieved an overall accuracy of 78% using an RF algorithm. According to the authors, this performance was insufficient for ecological studies because it is driven primarily by the less abundant taxa. Thus, they increased the number of categories from 14 to 42, assisting the machine with labelling when needed and reaching 100% accuracy. The authors clearly showed that this semiautomated classification method is the best way to obtain the highest accuracy in taxonomic research, at the expense of taking more time than a fully automated approach.

A year later, Ye et al. (2011) presented a semiautomated method based on a Bayesian probabilistic model. Such a model represents uncertainty about events using probabilities; it starts with a prior premise about an event (the prior probability) and updates that premise with new information (the likelihood) to obtain a new probability (the posterior probability). After ZooScan image acquisition, the authors manually categorized every object, building balanced and unbalanced training sets comprising 28 planktonic and nonliving categories; the balanced training set had the same number of images in each class, whereas the unbalanced set did not. ZooProcess extracted features, and a naive Bayesian classifier was trained, yielding a relationship between the posterior probability and the cumulative recognition accuracy for each category. New images were first classified automatically based on the posterior probabilities, obtaining accuracies of 69% and 68% for the unbalanced and balanced training sets, respectively. Then, over 31% of the images, those with the lowest confidence levels, were manually reclassified by experts, raising the accuracy to 92% for both the unbalanced and balanced sets. This semiautomatic approach outperformed the method of Grosjean et al. (2004), and thanks to the Bayesian probability model, it could provide results for any aggregation level with a single calculation step. In theory, this technique seems ideal for plankton classification because it can handle category intravariability, missing data in unbalanced categories, and noisy datasets. However, as embodied in the previously reviewed works, a semiautomated approach can raise any model to its maximum accuracy.
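A minimal sketch of this posterior-probability triage, assuming a Gaussian naive Bayes classifier and a hypothetical confidence threshold, follows:

```python
# A minimal sketch of posterior-probability triage: confident predictions
# are accepted automatically, while low-confidence images are routed to
# experts for manual reclassification.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def triage(X_train, y_train, X_new, threshold=0.9):
    clf = GaussianNB().fit(X_train, y_train)
    posterior = clf.predict_proba(X_new)       # posterior per class
    confidence = posterior.max(axis=1)
    auto = confidence >= threshold             # accepted automatically
    manual = ~auto                             # sent to an expert (roughly the
                                               # lowest-confidence fraction)
    labels = clf.classes_[posterior.argmax(axis=1)]
    return labels, auto, manual
```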

Fig. 9
figure 9

Cost-effective active learning framework proposed by Wang et al. (2017)

Bochinski et al. (2019) applied a variation of active learning to zooplankton for the first time: cost-effective active learning (CEAL, Fig. 9 (Wang et al. 2017)). While active learning tries to maximize a model’s accuracy regardless of the required human effort, CEAL tries to avoid unnecessary human intervention, hence the "cost-effective" part of the name. First, images were segmented using Otsu thresholding, but only when rotation was needed, because the authors found that segmentation techniques removed plankton details. Then, they employed a CNN based on AlexNet and tested this method with two databases comprising four zooplankton species acquired with the In Situ Ichthyoplankton Imaging System (ISIIS) (Cowen and Guigand 2008). CEAL achieved the highest accuracy compared to the bare CNN and other active learning methods; therefore, CEAL should be considered in future developments focused on semiautomated approaches.

4.6 Classification via convolutional neural networks

Convolutional neural networks (CNNs) are extensions of ANNs that can take an image directly as input. The name derives from the convolution operator, which slides a small kernel (a matrix of weights) over the image, computing a weighted sum at each position. When correctly employed, convolutions can perform edge detection, image blurring, or sharpening. The topology of a CNN comprises multiple convolution layers that extract key image features. Interspersed with these layers, a CNN includes pooling layers, which summarize the features within a group of cells in the previous layer, further compressing the input data. The final layer of the CNN serves as input for an ANN that conducts the pattern classification. CNNs were first introduced in the late 1980s by Yann LeCun and his colleagues at the Pierre and Marie Curie University in Paris and AT&T Bell Labs. They have become the state of the art in computer vision applications, including image and video classification, object detection, and image generation.
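A minimal sketch of such a topology in PyTorch, assuming hypothetical 64x64 greyscale inputs and 10 classes, follows:

```python
# A minimal sketch of a CNN: convolution layers extract features, pooling
# layers summarize them, and a final fully connected layer classifies.
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution: learned filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: summarize 2x2 cells
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # ANN head: 64x64 input, 10 classes
)
```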

Members of the Reservoir Lab (Ghent University, Belgium) won the US National Data Science Bowl (NDSB) competition by classifying 121 classes of plankton acquired with the ISIIS (Dieleman et al. 2015). First, they performed zero-mean unit-variance (ZMUV) normalization and data augmentation; ZMUV normalization is a common preprocessing step in machine learning that ensures the data have a consistent scale. The authors combined two different rescaling strategies due to the species’ size diversity: one based on image size and the other based on a fixed factor. For classification, they employed a 13-layer CNN inspired by OxfordNet and an ANN fed with the most discriminative conventional features, such as image sizes, image moments, and Haralick textures. Inspired by their previous participation in the Galaxy Challenge, they added cyclic pooling to the CNN to make the network more robust to rotational plankton variations. Finally, they fused the results of these methods, obtaining a top-5 accuracy of over 98% and demonstrating their approach as a helpful tool for aiding experts in labelling tasks. They trained the models on Nvidia GPUs, with convergence times from 24 to 48 h depending on the model parameters. As a drawback, CNNs tend to overfit due to their complexity and their number of parameters relative to the training data, resulting in poor performance on new images. The authors overcame this issue, even for classes with as few as 20 samples per category, by applying dropout, weight decay, data augmentation, pretraining, pseudolabelling, and parameter sharing. In addition, cyclic pooling has proven so powerful that neural networks centred on it have since emerged, such as CyCNN and R2CNN. This work proved that high accuracy is possible with few samples, contrary to common deep learning wisdom.
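A minimal sketch of ZMUV normalization combined with rotation-based augmentation, using torchvision transforms and hypothetical dataset statistics, follows:

```python
# A minimal sketch of a ZMUV + augmentation pipeline; the mean and std
# are hypothetical placeholders that would be computed from the training set.
from torchvision import transforms

MEAN, STD = 0.5, 0.25   # hypothetical greyscale dataset statistics

augment = transforms.Compose([
    transforms.RandomRotation(degrees=360),        # plankton have no canonical orientation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[MEAN], std=[STD]),  # zero-mean unit-variance
])
```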

Dai et al. (2016) presented ZooplanktoNet, a CNN-based framework to classify a 13-class dataset acquired with ZooScan and provided by the Laboratoire d’Océanologie de Villefranche-sur-Mer, France. First, they normalized the dataset and augmented it via rotation, translation, rescaling, shearing, and flipping. Then, they tested four popular CNN architectures on the augmented and nonaugmented datasets as reference points: AlexNet, CaffeNet, VGGNet, and GoogleNet. The accuracies achieved on the augmented dataset were approximately 10% higher than those on the nonaugmented version. The authors then chose AlexNet as the basis for further experiments, testing different optimizations: different network depths, sizes, and numbers of convolutional filters, as well as the benefits of local response normalization (LRN) and rectified linear units (ReLUs). LRN is a normalization layer commonly used in early CNNs to help prevent overfitting, and a ReLU is an activation function that enables the learning of complex shapes and textures, improving overall model accuracy. The best results were obtained with an 11-layer network using LRN and parametric ReLU (PReLU) activation functions instead of the widely known ReLU; the PReLU function resolves the dying ReLU problem and performed better in this work. This architecture, denoted ZooplanktoNet, achieved an accuracy of 93.7% versus the 91.3% of the standard AlexNet, at the cost of ten more minutes of training time (38 min in total). The authors documented how a standard CNN such as AlexNet can be adjusted to obtain higher accuracy depending on the nature of the given dataset. Nonetheless, this model was designed specifically for images acquired with ZooScan, and its performance on images acquired by other means is unknown.
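A minimal sketch contrasting the two activation functions in PyTorch follows:

```python
# A minimal sketch of ReLU versus PReLU: PReLU learns the negative-side
# slope, so neurons never become permanently inactive ("dying ReLU").
import torch
import torch.nn as nn

relu = nn.ReLU()      # max(0, x): gradient is zero for all x < 0
prelu = nn.PReLU()    # max(0, x) + a * min(0, x), with learnable slope a

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(prelu(x))       # negative inputs keep a small learnable slope
```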

Xiu and Zuoying (2016) applied the residual CNN (ResNet, He et al. (2016)), then recently developed by Microsoft Research, to the 121-class plankton dataset employed in the 2015 National Data Science Bowl (NDSB) competition. The aim of this work was to overcome the accuracy degradation observed when increasing the depth of a CNN to 20 layers or more. After rescaling and augmenting the dataset, they trained three residual CNNs with depths of 19, 32, and 50 layers, plus a VGG-19 network as a reference, keeping the remaining parameters the same. The best network was the 32-layer version, with a top-5 accuracy of 95.8%. Although this accuracy was lower than the 98% achieved by the winning team in the NDSB competition, its computational cost was remarkable, running 6.5x faster at a rate of 0.1 s per test-set image. In general, ResNets perform better than traditional CNNs because they can be trained to much greater depths without suffering from the vanishing gradient problem. This problem is overcome by allowing gradients to bypass one or more layers (as shown in Fig. 4 of their paper) through an internal structure called a “residual connection”. During inference, ResNets might be expected to run slower than traditional CNNs, since they must compute both the transformed and the shortcut inputs and typically use more layers. Contrary to this expectation, the authors achieved faster inference, albeit lower accuracy, than the VGG-19 reference network. Perhaps this behaviour was due to the small dataset size, which the authors blamed for the result obtained in the 50-layer network test.
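
A residual connection can be sketched in a few lines of PyTorch; the block below is a generic illustration of the identity shortcut rather than the exact block used by the authors:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x.

    The identity shortcut lets gradients bypass the two convolutions,
    which is what makes very deep networks trainable.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the skip connection

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```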

Fig. 10 Hybrid CNN based on AlexNet for plankton classification (Dai et al. 2017)

One year later, Dai et al. (2017) presented a classification system composed of three CNNs in parallel. The idea was to exploit textural features that capture differences when shapes do not, and vice versa. To extract shape information, they first removed noise, smoothed the shapes, applied the Scharr operator, and optimized the contrast. To obtain texture information, they employed a Canny edge detector. The three CNNs were then arranged in parallel: the first was fed with shape features, the second directly with the original images, and the third with textures, converging into a softmax function (final layer) through a pyramid structure, as shown in Fig. 10. They classified a dataset containing 30 plankton classes acquired with the Imaging FlowCytobot (IFCB, Olson and Sosik (2007)). After several tests, they obtained maximum accuracies of 95.8% when employing networks based on AlexNet and 96.3% when using GoogLeNet. AlexNet is an 8-layer CNN introduced in 2012 by researchers at the University of Toronto; it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 with a top-5 error rate of 15.3%. GoogLeNet is a 22-layer CNN introduced in 2014 by researchers at Google; it won the ILSVRC in 2014 with a top-5 error rate of 6.6%. All tests were run on four 6-GB Nvidia GTX 980 Ti GPUs. Although the authors tried to exploit as much image information as possible, they barely achieved a 1% accuracy improvement over a single CNN (either AlexNet or GoogLeNet) fed with only the original images, as shown in Tables 2 and 3 of the original paper. This method therefore runs contrary to the trend of avoiding complicated feature calculations while showing little benefit.
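
The two auxiliary branches might be prepared along the following lines; this OpenCV sketch only reuses the operators named in the paper (Scharr for shape, Canny for texture), while the blur kernel and Canny thresholds are placeholder assumptions:

```python
import cv2
import numpy as np

def shape_and_texture_inputs(grey: np.ndarray):
    """Hypothetical preprocessing in the spirit of Dai et al. (2017).

    `grey` is assumed to be an 8-bit greyscale plankton image.
    """
    # Shape branch: smooth, then take the Scharr gradient magnitude.
    smoothed = cv2.GaussianBlur(grey, (5, 5), 0)
    gx = cv2.Scharr(smoothed, cv2.CV_32F, 1, 0)
    gy = cv2.Scharr(smoothed, cv2.CV_32F, 0, 1)
    shape = cv2.magnitude(gx, gy)
    # Texture branch: Canny edge map (thresholds are placeholders).
    texture = cv2.Canny(grey, 50, 150)
    return shape, texture
```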

Fig. 11 Classification approach with a single CNN based on AlexNet (Cui et al. 2018)

Cui et al. (2018) presented an alternative approach to that used in the previous work. The authors also combined original images with shape and textural features as classifier information. However, they obtained shape features by applying Gaussian low-pass filtering and textural features by applying Gaussian high-pass filtering followed by logarithmic image enhancement. Then, they merged the shape features, the texture features, and the original image into a more complex feature representation using a concatenation layer. Finally, the authors input the merged vector into a single CNN based on AlexNet (Fig. 11). This approach obtained 96.6% accuracy, outperforming the framework based on AlexNet from Dai et al. (2017) by 0.8%. Although this procedure did not achieve a noticeable accuracy improvement, its simplicity compared to the previous method makes it preferable for future framework developments.
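
A plausible sketch of this feature construction, with the filter width and the exact form of the logarithmic enhancement taken as assumptions, could look as follows:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_merged_input(image: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Sketch of the merged input in the spirit of Cui et al. (2018)."""
    low = gaussian_filter(image, sigma)    # low-pass: coarse shape
    high = image - low                     # high-pass: fine texture
    high = np.log1p(np.abs(high))          # logarithmic enhancement (assumed form)
    # Stack the three maps channel-wise, mimicking a concatenation layer.
    return np.stack([image, low, high], axis=0)
```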

Al-Barazanchi et al. (2018) applied a CNN classifier to the same seven-species plankton dataset employed in past works (Tang et al. 2006; Zhao et al. 2005, 2009, 2010; Li et al. 2014; Al-Barazanchi et al. 2015). After resizing all the dataset images, they fed them into an 18-layer CNN (based on VGGNet), achieving 98.2% accuracy on the testing dataset. Their CNN outperformed the best result from previous works, which had achieved 96.7% accuracy using a 10-layer CNN for feature extraction and an SVM as the classifier (Al-Barazanchi et al. 2015). This paper demonstrated that a deep CNN alone, even without data augmentation, can outperform the other methods tested. However, it would have been interesting to compare this architecture against their previous approach using a feature-extraction CNN of the same depth.

Luo et al. (2018) improved the parameters of the spatially sparse CNN formerly employed in the 2015 NDSB competition, where it finished third in the final rankings. In a traditional CNN, every spatial location of the input is processed, regardless of its content. In a spatially sparse CNN, computation is restricted to the active sites, i.e., locations whose receptive fields contain non-background pixels, so the empty background is skipped. Researchers created this type of CNN to recognize handwritten Chinese characters, where the strokes constitute sparse sets of pixels and the background is omitted from processing. In this work, the authors considered a plankton image to be similar to a handwritten character, as most of it is background. In this way, a spatially sparse CNN can learn faster, with less overfitting and a lower computational cost than a dense CNN. They classified the 2015 NDSB competition dataset after regrouping the training set into 37 classes according to ecological significance. After contrast normalization and the discarding of noisy and low-confidence images, data augmentation was performed on the rare classes via rotation, skewing, and scaling. Their 13-layer spatially sparse CNN obtained an F1-score of 0.881 and a precision of 90.7% on the nonrare biological groups (23 classes), taking approximately 24 h for training and 165 h for prediction on new images using 1536 Nvidia CUDA cores. The result was quite impressive considering the dataset size of 23.4 million images. However, this model may not be suitable for general plankton images, whose high variability in texture and illumination undermines the background-sparsity assumption. Moreover, this technique may soon become obsolete with the emergence of colour plankton databases, which present high degrees of variability.
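
The core idea of spatial sparsity, computing only where the receptive field touches foreground, can be illustrated with a toy NumPy convolution; this is a conceptual sketch, far simpler than the authors' actual implementation:

```python
import numpy as np

def sparse_conv2d(image: np.ndarray, kernel: np.ndarray,
                  threshold: float = 0.0) -> np.ndarray:
    """Toy spatially sparse convolution (valid padding).

    Output positions whose window contains no foreground pixel are
    skipped; with a zero background their result would be zero anyway.
    """
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    # Collect the "active" output sites reached by any foreground pixel.
    sites = set()
    for i, j in np.argwhere(image > threshold):
        for di in range(kh):
            for dj in range(kw):
                oi, oj = i - di, j - dj
                if 0 <= oi < oh and 0 <= oj < ow:
                    sites.add((oi, oj))
    for oi, oj in sites:
        out[oi, oj] = np.sum(image[oi:oi + kh, oj:oj + kw] * kernel)
    return out
```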

Soh et al. (2018) presented a CNN architecture capable of identifying individual plankton in images exhibiting homogeneous clumping, heterogeneous interspersion, or both. Homogeneous clumping occurs when plankton of the same species cluster together; heterogeneous interspersion occurs when different plankton species are randomly distributed in a sample. Both phenomena can affect the accuracy of plankton classification. The authors developed a 19-layer CNN based on the “You Only Look Once” (YOLO) architecture, which identifies and categorizes multiple objects in an image in a single forward pass. After manual labelling, they performed data augmentation: six geometric distortions and two pixel-value modifications. They then applied this classifier to 18 species of plankton, detecting 100% of the validation set with an average intersection over union (IoU) of 86%. The IoU is a metric for object detection algorithms that measures the overlap between the predicted bounding box and the ground-truth bounding box of an object. The model took 20 h to train and ran at 50 frames per second on the test set using one Nvidia GTX 980. Given the nature of the classifier, this method is a perfect fit for scattered plankton, but distinguishing individuals within clumped plankton is a different challenge, perhaps requiring extra preprocessing steps not covered in this work.
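
For two axis-aligned bounding boxes, the IoU is simple to compute; the (x1, y1, x2, y2) corner convention below is an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```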

Lumini and Nanni (2019) tested a classification approach consisting of the fusion of different CNNs. They tested three plankton datasets on each of the following pretrained networks: AlexNet, GoogLeNet, InceptionV3, VGG16, VGG19, ResNet50, ResNet101, and DenseNet. A pretrained network is a model that has already been trained on a large dataset, in this case the ImageNet database; its parameters are then adapted to the new dataset. For this task, the authors employed the sequential floating forward selection (SFFS) method, which picks the most informative features of a pretrained model for a new dataset. The retraining step was performed under three different strategies: one round of tuning (1R), two rounds of tuning (2R), and preprocessing tuning (PR). 1R consists of fine-tuning a pretrained model on the new dataset in a single round; 2R uses two rounds; and PR fine-tunes on a preprocessed version of the new dataset in a single round. For PR, the authors implemented preprocessing by applying data augmentation with four techniques: gradient, orient, local binary pattern, and local ternary pattern transformations. The three tested datasets had already been used in past studies; they were acquired with the IFCB, ZooScan, and ISIIS. The best single pretrained model was DenseNet, but the highest accuracy was obtained with an ensemble that applied the sum rule over the eight networks fine-tuned with both one and two rounds, as shown in Table 6 of their paper (Fus_2R + Fus_1R). In a comparison, this approach obtained an F-measure of 0.953 versus the 0.9 achieved by Zheng et al. (2017) on the same IFCB dataset. The same authors later published another study that outperformed this ensemble's accuracy; it employed the SFFS approach but instead selected, at each iteration, which classifier to add to the ensemble based on its performance (Lumini et al. 2019). The resulting SFFS ensemble comprised 11 classifiers and obtained an F-measure of 0.958 on the same IFCB dataset. As in the previously reviewed works, the authors showed how an ensemble classifier can outperform individual CNNs. The use of pretrained models can be a powerful time-saving technique when the new target is a high-dimensional and complex dataset. Nevertheless, the complexity of running several networks and fusing their results afterwards could be challenging, and further research may be needed to determine whether implementing this ensemble takes less time than optimizing a single CNN whose layers are adapted to the nature of the input images.
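
The sum rule itself is simple: the per-class probability outputs of the individual networks are added before taking the argmax, as the following sketch with hypothetical probabilities illustrates:

```python
import numpy as np

def sum_rule_fusion(prob_matrices):
    """Fuse classifiers by summing their (n_samples, n_classes)
    probability outputs and picking the highest-scoring class."""
    fused = np.sum(prob_matrices, axis=0)
    return np.argmax(fused, axis=1)

# Hypothetical outputs of two 3-class models on two samples:
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
print(sum_rule_fusion([p1, p2]))  # [1 1]
```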

Li et al. (2021c) presented a new model addressing two problems encountered in plankton classification: imbalanced categories and the loss of subtle features by a CNN during training. Imbalanced categories frequently occur in plankton classification datasets, where rare or poorly known taxa exist, negatively affecting a model's accuracy. Retaining subtle features, such as the shape of a head or the presence of appendages, is crucial for accurately distinguishing plankton species. The authors addressed the imbalance issue by creating synthetic images for the rare taxa using CycleGAN, a generative adversarial network (GAN) that generates new images similar to those employed as inputs; GANs were described by Yann LeCun, Chief Artificial Intelligence Scientist at Facebook, as “the most interesting idea in the last ten years in machine learning” (LeCun 2017). The authors then proposed a densely connected structure based on the YOLOv3 model to better capture the subtle features of plankton. In a densely connected structure, each layer receives the feature maps of all preceding layers, rather than only that of the immediately previous layer (as in traditional CNNs), allowing a more robust representation of the input data. The authors applied this method to classify the IFCB dataset, obtaining an average precision of 97.1% over eight classes. The method outperformed the others compared, with a detection time of 51 ms per testing image, suitable for real-time applications. Although the experiments were run in a desktop environment (Nvidia Titan RTX GPUs), the authors successfully ported the method to a low-power Jetson Nano board with a view to deploying it in autonomous underwater vehicles (AUVs). This work marked a turning point in plankton image classification, presenting a model that not only offers high accuracy but is also likely to be seen soon in an in situ device performing real-time plankton classification.
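
Dense connectivity can be sketched in PyTorch as follows; this generic dense block illustrates the concatenation pattern rather than the authors' exact YOLOv3-based design:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenated feature maps of all
    preceding layers, helping subtle low-level features survive
    into the deeper layers."""
    def __init__(self, in_channels: int, growth: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, 3, padding=1),
                nn.ReLU(),
            ))
            channels += growth  # inputs grow by concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16, growth=8, n_layers=3)(x).shape)  # [1, 40, 32, 32]
```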

5 Conclusions and future work

Scientists began by adopting machines developed in other fields to quantify and classify plankton using pattern recognition techniques. The research started with the Quantimet (IMANCO 1970), a system initially designed for analysing steel and later employed in multiple areas, from materials to the life sciences. Similar hardware designed specifically to digitize plankton specimens followed.

As Fig. 12 shows, early systems employed linear classifiers to group plankton by their shape features, but these techniques were still unable to achieve species-level identification due to poor image quality. In Europe, knowledge-based classifiers followed for plankton grouping, reaching a lower accuracy than linear discrimination but classifying twice the number of groups. In these studies, some authors removed appendages to ease the feature extraction process and separated specimens connected by antennae or swimming legs. In the early nineties, ANNs were introduced to categorize plankton and, by combining shape features with texture features for the first time on field-collected specimens, achieved performance similar to that of taxonomists. Some authors focused on which features yielded higher accuracy rather than on fine-tuning their network parameters, while others combined the results of several ANNs running in parallel. SVMs debuted in plankton classification during the same period, using local features and CNN-like feature generators and outperforming past techniques. In addition, researchers began to assist automatic classifiers with manual labelling of the specimens classified with lower confidence, achieving better results. Nonetheless, including a “discarded” category was a simple way to obtain these promising results (Hu and Davis 2005; Bell and Hopcroft 2008).

Fig. 12 Temporal coverage of reviewed papers

Recently, CNNs have defined a new era in image classification, shifting researchers' focus from feature engineering to network parameter optimization. These networks run faster and obtain better results than any other classifier applied to the same dataset, as shown in the literature. The state-of-the-art accuracy achieved by CNNs on problems ranging from 4 to 121 classes exceeds 90%. Authors have both reused existing architectures for plankton classification and produced new models adapted to this specific problem through parallel configurations or modifications of the internal CNN structure. Although plankton recognition has lagged behind other applications, the YOLO architecture, currently the most popular object detection algorithm, able to process images in real time with accuracy comparable to that of any other network developed to date, has already been applied to plankton detection. Eventually, almost seventy years after Nishizawa et al. (1954) performed direct underwater observations, research groups have come close to embedding high-end algorithms into underwater systems to categorize plankton in near real time for the first time. Perhaps the answer to complete plankton identification during this century will be a multipurpose system composed of acoustics, cameras, and DNA barcoding, in which artificial vision will play a key role.

Future research on plankton image classification should focus on generating new datasets comprising high-quality colour plankton images, such as the DYB-PlanktonNet dataset. For this purpose, new tools and equipment will be necessary, such as the new in situ underwater imaging system developed by the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (Li et al. 2021a). Colour images offer more information to classifiers, making it easier to achieve higher accuracies. A second step should focus on integrating state-of-the-art classifiers into low-power, high-performance dedicated hardware for classifying plankton in situ and in real time. These strategies, together with advanced energy-harvesting modules, will enable the automatic monitoring of plankton worldwide at different depths under an international agenda such as the Argo program (Scripps Institution of Oceanography 2000), an international project that collects ocean parameters from the ocean interior to the surface through approximately 4000 autonomous robotic instruments. If such a device is developed, it will be a game changer for studying and understanding plankton populations, potentially revolutionizing our understanding of the ocean and its role in the climate system.