Dataset exploited for the development and validation of automated cyanobacteria quantification algorithm, ACQUA

The estimation and quantification of potentially toxic cyanobacteria in lakes and reservoirs are often used as a proxy of risk for water intended for human consumption and recreational activities. Here, we present data sets collected from three volcanic Italian lakes (Albano, Vico, Nemi) that host filamentous cyanobacteria strains in different environments. The data sets were used to estimate the abundance and morphometric characteristics of potentially toxic cyanobacteria, comparing manual vs. automated estimation performed by ACQUA ("ACQUA: Automated Cyanobacterial Quantification Algorithm for toxic filamentous genera using spline curves, pattern recognition and machine learning" (Gandola et al., 2016) [1]). This strategy was used to assess the algorithm's performance and to set up the denoising algorithm. Abundance and total length estimations were used for software development; to this aim, we evaluated the efficiency of the statistical tools and mathematical algorithms described here. Image convolution with the Sobel filter was chosen to denoise input images from background signals; spline curves and the least squares method were then used to parameterize detected filaments and to recombine crossing and interrupted sections, in order to obtain precise abundance estimates and morphometric measurements.


© 2016 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Subject area: Biology, Mathematics
More specific subject area: Image analysis

Bright field images are often affected by a background color gradient and low contrast; convolution with the Sobel filter offers, in a single operation, a computationally fast solution to both problems.
The parameterization of filamentous cyanobacteria through spline curves, approximated using the least squares method, makes it possible to recombine crossing and overlapping filaments and to perform accurate morphometric measurements and abundance evaluations.

Data
The data shown in this paper are the evaluation data sets of ACQUA [1]. The samples cover four different background conditions from different Italian lakes (Fig. 1). For each water sample, 10 images were acquired and carefully analyzed manually by four independent operators. More than 500 potentially toxic cyanobacteria filaments were analyzed to compare automatic vs. manual estimations; scatter plots and comparison tables for the different data sets are shown. The mathematical methods used for processing are then described to explain the pre-processing algorithm.

Experimental design, materials and methods
The presented experiments aim to assess the precision of ACQUA in highlighting filaments of interest and in evaluating the abundance and total length of filaments in natural water samples. Each image was carefully analyzed to obtain filament abundance, length, width, genus, crossing and interrupted sections, and other statistical parameters. The mathematical algorithms used to this aim are also described, and the implemented Matlab code is available.
Sobel filter convolution was used as the central operation of the pre-processing algorithm. Spline curves combined with the least squares method were adopted to parameterize, measure and reassemble interrupted filaments.
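The least squares parameterization mentioned above can be illustrated with a deliberately simplified Python sketch (the authors' implementation is in Matlab and is not reproduced here). A single parametric cubic polynomial stands in for the spline curve: it is fitted to hypothetical centerline points by solving the normal equations, and the filament length is then read off the fitted curve. All function names and the sample points are assumptions made for illustration only.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_cubic(t, values):
    """Least-squares coefficients c0..c3 of c0 + c1*t + c2*t^2 + c3*t^3."""
    # Normal equations (A^T A) c = A^T v, with A[i][k] = t_i ** k.
    ata = [[sum(ti ** (j + k) for ti in t) for k in range(4)] for j in range(4)]
    atv = [sum((ti ** j) * vi for ti, vi in zip(t, values)) for j in range(4)]
    return solve_linear(ata, atv)


def evaluate(coeffs, t):
    """Evaluate the polynomial with the given coefficients at t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def fitted_length(points, samples=200):
    """Fit x(t), y(t) to the points (at least 4) and return the curve length."""
    n = len(points)
    t = [i / (n - 1) for i in range(n)]  # uniform parameter in [0, 1]
    cx = fit_cubic(t, [p[0] for p in points])
    cy = fit_cubic(t, [p[1] for p in points])
    curve = [(evaluate(cx, s / samples), evaluate(cy, s / samples))
             for s in range(samples + 1)]
    return sum(math.dist(p, q) for p, q in zip(curve, curve[1:]))


# Hypothetical centerline pixels of a straight filament from (0, 0) to (30, 40).
points = [(3 * i, 4 * i) for i in range(11)]
length_px = fitted_length(points)
length_um = length_px * 0.32  # pixel size reported for the 10x objective
print(f"{length_px:.2f} px = {length_um:.2f} um")
```

A real spline would use several polynomial pieces joined at knots, which lets it follow strongly curved filaments; the single cubic used here only shows how a least squares fit turns scattered pixels into a smooth, measurable curve.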

Microscopy and image acquisition
Imaging was performed using a light microscope (Carl Zeiss Axiovert 100) with a 10×/0.25 objective (Achrostigmat) and a Canon 600D digital camera (18 Mpx). Image dimensions were 5184 × 3456 pixels, and the resolution at 10× magnification was 0.32 μm/px. The area covered by each Field Of View (FOV) was 1.82 mm².
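As a quick check of the acquisition geometry, the short Python sketch below converts the reported image size and pixel resolution into physical dimensions. The function names are hypothetical. With the rounded pixel size of 0.32 μm/px the computed area is about 1.83 mm², close to the reported 1.82 mm²; the small difference presumably comes from rounding of the pixel size.

```python
# Values reported in the text for the 10x objective.
WIDTH_PX = 5184
HEIGHT_PX = 3456
UM_PER_PX = 0.32  # micrometres per pixel


def fov_area_mm2(width_px, height_px, um_per_px):
    """Area of one field of view in square millimetres."""
    width_mm = width_px * um_per_px / 1000.0
    height_mm = height_px * um_per_px / 1000.0
    return width_mm * height_mm


def px_to_um(length_px, um_per_px=UM_PER_PX):
    """Convert a length measured in pixels to micrometres."""
    return length_px * um_per_px


area = fov_area_mm2(WIDTH_PX, HEIGHT_PX, UM_PER_PX)
print(f"FOV area: {area:.2f} mm^2")
```

The same conversion factor applies to any filament length measured in pixels, for example `px_to_um(100)` for a 100-pixel filament.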

Sample preparation
Natural, mixed-species phytoplankton samples were collected from three volcanic Italian lakes, Lake Albano, Lake Nemi and Lake Vico (the latter at two stations, Caprarola and Ronciglione), in June and November 2014. At each sampling station, a 1 L sample was taken with a Niskin bottle at −5 m depth and fixed with Lugol's iodine solution. Subsamples of 25 ml were sedimented in Utermöhl chambers (24 mm width × 55 mm height) for 8 h. For each sample, 10 non-overlapping Field Of View images were acquired. Filament abundance and total length were estimated automatically by ACQUA and manually by four independent operators using the Java-based software ImageJ. The image dataset is available online at http://www.mat.uniroma2.it/~gandola/ACQUA/-Figure2DiB.rar.

Data set for internal validation of natural sample
Water samples were collected between June and November 2014 at four different sampling stations, in order to capture different background noise from different lakes and seasons. The scatter plots in Fig. 2a, c, e, g show the comparison between automated and manual estimations of filament total length, while the graphs in Fig. 2b, d, f, h show the correlation in terms of filament abundance for each FOV. The slopes of the major axis fitting lines for the Caprarola, Ronciglione and Albano samples lie between 0.96 and 1.18, which indicates a good linear response of the ACQUA estimations. In addition, the r² value is always > 0.84 and the p value is always < 0.0001, supporting the significance of the results. In the case of the Nemi sample, the data are biased by the presence of a large amount of background noise. To compare background noise between different samples, we calculated the percentage of elements of interest in the FOVs. For example, for the Caprarola abundance estimation (Fig. 2d), the average value of the total area covered by filamentous objects is 54.5%, and of these 54.3% are filaments of interest (Table 1). Under these conditions the correlation of the fitting line is r² = 0.97 and the related slope is 0.99 (Fig. 1d). In the case of the Nemi sample (Fig. 2h), only 27.4% of the objects present in the images were filamentous, and of these just 19.1% represented elements of interest; in many cases the filaments of interest were < 100 μm in length. Under these conditions, the r² value is 0.72 and the slope is 0.75. This example represents a limit case characterized by a low concentration of toxic filaments of interest and a large amount of background noise. These data confirm that the results are linear under the most common conditions of natural water samples, but particular attention is required in the presence of a large amount of background noise.
The data shown in Table 1 also allow a detailed analysis of each image for every measured parameter, i.e. filament count, filament total length, number of correctly reconstructed filaments, total number of filamentous objects, and percentage of filamentous object area relative to the total area covered by any element. ACQUA successfully reconstructs 83% of broken and overlapping filaments, ensuring a good estimation of both filament length and abundance.

Fig. 1a, c, e, g show the comparison between automated and manual estimations of filament total length. Fig. 1b, d, f, h show the correlation between automated and manual estimations of filament abundance. Regression lines are calculated with the Major Axis fitting algorithm, and horizontal error bars represent the standard deviation between the manual estimations performed by four different operators.
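For readers unfamiliar with major axis fitting, the Python sketch below shows one standard way to obtain the slope, intercept and r² of a major axis line, together with the per-FOV mean and standard deviation across operators that would feed the horizontal error bars. It is a minimal illustration, not the authors' Matlab code; the function names and the counts are hypothetical and are not taken from the data set.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def major_axis_fit(x, y):
    """Return (slope, intercept, r2) of the major axis line through (x, y)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; slope is undefined")
    # Slope of the first principal axis of the (x, y) scatter.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Hypothetical filament counts: one row per FOV, one column per operator.
manual_by_operator = [[10, 11, 9, 10], [20, 19, 21, 20],
                      [30, 31, 29, 30], [40, 41, 39, 40]]
automated = [11, 19, 31, 39]  # hypothetical automated counts for the same FOVs

manual_mean = [mean(ops) for ops in manual_by_operator]  # x values
manual_sd = [sample_sd(ops) for ops in manual_by_operator]  # error bars
slope, intercept, r2 = major_axis_fit(manual_mean, automated)
print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```

Unlike ordinary least squares, the major axis line treats both variables symmetrically, which suits a comparison in which the manual reference is itself affected by operator variability.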

Convolution with Sobel Kernel
To emphasize local contrast, we perform a discrete two-dimensional convolution (Formula A.1) between the input image, represented by the function f, and the Sobel kernel (Formula A.2), represented by the function g. The symbols y and x denote the height and the width of the input image, respectively, and n_1 and n_2 denote the filter indices. The Sobel filter is composed of two different kernels (g_v to enhance vertical contrast and g_h to enhance horizontal contrast) and therefore requires two separate passes to obtain the final recombined function H.

Table 1. Manual and automated count and filament length are given, and typical issues such as reconstructed filaments, total filamentous objects in FOV and percentage of filament elements to noise are quantified.
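The two-pass Sobel convolution described in this section can be sketched in plain Python as follows. The kernels are the standard 3 × 3 Sobel kernels. Two points are assumptions, since Formulas A.1 and A.2 are not reproduced here: which kernel corresponds to g_v and which to g_h, and the recombination of the two passes into H as the gradient magnitude. Border pixels are simply left at zero.

```python
import math

# Standard 3x3 Sobel kernels; the mapping to g_v / g_h is an assumption.
G_V = [[-1, 0, 1],
       [-2, 0, 2],
       [-1, 0, 1]]  # responds to intensity changes along x
G_H = [[-1, -2, -1],
       [0, 0, 0],
       [1, 2, 1]]  # responds to intensity changes along y


def convolve2d(image, kernel):
    """Discrete 2D convolution on interior pixels; borders are left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            acc = 0.0
            for n1 in (-1, 0, 1):
                for n2 in (-1, 0, 1):
                    # Convolution flips the kernel: f(y - n1, x - n2) * g(n1, n2).
                    acc += image[y - n1][x - n2] * kernel[n1 + 1][n2 + 1]
            out[y][x] = acc
    return out


def sobel_magnitude(image):
    """Two passes (one per kernel), recombined as the gradient magnitude."""
    gv = convolve2d(image, G_V)
    gh = convolve2d(image, G_H)
    return [[math.hypot(a, b) for a, b in zip(row_v, row_h)]
            for row_v, row_h in zip(gv, gh)]


# Hypothetical 4x4 image: dark left half, bright right half (a vertical edge).
edge = [[0, 0, 10, 10] for _ in range(4)]
flat = [[7, 7, 7, 7] for _ in range(4)]  # uniform background, no structure

edge_response = sobel_magnitude(edge)
flat_response = sobel_magnitude(flat)
print(edge_response[1][1], flat_response[1][1])
```

The uniform image gives a zero response everywhere, while the step edge gives a strong one, which is why a single Sobel pass pair can suppress a flat background and enhance local contrast at the same time.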

Transparency document. Supplementary material
Transparency data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2016.06.042.