Multimodal image and spectral feature learning for efficient analysis of water-suspended particles

We have developed a method to combine morphological and chemical information for the accurate identification of different particle types using optical measurement techniques that require no sample preparation. A combined holographic imaging and Raman spectroscopy setup is used to gather data from six different types of marine particles suspended in a large volume of seawater. Unsupervised feature learning is performed on the images and the spectral data using convolutional and single-layer autoencoders. The learned features are combined, and we demonstrate that non-linear dimensional reduction of the combined multimodal features can achieve a high clustering macro F1 score of 0.88, compared to a maximum of 0.61 when only image or spectral features are used. The method can be applied to long-term monitoring of particles in the ocean without the need for sample collection. In addition, it can be applied to data from different types of sensor measurements without significant modifications.


Introduction
In situ analysis of liquid-suspended particles has applications in environmental monitoring, healthcare, and water quality control [1][2][3]. In particular, monitoring of suspended particulate matter in the ocean requires the relative abundance of different particle types to be understood [4,5]. Often these particles have sparse distributions (10 to several hundred particles/L) [6]. Non-destructive methods such as digital holography can image suspended particles in large volumes (∼12 mL/s) of water with a high spatial resolution (∼20 µm) without the need for any sample preparation [7][8][9][10]. Digital holographic cameras have been extensively used in marine monitoring to obtain information about particle size and shape [11][12][13], using machine learning techniques [14][15][16] to automatically identify different particle types. However, for particles like microplastics, morphological information alone is not sufficient to distinguish the different materials [17]. Knowledge of their chemical composition is important to understand the origin, route, and consequences of environmental pollution [18]. Recently, the authors demonstrated holographic imaging and Raman spectroscopy for non-destructive analysis of water-suspended microplastic particle composition [19]. While the Raman spectroscopic analyzers previously used for in situ surveys observed backscattered light from a target [20,21], our setup observes forward-scattered light and shows that both holographic imaging and Raman spectroscopic signals can be obtained from water-suspended particles using a single, compact optical setup. While the optical setup to perform combined imaging and spectroscopic measurements of particles has been demonstrated, it is also necessary to develop analytical methods that can efficiently process multimodal data in order to take full advantage of such a setup.
In multimodal data fusion analysis, a well-known topic is the audio-visual emotion challenge, which aims to develop machine learning methods for automatic audio, visual and audiovisual emotion analysis [22]. Similar to how human beings naturally process multimodal information [23], a number of publications have reported that multimodal fusion analysis of speech data (e.g., vocal effect) and visual data (e.g., facial expression) improves emotion recognition accuracy over unimodal analysis [24][25][26]. In addition, novel multimodal deep-learning based methods have been demonstrated to further increase the accuracy [27,28]. Data fusion applications have been expanded to a wide range of multi-sensory data analysis [29], such as biomedical diagnostics [30,31], pharmacy [32,33], automatic robot navigation [34], and remote sensing [35]. However, these methods have not previously been applied to the identification of marine particle types and materials, owing to the limited availability of multi-sensor setups for particle analysis.
In this paper, we demonstrate the automatic clustering and classification of different types of marine particles by applying a simple data fusion technique to morphological (i.e., holographic images) and chemical (i.e., Raman spectra) data. We propose a multimodal learning method using autoencoders and further t-SNE dimensionality reduction, and compare the classification accuracy between unimodal and multimodal data with and without t-SNE. We investigate how unsupervised feature learning methods can be used to automatically extract and further combine multimodal features from different types of sensor measurements, and use these to efficiently identify different particle types.

Samples
Experiments were performed on plankton, foraminifera, minerals and microplastic particles, which were chosen based on their relevance to climate change and pollution monitoring [1,36]. These were measured in artificial seawater, which is often used for method validation in marine sensing applications [37][38][39], to minimize the effect of water quality fluctuation on images and spectra. Plankton absorbs around 50 billion tons of carbon each year, accounting for 40% of atmospheric CO₂ removal [40,41]. Removed carbon is either stored as organic carbon, as in the case of the copepods used in our experiments, which are one of the most abundant zooplankton species in the ocean, or as inorganic carbon, as in the case of foraminifera, a single-cell organism with an external shell made of calcium carbonate. Our experiments also study sphalerite rock fragments, which are a common sulfide mineral in ores. The ability to monitor sulfide particle distributions is important for studying the potential impacts of sub-sea mining [42]. Finally, we investigate polypropylene (PP) and polyethylene (PE) pre-production plastic pellets (nurdles). PP and PE are selected since these are the most common types of microplastics found in aquatic environments [43]. We also investigate PE fragments that were collected from the ocean. The particle types and sample numbers for each type are summarized in Table 1.
Copepods were collected from the surface seawater during the KM20-11 cruise of the research vessel (R/V) Kaimei in December 2020 and kept in a freezer to preserve their morphological characteristics. The samples were defrosted using lukewarm water before the measurement. Dried foraminifera samples (Calcarina gaudichaudii) were collected from Okinawa, Japan. The sphalerite rock fragments were collected from Daikoku Ore in Saitama, Japan. PP and PE nurdles were provided by Daikei Chemical, Inc. PE fragments were recovered from the surface seawater in Osaka Bay, Japan in September 2018. These samples were separated from other particles by first dissolving biotic organic matter and then performing Fourier transform infrared spectroscopy on the dried residue to identify the PE fragments. All particles used in our experiments had a dimension between 1 and 5 mm, and 3 different samples of each particle type were measured to assess the performance of our method.

Setup
The integrated in-line holographic imaging and Raman spectroscopy setup used in our experiments is shown in Fig. 1 and has previously been described in Ref. [19]. A quartz glass cell of length 20 cm and diameter 20 mm (Starna Scientific, 34-Q-200) was filled with artificial seawater and illuminated by a collimated laser of 10 mm beam diameter. A single longitudinal mode continuous wave (CW) laser (Oxxius, LCX-532S-300) beam with a wavelength of 532 nm was delivered via a single-mode fiber. The beam exiting the fiber was collimated and passed through a bandpass filter (Semrock, LL01-532-25) before entering the measurement cell. The laser power was set at 160 mW at the output of the bandpass filter. After passing through the measurement cell, the beam was split using a 532 nm dichroic beam splitter (Semrock, Di03-R532-t1-25x36). The reflected beam was used for holographic imaging. It passed through an attenuation filter (Sigma Koki, MFND-25-0.1) before a hologram was recorded by a two-dimensional complementary metal-oxide semiconductor (CMOS) 2464 × 2056 pixel array (JAI, GO-5100-USB). Images were taken continuously with a 50 µs exposure time. Light with wavelengths longer than 532 nm was transmitted through the beam splitter and collected for Raman spectroscopy via a set of lenses (Thorlabs, F810SMA-543) that was mounted to a multi-mode fiber (Thorlabs, M29L01). A 532 nm longpass filter (Semrock, BLP01-532R-25) was placed before the fiber to ensure blocking of the 532 nm beam. A spectrometer with a wavenumber range from 200 to 3100 cm⁻¹ and a resolution of 10 cm⁻¹ (Wasatch Photonics, WP-532-A-S-ER-10) was used. The acquisition period was set at 5 s to maximize the signal-to-noise ratio while avoiding saturation.

Data acquisition
The holographic imaging detector records the interference patterns generated from the interaction between the unscattered laser beam (reference beam) and the light scattered by the particles (object beam). To recover information on particle morphology, the interference patterns are reconstructed as described previously by the authors [10,44,45], using the angular spectrum method [46,47]. Copepods, foraminifera, and mineral particles immediately sank to the bottom of the measurement cell while the plastics floated due to their buoyancy. Therefore, the relative distances between the samples, laser and detector were consistent for each particle type. Figure 2(a) shows examples of bright field microscopic images of the samples. Figure 2(b) shows the corresponding reconstructed holographic images of the seawater-immersed particles that were measured using the experimental setup. Morphological characteristics unique to copepods (i.e., antennae and legs) and foraminifera (i.e., spines) are clearly seen in the holographic images, whereas the other particle types are not readily distinguished. For each sample, 100 holographic images were taken, where the measurement cell was shaken and rotated between images so that the samples were imaged from different angles and directions. The width of the images was trimmed to 2056 pixels so as to cut off the unilluminated region, and it was manually confirmed that the whole sample was visible in all images. The images were normalized so that each image's maximum and minimum pixel intensities were 1 and 0, respectively.
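The reconstruction and normalization steps described above can be illustrated with a generic angular spectrum propagator. This is a minimal sketch and not the reconstruction code of Refs. [10,44,45]: the function names, the pixel pitch, the refractive index of the medium and the propagation distance are all assumptions introduced for illustration.

```python
import numpy as np


def angular_spectrum_propagate(field, wavelength, pixel_pitch, z, n_medium=1.0):
    """Propagate a 2-D field by a distance z (metres) with the angular spectrum method.

    wavelength is the vacuum wavelength; n_medium (assumed) rescales it
    to the wavelength inside the medium.
    """
    lam = wavelength / n_medium
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_pitch)  # spatial frequencies along x
    fy = np.fft.fftfreq(ny, d=pixel_pitch)  # spatial frequencies along y
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / lam**2 - FX**2 - FY**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Transfer function; evanescent components (arg <= 0) are discarded.
    transfer = np.where(arg > 0, np.exp(1j * kz * z), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)


def normalize_minmax(image):
    """Rescale an image so that its minimum is 0 and its maximum is 1."""
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hologram = rng.random((128, 128))  # stand-in for a recorded hologram
    # Assumed values: 3.45 um pixel pitch, n = 1.34, 5 cm back-propagation.
    field = angular_spectrum_propagate(hologram, 532e-9, 3.45e-6, -0.05, n_medium=1.34)
    image = normalize_minmax(np.abs(field))
    print(image.shape, image.min(), image.max())
```

In practice the reconstruction distance is not known in advance and would be selected per particle, for example with a focus metric; that step is omitted here.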
A total of 120 Raman spectra were taken for each sample. To reduce noise, 50 spectra were randomly selected and averaged, and this process was repeated using the boot-strapping method [48] to produce 100 unique spectra [37]. The background spectrum was taken using the same setup without any target particles and the signal was averaged in the same way. Each averaged spectrum was normalized by setting the S-O stretching peak at 981 cm⁻¹ to have unitary intensity. This peak was chosen as it is always present in seawater due to dissolved SO₄²⁻ [49]. The background spectrum was subtracted from the averaged spectrum for each particle sample to remove the contributions of the optical setup and seawater. The spectral range from 300 to 1711 cm⁻¹ (309 pixels) was used for analysis since few Raman peaks occur outside this range. Fluorescence signals were modeled in this range and subtracted using an eighth- or ninth-order polynomial with an asymmetric truncated quadratic cost function, depending on the sample. The fit was performed with the MATLAB "backcor" function [50], which estimates background signals by minimizing a non-quadratic cost function, and the most suitable order was determined experimentally.
Figure 2(c) shows examples of processed Raman spectra for each sample type. Strong Raman peaks of PP and PE (PP: 809, 841, 1152, 1167, 1330, 1458 cm⁻¹; PE: 1062, 1130, 1170, 1295, 1418, 1440, 1461 cm⁻¹ [51]) are observed in the spectra of nurdles, as these samples are semi-transparent, enabling high-efficiency collection of forward Raman scattering, while for other particles the Raman peaks are generally less distinct due to the high opacity of the targets. Peaks at 1062, 1295, and 1440 cm⁻¹ are observed in the spectra of PE fragments, although they are not as strong as those seen in PE nurdle spectra due to interference from green pigments. An intense band from carotenoid is seen at 1521 cm⁻¹ [52] in copepod spectra. A peak assigned to the symmetric stretching vibration of the CO₃²⁻ ion is seen at 1090 cm⁻¹ [53] in the foraminifera spectra, while other unidentified peaks are also observed. The overall intensities of mineral spectra are weaker than those of other spectra, with no strong peaks observed.
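The averaging, normalization, background subtraction and cropping steps of the Raman preprocessing can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, sampling without replacement is assumed because the text does not specify it, the background is reduced to a single mean spectrum, and the fluorescence removal performed with "backcor" is not reproduced.

```python
import numpy as np


def bootstrap_average(spectra, n_select=50, n_out=100, replace=False, seed=0):
    """Average n_select randomly drawn spectra, repeated n_out times.

    spectra has shape (n_spectra, n_pixels). Whether the draw is with or
    without replacement is an assumption (default: without).
    """
    rng = np.random.default_rng(seed)
    averaged = np.empty((n_out, spectra.shape[1]))
    for i in range(n_out):
        idx = rng.choice(spectra.shape[0], size=n_select, replace=replace)
        averaged[i] = spectra[idx].mean(axis=0)
    return averaged


def normalize_to_peak(spectra, wavenumbers, peak=981.0):
    """Scale each spectrum so the pixel nearest `peak` (cm^-1) has unit intensity."""
    k = int(np.argmin(np.abs(wavenumbers - peak)))
    return spectra / spectra[..., k:k + 1]


def preprocess(sample_spectra, background_spectra, wavenumbers, lo=300.0, hi=1711.0):
    """Return background-subtracted spectra cropped to [lo, hi] cm^-1."""
    sample = normalize_to_peak(bootstrap_average(sample_spectra), wavenumbers)
    # Assumption: the background is reduced to one mean spectrum.
    background = normalize_to_peak(background_spectra.mean(axis=0), wavenumbers)
    keep = (wavenumbers >= lo) & (wavenumbers <= hi)
    return (sample - background)[:, keep], wavenumbers[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    wn = np.linspace(200.0, 3100.0, 600)
    base = 1.0 + np.exp(-((wn - 981.0) / 10.0) ** 2)  # synthetic seawater band
    peak = 0.5 * np.exp(-((wn - 1295.0) / 8.0) ** 2)  # synthetic particle band
    sample = base + peak + 0.01 * rng.random((120, wn.size))
    background = base + 0.01 * rng.random((120, wn.size))
    spectra, wn_crop = preprocess(sample, background, wn)
    print(spectra.shape, wn_crop[0], wn_crop[-1])
```

Because both the sample and the background are scaled to unit intensity at 981 cm⁻¹ before subtraction, the seawater sulfate band cancels at that pixel in the output.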

Unsupervised feature learning
We investigated autoencoder-based unsupervised feature learning approaches to group the different particle types. The advantage of unsupervised methods is that they do not rely on human-labeled data for training, which do not always exist and are often time consuming to generate [54]. Autoencoders are a generic type of unsupervised feature learner that has been well established for the analysis of imagery, including holographic images [55]. They consist of an encoder network, which reduces the input data down to smaller latent representations, and a decoder network that attempts to reconstruct the original data from the compressed latent representation. The latent representations obtained by optimizing both networks to minimize the difference between the original inputs and their reconstructions can be used as features for clustering and classification tasks [56]. Classification based on features extracted using autoencoders can outperform the use of features with traceable physical meaning, such as those obtained by principal component analysis [57,58]. A key advantage is that they are unsupervised, and can flexibly manage different sizes and dimensionality of data inputs as well as the size of the latent feature space representations they output, without significant modification of their underlying form, which is suitable for multimodal data [29].
Figure 3 illustrates the proposed multimodal holographic image and Raman spectrum feature learning. A convolutional autoencoder was used to extract features from the holographic image reconstructions. Deep-learning convolutional autoencoders based on AlexNet have been successfully developed for sub-sea image classification [59,60]. When applied to holographic images, clustering performance was found to improve when a modified AlexNet was used in which the fully-connected layers were replaced by two convolution layers [45]. Here we used the same modified AlexNet-based deep learning autoencoder described in Ref. [45], which was well tuned for in-line holographic images. The entire dataset (1800 images) was used to train the network after reducing each image to 227 × 227 pixels to fit the input layer. When only images were used in the subsequent analysis, 16 latent features were extracted based on recommendations of prior work [59]. This was reduced to 8 when features were combined with those extracted from spectra so that the total number of extracted features was maintained. Information about the particle type was only used for performance validation and was not used in training. The Raman spectra obtained with our setup are one-dimensional (309 × 1) and have a significantly smaller data size than the holographic images. A single-layer autoencoder was used to learn features, where the latent representation size was set to 16 when only spectral information was used, and to 8 when features were combined with those extracted from holographic images.
Once features were extracted from the encoders, k-means clustering was used to group particles. This method was chosen as it is unsupervised and so does not require any human-labeled training data. We note that while different unsupervised clustering approaches such as random forest and self-organizing maps, or supervised methods such as support vector machines, neural network classifiers or Gaussian processes may improve overall scores, the focus of this paper is on improving the quality of the features used for subsequent analysis, and such optimization of clustering or classification methods is beyond our scope.
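To make the idea of a single-layer autoencoder for the spectra concrete, the following is a minimal sketch with one hidden (latent) layer trained by plain gradient descent on the mean squared reconstruction error. The tanh activation, linear decoder, learning rate, number of epochs and all names are assumptions for illustration; the text does not state these details of the authors' network.

```python
import numpy as np


def train_autoencoder(X, n_latent=8, lr=0.1, epochs=500, seed=0):
    """Train a single-hidden-layer autoencoder on X of shape (n_samples, n_inputs).

    Returns an `encode` function and the list of per-epoch reconstruction losses.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_latent))  # encoder weights
    b1 = np.zeros(n_latent)
    W2 = rng.normal(0.0, 0.1, (n_latent, d))  # decoder weights
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # latent representation
        X_hat = H @ W2 + b2               # linear reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_out = 2.0 * err / (n * d)
        gW2, gb2 = H.T @ g_out, g_out.sum(axis=0)
        g_hidden = (g_out @ W2.T) * (1.0 - H ** 2)
        gW1, gb1 = X.T @ g_hidden, g_hidden.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2

    def encode(A):
        """Map spectra to their latent features."""
        return np.tanh(A @ W1 + b1)

    return encode, losses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    spectra = rng.random((100, 309))  # stand-in for 309-pixel Raman spectra
    encode, losses = train_autoencoder(spectra, n_latent=8)
    print(encode(spectra).shape, losses[0], losses[-1])
```

Setting `n_latent` to 16 or 8 corresponds to the spectra-only and combined-feature cases described above.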
The number of clusters was set to 6, which equals the number of particle types used in this study. We investigated two grouping methods. The first method is feature-level fusion, and directly uses the latent representations. The second method is model-level fusion and uses non-linear dimensional reduction to further compress the latent representations prior to clustering. For the direct approach, k-means clustering was carried out directly on the features extracted from holographic images (condition D1), Raman spectral data (condition D2), and on the combined features (condition D3), respectively. The latent space was set so that the final number of features used for clustering was the same, at 16 features, across all experimental conditions to allow for a fair comparison. For the reduced approach, a further reduction from 16 to 2 dimensions is achieved using the non-linear t-distributed stochastic neighbor embedding (t-SNE) [57,61]. Clustering was performed on the reduced two-dimensional features extracted from holographic images (condition R1), Raman spectral data (condition R2), and on the combined features (condition R3), respectively. Clustering performance was assessed using confusion matrices and the F1-average score (i.e., macro F1 score [62]), where cluster to particle type correspondence was achieved by determining the largest number of particles of a given type falling within each cluster. The different experimental conditions investigated in this work are summarized in Table 2.
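The direct (D3) and reduced (R3) conditions for combined features can be sketched with scikit-learn as below. This is a hedged illustration rather than the authors' pipeline: simple concatenation is assumed as the way the 8 image and 8 spectral features are combined, the function name is hypothetical, and the t-SNE and k-means settings are library defaults rather than values reported in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def cluster_fused_features(image_feats, spectral_feats, n_clusters=6, reduce=True, seed=0):
    """Concatenate two feature sets and cluster them with k-means.

    reduce=False clusters the fused features directly (as in condition D3);
    reduce=True first embeds them in 2-D with t-SNE (as in condition R3).
    """
    features = np.hstack([image_feats, spectral_feats])  # e.g. 8 + 8 = 16 features
    if reduce:
        features = TSNE(n_components=2, random_state=seed).fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
    return features, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in latent features: 6 synthetic groups of 30 samples each.
    centers = rng.normal(0.0, 5.0, (6, 16))
    latent = np.vstack([c + rng.normal(0.0, 0.5, (30, 16)) for c in centers])
    embedded, labels = cluster_fused_features(latent[:, :8], latent[:, 8:])
    print(embedded.shape, np.bincount(labels))
```

The unimodal conditions (D1, D2, R1, R2) follow the same steps with a single 16-feature input instead of the concatenated pair.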

Results and discussion
Figure 4 shows the t-SNE plots of the latent representations extracted from (a) holographic images, (b) Raman spectroscopy, and (c) their combination. The color of data points indicates particle type (black: copepod, red: foraminifera, blue: mineral, pink: PP nurdle, purple: PE nurdle, green: PE fragment). The marker shape indicates three different samples among the same type of particles (circle, cross, bar). Table 3 shows the confusion matrix result of k-means clustering applied directly to the extracted features, and Table 4 shows the result of clustering applied to the extracted features that have been further reduced using t-SNE. The clustering groups A-F were automatically allocated to the six clusters using the combination that gives the best F1-average score. Table 5 shows the F1-scores for each particle type and processing condition.
Using features extracted from holographic images alone (D1, R1), it can be seen that copepods and foraminifera form one mixed cluster. The remaining four particle types form the second cluster. This can be understood by looking at the examples in Fig. 2, where copepods and foraminifera have complex shapes, while the remaining particle types have a simpler form. PE fragments have an angular shape that distinguishes them from the rounder mineral particles and the PE and PP nurdles, which is reflected in the increased separation between PE fragments and the other particle types. Clustering with k = 6 results in groupings with mixed particle types, where an overall trend that two clusters dominate is reflected in the confusion matrices for D1 (Table 3(a)) and R1 (Table 4(a)). The F1-score averages are higher for clustering after using t-SNE for dimensional reduction than for direct use of the latent representations.
For Raman spectral data (D2, R2), 13 distinct groupings can be seen, where for most particle types the individual samples are separated. While copepods and minerals form their own groups for all samples, other particle types form two or three separate clusters for each type, which are not necessarily close together in the latent representation space. This reflects the sensitivity of Raman spectroscopy based features to differences in the individual samples regardless of particle type. The over-discrimination is seen in the confusion matrices for D2 and R2 in Tables 3(b) and 4(b), respectively. The individual samples fall in or out of the six clusters in a binary manner, where the precision and recall rates for direct use of extracted features vary from 0 to 100%. Although this trend is improved after t-SNE, the overall accuracy according to F1 scores is reduced, where dimensional reduction results in poorer accuracy for the plastic particles in particular. The results show that it is not possible to reliably cluster features from Raman spectra to map onto the 6 particle types. The average F1 scores for holographic images (D1) and Raman spectra (D2) have similar values of around 0.5 and 0.6, respectively, where further dimensional reduction improves the score for holographic images (R1), but not for Raman spectra (R2).
Combining the features from holographic images and Raman spectra improves the F1 scores for both the direct (D3) and the reduced t-SNE based (R3) clustering. In particular, dimensional reduction results in significant performance gains where both data types are combined. This is seen with foraminifera, where direct use of the latent representations has poor precision and recall, but dimensional reduction improves these from 3% to 97% and 2% to 66%, respectively. D3 and R3 confusion matrices are shown in Table 3(c) and Table 4(c), respectively.
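The allocation of cluster groups to particle types and the macro F1 score can be illustrated with a short sketch. It follows one reading of the text, namely an exhaustive search over cluster-to-type assignments that keeps the assignment with the highest macro F1; the function name and the toy data are hypothetical, and the sketch assumes there are no more clusters than particle types.

```python
from collections import Counter
from itertools import permutations


def best_cluster_assignment(true_types, cluster_ids):
    """Return (mapping, macro_f1) for the cluster-to-type assignment with the best macro F1.

    Each cluster is mapped to a distinct particle type. Types left without a
    cluster contribute an F1 of 0 to the average.
    """
    type_names = sorted(set(true_types))
    clusters = sorted(set(cluster_ids))
    pair_counts = Counter(zip(cluster_ids, true_types))  # contingency table
    type_totals = Counter(true_types)
    cluster_totals = Counter(cluster_ids)

    best_mapping, best_score = None, -1.0
    for perm in permutations(type_names, len(clusters)):
        total = 0.0
        for cluster, ptype in zip(clusters, perm):
            tp = pair_counts[(cluster, ptype)]
            # F1 = 2TP / (2TP + FP + FN) = 2TP / (cluster size + type size)
            total += 2.0 * tp / (cluster_totals[cluster] + type_totals[ptype])
        score = total / len(type_names)
        if score > best_score:
            best_mapping, best_score = dict(zip(clusters, perm)), score
    return best_mapping, best_score


if __name__ == "__main__":
    # Toy example with three types; one "b" particle falls in the "c" cluster.
    truth = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
    clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
    mapping, score = best_cluster_assignment(truth, clusters)
    print(mapping, round(score, 3))
```

With six clusters and six types the search covers 720 assignments, which is inexpensive because each candidate is scored from the precomputed contingency counts.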
Table 5 shows that combining features gives the highest F1 score for all particle types investigated in this work. The highest average F1 score of 0.88 is obtained for condition R3, where combined features after non-linear dimensional reduction using t-SNE are used. This score is 0.25 higher than for the directly combined case, and ≥0.27 higher than when the holographic image or Raman spectrum based features are used in isolation. Condition D3 gives the second best results. For condition R3, all particle types have F1 values over 0.79, demonstrating reliable mapping of the clusters onto the particle types of interest. The large performance gain obtained when non-linear dimensional reduction is applied to the combined features suggests that it makes effective use of the favorable characteristics of each measurement type. The t-SNE plot in Fig. 4(c) shows that copepods, minerals, and PP nurdles form groups with well separated boundaries. One sample of PE nurdles forms a group that is independent of the others, and one sample of foraminifera merges with a cluster of PE fragments. In both cases, this is likely due mainly to the Raman spectral features, as these trends are also seen in the t-SNE visualization of Raman spectral latent representations (Fig. 4(b)). This could be mitigated by using fewer features from the Raman spectra. In future work aiming at real-sea applications, fine tuning of models, including selecting the best combination of the number of features among different data types, will be performed to improve clustering and classification performance.
The results show that features extracted using an appropriately designed autoencoder and further use of t-SNE for non-linear dimensional reduction significantly improve the quality of the features available to describe different particle types, and this improvement enhances classification accuracy. For application to in situ monitoring of marine particles, the method needs to be verified on larger numbers and types of particles to be more representative of the variety of morphological and compositional combinations that exist in nature. However, the study has demonstrated a novel approach to combine features learned from multiple different sensing modes, which improves clustering performance for a diverse range of marine particle types. Since the proposed method of combining and blending features can be applied to any input data type using encoded latent representation spaces, the method forms a versatile approach to combine measurements taken from multiple sensors with different data types and sizes, and makes efficient use of the favorable characteristics of each measurement type.

Conclusion
We have proposed a novel method to combine features extracted from images and spectra of seawater-suspended particles. Features were first extracted from data taken of the same target using an integrated setup for holographic imaging and Raman spectroscopy. Convolutional and single-layer autoencoders were used for holographic images and Raman spectra, respectively. While combining latent representations (feature-level fusion) slightly enhanced the macro F1 average score, the performance is further significantly improved by performing non-linear dimensional reduction (model-level fusion) using t-SNE on the combined latent representations. This increases the macro F1 score from 0.63 to 0.88, and the use of combined features outperformed a single information source for all particle types studied in this work.
Although our experiments used holographic images and Raman spectroscopy, the proposed method can be adapted to other types of sensor measurements. Convolutional and conventional autoencoders can learn and extract features from two- and one-dimensional data types (e.g., images and spectra, respectively) without the need for labeled training datasets. Since dimensional reduction is performed on the feature space, it can efficiently combine features derived from other sensing methods and be applied to other measurement targets with minimal modification.

Fig. 1. Experimental setup. A 532 nm single longitudinal mode laser is used to illuminate samples suspended in bulk artificial seawater. A beam splitter is used to take holographic images and Raman spectra using the same setup with different exposure times.

Fig. 2. Examples of (a) bright field microscopic images, (b) reconstructed holographic images and (c) processed Raman spectra for each particle type.

Fig. 3. Diagram of processes for combining features extracted from the holographic image and Raman spectra, which are used for clustering either directly or after applying t-SNE dimensional reduction.

Fig. 4. t-SNE visualization of latent representations extracted from (a) holographic images, (b) Raman spectra, and (c) their combination. The color of data points indicates particle type and the shape indicates three different samples among the same type of particles.

Table 3. Confusion matrix between particle type and the clustering result created using k-means for (a) holographic images D1, (b) Raman spectra D2, and (c) combined D3 latent representations. A-F indicate clustering groups.

Table 4. Confusion matrix between particle type and the clustering result created using k-means after t-SNE dimensional reduction for (a) holographic images R1, (b) Raman spectra R2, and (c) combined R3 latent representations. A-F indicate clustering groups.