Chemotaxonomy as a tool for interpreting the cryptic diversity of Poaceae pollen

Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.


Introduction
The correct identification of pollen grains is an important factor in any research area that uses pollen assemblages to make inferences about vegetation. These research areas can be as diverse as palaeoecology (Germeraad et al., 1968;Mander and Punyasena, 2014), forensics (Horrocks et al., 1998;Mildenhall et al., 2006) and melissopalynology (Herrero et al., 2002;Martin, 2005), as they all share a reliance upon the taxonomic resolution of pollen identification to maximise the accuracy and usefulness of their data. Looking further back into geological time, palynological research has played a fundamental role in understanding plant origination and radiation (e.g. the origin and radiation of vascular plants (Rubinstein et al., 2010), and the radiation of the angiosperms (Lupia et al., 1999)), and shaped our understanding of how the terrestrial biosphere responded to mass extinction events (Looy et al., 2001;Tschudy et al., 1984). This highly diverse group of studies all shares a reliance upon the taxonomic resolution of pollen identification to maximise the accuracy and usefulness of their data. The utility of pollen and spores as an archive becomes reduced, however, when taxonomic resolution leads to a loss of information (Bush, 2002).
The Poaceae (grass) family exemplifies this problem, as it comprises 11,554 currently accepted species in 759 genera (The Plant List, 2013), which exist across a wide climatic gradient, from Antarctica to tropical lowland rainforest. Yet pollen grains from this family are almost indistinguishable below family level using light microscopy, therefore they are generally not classified below 'Poaceae' by the majority of palynologists (Faegri et al., 1989;Holst et al., 2007;Strömberg, 2011). Consequently Poaceae pollens are essentially a rich yet currently underdeveloped archive ripe for palynological research.
Extensive research over the last four decades has used a variety of tools to determine if the identification of Poaceae pollen to below family level is possible. This analysis has been on individual grains using: (i) surface pattern analysis of images of pollen grains obtained through scanning electron microscopy (SEM) (Andersen and Bertelsen, 1972;Mander et al., 2013;Waikhom et al., 2014), (ii) detailed morphometric analysis considering whole grain and pore morphology (Joly et al., 2007;Schüler and Behling, 2010), and (iii) confocal microscopy of pollen exines (Salih et al., 1997). A success rate in identifying Poaceae pollen to species level of 85.8% has been achieved through SEM (Mander et al., 2013), and this technique has even allowed differentiation of cultivars (Datta and Chaturvedi, 2004). These methods, although successful, are Review of Palaeobotany and Palynology 235 (2016) 140-147 time consuming and require considerable sample preparation, laboratory work, and expertise. Therefore, from a practical perspective, the application of these techniques to palaeoecological questions has not yet occurred.
Fourier Transform Infra-Red spectroscopy (FTIR) has recently been used to differentiate pollen taxonomically, demonstrating it is possible to distinguish between plant orders, and in some cases to species level (Dell'Anna et al., 2009;Pappas et al., 2003;Zimmermann, 2016Zimmermann, , 2010. FTIR analysis has also been successfully used in characterising pollen surface compounds (Pummer et al., 2013). FTIR analysis generates absorbance spectra, with bands relating to chemical bonds within specific functional groups. The size, shape and position of these bands provides information about the type of bonds present and their chemical environments, which, in the case of biopolymers such as sporopollenin, can be very complex (de Leeuw et al., 2006;Fraser et al., 2014bFraser et al., , 2011Watson et al., 2012Watson et al., , 2007. Interpretation of FTIR spectra relies upon knowledge of the type of bonds likely to be present in a substance, and how they might vary. In this study, we treat spectra statistically and use classification algorithms to identify pollen, thus removing the need for in-depth biogeochemical analysis. Spectra produced by FTIR analysis are affected by a number of operational factors, such as intensity of beam, thickness of sample and thickness of slide (if using a microscope enabled FTIR). Spectra may be noisy if the sample to be scanned (and therefore aperture size) is small, or the material is of poor quality, for instance if pollen grains are degraded. Degradation of the samples used in this study is not expected to be significant, although may be present, as some chemical changes have been observed over short time (hours-days) periods (Zimmermann et al., 2015). Changes in spectra driven by degradation can, however, be accounted for by using statistical processing techniques prior to analysis (Zimmermann and Kohler, 2013). For example, use of algorithms such as Savitsky-Golay smoothing can alleviate noisiness, but potentially remove useful information such as subtleties in shape of bands from spectra if their parameters are not calibrated properly, whereas generating first and second derivatives of spectra may result in degradation of the signal-to-noise ratio (Brown et al., 2000;Zimmermann and Kohler, 2013). The chemical structure of sporopollenin, is known to be very stable over geological time  and resistant to diagenetic alteration (Watson et al., 2007;Fraser et al., 2014a), meaning that the interpretation of the fossil record may benefit from the application of this technique.
Here we show that analyses of FTIR spectra from a selection of Poaceae taxa can be used to successfully identify pollen grains. Using a simple nearest neighbour classification algorithm our results have very similar levels of success when compared to much more expensive and labour intensive methods currently deployed, such as SEM (Mander et al., 2013). Therefore, FTIR based analyses raise the possibility of a further exploration of the grass pollen record.

Sample collection and preparation
A total of twelve grass taxa were analysed from eight subfamilies (Table 1) across the grass phylogeny, as outlined in the latest publication by 'The Grass Phylogeny Working Group' (Grass Phylogeny Working Group II, 2012). The sampling strategy employed ensured a wide phylogenetic spread whilst also enabling analysis of lower-order identification by sub-sampling some subfamilies, such as the Ehrhartoideae. Poaceae pollen was obtained from herbarium specimens at the Royal Botanic Gardens, Kew (London, UK) by dissecting out stamen from individual florets. Where possible, two or more specimens for each species were sampled, and specimens from Ghana or neighbouring tropical West African nations were preferentially sampled, to complement current palaeoecological (fossil pollen) investigations at Lake Bosumtwi, Ghana (Miller and Gosling, 2014), and to reduce large-scale environmental variability as much as possible.

Chemical analysis
The pollen was washed in acetone and allowed to air-dry on zincselenide slides. Groups of two or more pollen grains clustered together were examined using a Continuum IR-enabled microscope with a 15× reflechromat objective lens and nitrogen-cooled MCT-A detector in transmission mode. The microscope was linked to a Thermo Nicolet Nexus (Thermo Fisher Scientific, Waltham, MA, USA) FTIR bench unit at The Open University. Spectra were averaged over 256 scans per sample, and background scans were taken before each sample to alleviate any atmospheric contributions. Visual inspection of spectra and atmospheric suppression correction was conducted using OMNIC software (Thermo Fisher Scientific, Waltham, MA, USA).

Data processing and analysis
Average spectra were calculated from multiple replicates for every sample (Fig. 1). These average spectra were inspected visually, and comparisons of selected absorbance bands were compiled (after Steemans et al., 2010) to determine potential structural drivers of the statistical patterns observed ( Table 2). The absorbance bands chosen were based on those used by other researchers investigating sporopollenin composition. Bands that do not vary between taxa are omitted from visual inspection; for instance, the broad OH band at 3300 cm −1 is omitted, as it is present in all taxa in the same form and thus provides no visually quantifiable classification information. The bands included in the visual inspection and references to papers which have used them in investigations of sporopollenin are as follows: C=C band at 3070 cm −1 , (Fraser, 2008); vasCH 2 and vsCH 2 at 2925 cm −1 and 2850 cm − 1 respectively, and vC=O at 1710 cm − 1 , Watson et al., 2007); vasCH 3 at 2960 cm − 1 , ; vsCH 3 at 2890 cm − 1 , (Fraser, 2008); C=C non-conjugated at 1660 cm −1 , (Fraser et al., 2014b;Steemans et al., 2010;Zimmermann and Kohler, 2014), OH at 1630 cm −1 (Fraser, 2008); C=C (aromatic ring stretch) at 1500 cm −1 , (Fraser et al., 2014b;Lomax et al., 2008;Watson et al., 2007); CH n (asymmetric bending) at 1460 cm − 1 and CH 3 (symmetric bending) at 1375 cm −1 , ; C=C or CH n at 720 cm −1 , Zimmermann and Kohler, 2014).
All average spectra were z-score standardised (i.e. standardised to zero mean and unit variance) by finding their mean amplitude, subtracting the mean from the actual values, and dividing by the standard deviation. When no other treatments were applied, these z-score standardised spectra are referred to as 'Unprocessed Spectra' (see Fig. 2 for information on processing). These standardised spectra were not subject to variations in signal amplitude due to variable sample thickness (Duarte et al., 2004;Jardine et al., 2015). Standardisation of spectra (and all other statistical manipulations) were performed in R Further processing of raw data was conducted to investigate various extraneous factors that may have impacted upon the analyses. Such processing involved one or more of the following: (i) truncation, (ii) baseline correction, and/or (iii) atmospheric suppression of the region from 1800 to 2700 cm −1 (Fig. 2) to remove the effects of CO 2 and scattering from that region ('Truncated Spectra'). Baseline correction using the R package baseline (Liland and Mevik, 2015) was conducted because some unprocessed spectra exhibited climbing baselines on visual inspection ('Baseline Corrected Spectra'). Atmospheric suppression using the OMNIC atmospheric suppression algorithm and z-score standardisation was performed to assess the impact of water vapour across the spectra ('Atmospheric Suppressed Spectra'). Collectively, this approach allowed us to compensate for possible noise and/or atmospheric effects and to directly compare against the 'unprocessed spectra'. Fig. 2 shows a visual summary of the stages performed during spectral processing.
The final stage of processing was the application of 'spectral pre-processing' sensu Zimmermann and Kohler (2013) to both the raw spectra and all of the modified spectra. This final stage of the processing involved generating first and second derivatives of spectra, and applying Savitsky-Golay spectral smoothing (Zimmermann and Kohler, 2013). The Savitsky-Golay smoothing technique applies an algorithm which approximates the spectrum using polynomial least-square fitting to a moving window. Both polynomial order, and window size used affect the resultant spectrum. As different window sizes may be optimal for different regions of the spectrum, and this study aimed to provide a simple tool for pollen identification and thus used the whole spectrum, a window size of 11 was chosen; this falls within the range of optimal values defined by Zimmermann and Kohler (2013). The R package Prospectr was used to perform Savitsky-Golay smoothing and derivation of spectra (Stevens and Ramirez-Lopez, 2013). Principle Component Analysis (PCA) was used to visualise all resultant data from the processing steps detailed above.
Using the R-package, Class (Venables and Ripley, 2002), k-nearest neighbour (k-nn) analyses were performed, using leave-one-out cross-validation to generate identification success rates. This analysis uses a Euclidean distance matrix to classify individual spectra based upon their nearest neighbours in the matrix, and then removing one

Table 2
Chemical bond, wavenumber of absorbance band and presence in difference Poaceae subfamilies, after Steemans et al., 2010. 'sh' indicates a shoulder, '+' a weak absorbance band, '++' a strong absorbance band and '+++' very strong. * indicates where an absorbance band differs in strength within subfamily and '−' indicates an absence of that absorbance band. '~' before wavenumber indicates the position of a band may vary. In style of Steemans et al. (2010).

Group
Wavenumber ( data point (a single spectrum) before re-introducing it for classification. The number of nearest neighbours is instrumental in classification success, as increasing k expands the analysis to include the next closest points, with classification decided by majority vote (the sample is classified based on the most common taxon among its nearest neighbours). So for k = 1, only the closest point is included in the analysis, whereas in k = 5, the 5 nearest neighbour points are included, with the classification being decided by majority rule, or the most common taxon among those points. To test whether or not the success of the classification algorithm was due to sample size, we used a null model which assigns names to points randomly within the Euclidian distance matrix generated from the original data, and then applies the classification test to the new matrix, repeating the process 999 times. This process allows the distribution of random successful classifications to be compared to the actual classification success.

Results
FTIR spectra of pollen from different grass subfamilies display broadly similar characteristics (Fig. 1). Nevertheless, there are small variations in the spectra between subfamilies and lower orders of classification (Table 2) (de Leeuw et al., 2006;Watson et al., 2007). Some absorbance bands are present in, and are similar between, all taxa analysed, such as the C=C stretching band of C=C\ \H group at 3070 cm − 1 , the vasCH 2 and vsCH 2 bands at 2925 cm − 1 and 2850 cm −1 respectively, and the C=C aromatic ring stretch at around 1500 cm −1 (Table 2). Other bands are variable both between and within subfamilies, such as the vasCH 3 and vsCH 3 bands at 2960 cm −1 and 2890 cm − 1 , respectively (Table 2). Absorbance bands that vary between and within subfamilies include the non-conjugated C=C band at 1600 cm −1 , the asymmetric bending CH n at 1460 cm −1 , the symmetric bending CH 3 band at 1375 cm −1 and the cis-substituted C=C/CH n rocking band at 720 cm −1 . Some absorbance bands exhibit more within-subfamily variation (* in Table 2), such as the band at 1630 cm −1 , whereas others display less within-subfamily variation, but differ in presence and shape between subfamilies, as exemplified by the band at 720 cm −1 .
PCA of the pre-processed spectra (Fig. 3) shows that groupings are present but complex. Most subfamilies appear broadly grouped together but with significant spread over one or both PCA1 or PCA2. The Ehrhartoideae, for instance, are distributed along PCA1, but more closely grouped along PCA2. The Bambusoideae, however, are less clearly grouped, with a wide spread across both axes. No subfamily is clearly separate from the others, with all exhibiting some degree of overlap with other subfamilies. Neither are there clear phylogenetic groupings, as subfamilies do not cluster according to degree of relatedness (Grass Phylogeny Working Group II, 2012). Loadings plots of the PCA show significant contributions from CO 2 (at around 2400 cm −1 on the x-axis). This contribution is not as prevalent in treated spectra (see SOM for details).
The combined treatment of first derivative with Savitsky-Golay smoothing returns the most accurate classification, with successful classifications at subfamily level reaching 80%. Other treatments, including no pre-processing, first and second derivatives, and second derivative with Savistsky-Golay smoothing did not result in as high a proportion of successful classifications (Fig. 4) at any taxonomic level. The maximum success achieved by the algorithm to genus level was 77%, to species level 77%, and individual level 69%. A random permutation test confirmed that the chances of achieving the successful classification proportions observed if no pattern was present were 0.01%. The success rates at subfamily level for the different treatments outlined in Fig. 2 are presented in Table 3. This shows that unprocessed and truncated spectra have the highest success rates, at 80% and 82% respectively, while all Atmospheric Suppressed Spectra show lower identification success rates. Subfamily level results have been presented here for the sake of brevity and comparison, but for full results, including k values, see SOM.

Discussion
From visual (Fig. 1) and PCA (Fig. 3) analyses of the FTIR spectra it was possible to discern differences between subfamilies via analysis of absorbance band strength and nature although, as is demonstrated by Table 2, chemical differences between taxa are complex and subtle. The exact composition of sporopollenin remains enigmatic, due to its inert nature and resistance to decay or chemical degradation, even over very long periods of geological time . Sporopollenin is, however, known to comprise aromatic components such as para-coumaric acid and ferulic acid, linked by esther linkages and aliphatic components (Boom, 2004;de Leeuw et al., 2006;Fraser et al., 2014b;Watson et al., 2012). Based upon Table 2, and a knowledge of the broad structure of sporopollenin, it is likely that chemical variation between taxa arises from variation in aliphatic chain length (shown by the variation in shape and presence of CH n bands in Table 2) and degree of saturation (shown by variation in C=C bands). Further chemical analyses via gas chromatography/mass spectrometry (GC/MS) should be able to confirm these tentative conclusions, but as we are primarily concerned with the development of an identification tool, not the chemical structure of sporopollenin, this analysis is not pertinent here. Some scattering artefacts and noise are evident in the spectra, likely the result of either: (i) the absence of physical atmospheric suppression during sample analysis, and/or (ii) for 8% of samples, sufficiently large groups of pollen grains were not present on the slide, so that some single-grain measurements had to be taken (see SOM for details of numbers of grains scanned for each sample). We consider, however, that these artefacts and noise are random in their nature and would affect all samples equally. A full analysis of the impact of pollen grain number of identification success rate can be found in the supplementary information (SOM).
The first and second derivatives, Savitsky-Golay smoothing and combinations thereof, comprise some of the most commonly used processing techniques used in analysis of FTIR generated spectra (Zimmermann and Kohler, 2013). Using first and second derivatives without Savistky-Golay smoothing on unprocessed spectra shows a decrease in classification success across all taxonomic levels compared to Fig. 3. PCA plots of Raw and Raw SG1 (Savitsky-Golay smoothed, first derivative) spectra and, below them, their loadings plots, illustrating areas of the spectrum which contribute most strongly to the signal. raw spectra, from a raw data success rate at species level of 74% to 49% success for first derivative spectra and 37% for second derivative data (see Fig. 3). This is likely due to the exaggeration of the contribution of noise to the analysis of un-smoothed data, which is eliminated by the point-averaging smoothing effect of the Savitsky-Golay smoothing algorithm (Zimmermann and Kohler, 2013).
Truncated Spectra show the highest success rate, at 82%, with Unprocessed Spectra at 80% and Baseline Corrected Spectra at 81% (Table  3). Spectra which were subjected to atmospheric suppression showed large reductions in identification success rates (Table 3). As the Atmospheric Suppression was performed using an algorithm provided by the OMNIC software, we consider that it is therefore likely the suppression algorithm may remove useful information from the grass pollen spectra. This is because FTIR, and the associated OMNIC software are most commonly used to analyse pure chemical samples, whereas the identification of grass pollen relies on subtle chemical variations in sporopollenin. Further, we believe the OMNIC atmospheric suppression may remove useful information from spectra because Truncated Spectra return higher identification success rates, whilst having wavenumbers removed which do not contain useful information, only CO 2 contributions and scattering effects (Mohlenhoff et al., 2005).
Pollen grains used in this study were washed in acetone, which would have removed surface chemicals and traces of insecticide or other contaminants. It is likely, however, that internal material such as intine and possibly cytoplasm remained present in the grains. This means that the signal observed from grass pollen is not that of 'pure' sporopollenin, but, rather, represents the pollen grain as a whole entity (Blokker et al., 2006). For the purposes of modern pollen identification, this does not pose an immediate issue, as it is likely that cytoplasmic and pollen-wall associated compounds contribute to the taxonomic signal detected in this study, and others which have focussed on other taxonomic groups such as the Pinales (Zimmermann, 2010). For fossil pollen samples, or pollen that has been acetolysed, however, some of this information may be lost, either due to chemical alteration during processing (Jardine et al., 2015), or lack of non-sporopollenin pollen components in preserved grains. In these cases, it may be beneficial to use FTIR analysis in conjunction with other, structural methods of distinguishing between similar taxa, such as those described by Sivaguru and Mander (Mander et al., 2013;Sivaguru et al., 2012). Environmental factors could affect success of classification, as it has been demonstrated that factors such as UVB irradiance (Fraser et al., 2014a;Lomax et al., 2008), year-on-year environmental factors (Zimmermann and Kohler, 2014) and heat stress (Jiang et al., 2015;Lahlali et al., 2014) have an impact upon the chemical composition of pollen grains and spores. Our samples were also taken from a relatively narrow geographical and temporal range (tropical West Africa, and within the 20th Century), compared to the time ranges and geographical ranges dealt with in the fossil record. Therefore, we suggest that the effects of any UVB fluxes are likely to be minimal in this sample set, as UVB changes occur over longer time scales and wider geographical areas (Magri, 2011). The fact that classification success improves from individual to species level, where individuals from the same species have been collected in different areas and different years (SOM) demonstrates a clear taxonomic signal. The absence of a significant improvement between species and subfamily level suggests that although this is a successful classification technique, it is not a phylogenetic one. Although the number of samples varies between taxa, the randomisation algorithm demonstrates that successful classification is not due purely to chance.
These results show that differentiation between Poaceae taxa below family level is possible using the relatively fast and inexpensive method of FTIR microspectroscopy. At subfamily level, it is possible to achieve an 80% classification success rate, and at species level, a 77% classification success rate. The ability to identify grass pollen to subfamily level, or below, allows for a more detailed interrogation of grass pollen record. One specific benefit of increased taxonomic resolution could be the recovery of a more complete ecological picture from the fossil pollen record and the determination of previously hidden plant-climate relationships. For instance, the identification of pollen from the Puelioideae subfamily would indicate a forest-origin for the grass pollen observed, whereas Ehrartoideae might suggest a more open habitat (Grass Phylogeny Working Group II, 2012).

Conclusions
We have shown that FTIR analysis followed by spectral and statistical processing has the potential to significantly improve pollen identification within the Poaceae. Our data demonstrate that it is possible to achieve an 80% successful classification rate to subfamily level for Poaceae pollen, which, when applied, will allow new insights into  taxonomic resolution in fossil pollen records. The rapidity and relative low costs of FTIR analyses make this a potentially very useful method for subfamily identification of Poaceae pollen.