Detect and exploit hidden structure in fatty acid signature data

. Estimates of predator diet composition are essential to our understanding of their ecology. Although several methods of estimating diet are practiced, methods based on biomarkers have become increasingly common. Quantitative fatty acid signature analysis (QFASA) is a popular method that contin-ues to be re ﬁ ned and extended. Quantitative fatty acid signature analysis is based on differences in the signatures of prey types, often species, which are recognized and designated by investigators. Similarly, predator signatures may be structured by known factors such as sex or age class, and the season or region of sample collection. The recognized structure in signature data inherently in ﬂ uences QFASA results in important and typically bene ﬁ cial ways. However, predator and prey signatures may contain additional, hidden structure that investigators either choose not to incorporate into an analysis or of which they are unaware, being caused by unknown ecological mechanisms. Hidden structure also in ﬂ uences QFASA results, most often negatively. We developed a new method to explore signature data for hidden structure, called divisive magnetic clustering (DIMAC). Our DIMAC approach is based on the same distance measure used in diet estimation, closely linking methods of data exploration and parameter estimation, and it does not require data transformation or distributional assumptions, as do many multivariate ordination methods in common use. We investigated the potential bene ﬁ ts of the DIMAC method to detect and subsequently exploit hidden structure in signature data using two prey signature libraries with quite different characteristics. We found that the existence of hidden structure in prey signatures can increase the confusion between prey types and thereby reduce the accuracy and precision of QFASA diet estimates. Conversely, the detection and exploitation of hidden structure represent a potential opportunity to improve predator diet estimates and may lead to new insights into the ecology of either predator or prey. The DIMAC algorithm is implemented in the R diet estimation package qfasar.


INTRODUCTION
Understanding predator ecology requires knowledge of their diets. Unfortunately, accurate diet estimation can be challenging. All methods of diet estimation have inherent biases or limitations that can influence the accuracy of diet estimates in some situations or for some species (e.g., Klare et al. 2011, Bowen andIverson 2013).
In addition, diets may vary seasonally, be influenced by the juxtaposition of home range location and the distribution of prey species, differ among sex and age classes of predators, or vary due to individual specialization, preference, or behavior (e.g., Bromaghin et al. 2013, Rode et al. 2014, Mohan et al. 2016, Tartu et al. 2016. Consequently, diet estimation remains an active area of research in quantitative ecology. Quantitative fatty acid signature analysis (QFASA) is a popular method of estimating predator diet composition (Iverson et al. 2004). The elemental unit of information in QFASA is a fatty acid signature, which is a vector of proportions describing the mass composition of fatty acids in the storage lipids of an individual predator or prey animal (hereafter signature). The method requires a sample of signatures from predators and each type of prey potentially consumed, as well as constants called calibration coefficients that adjust for the differential metabolism of fatty acids by predators and are typically estimated in controlled feeding trials (e.g., Thiemann et al. 2008, Budge et al. 2012, Rosen and Tollit 2012, but see Bromaghin et al., in press). A predator signature is modeled as a mixture of prey signatures, with diet composition estimated as the mixture that minimizes a measure of distance between observed and modeled predator signatures. Potential advantages of QFASA include the large number of fatty acids in lipids, which can avoid the problem of underdetermined systems common with stable isotope markers (e.g., Phillips andGregg 2003, Brett 2014) and allow the contribution of more prey types to be estimated, and that diet estimates pertain to a period of weeks to months (Budge et al. 2006).
The recognized structure within both prey and predator signature data inherently influences QFASA diet estimation. Prey signatures are grouped into prey types, often by species, and the proportional contribution of each designated prey type to a predator diet is estimated. Consequently, the structure imposed by designation of prey types is one of the most influential decisions made during an analysis. In addition, the patterns of similarity and dissimilarity among the signatures of different prey types largely determine the accuracy with which diets can be estimated (Bromaghin et al. 2016b). If predators primarily consume prey that have distinct signatures, diet estimation may be very accurate, but the converse is also true. In addition, if predator data are structured by factors such as sex or age class, investigators will likely desire an estimate of mean diet composition for each class of predator, and differences in estimated diet composition among classes of predators can provide important insights into feeding ecology (e.g., Bromaghin et al. 2013, Rode et al. 2014).
The existence of structure within signature data that are not explicitly accounted for in an analysis, which we refer to as hidden structure, is likely to influence QFASA in undesirable, though somewhat predictable ways. In particular, hidden structure within a "library" of prey signature data is likely to be associated with increased confusion between prey types, sometimes called prey confounding (Bromaghin et al. 2016a), and decreases the degree to which the mean prey-type signatures used in diet estimation are representative of individual prey animals. As an extreme example, if the signatures of a prey type were comprised of two relatively distinct groups of approximately equal size, the mean signature would not be representative of any individual prey and model performance would almost certainly be reduced. In general, prey confounding decreases the ability of the model to distinguish between prey types and accurately estimate diet composition (Bromaghin et al. 2016a). Hidden structure in predator signatures is probably of less concern in most investigations, but could result in estimates of mean diet for predator classes that overlook interesting aspects of predator ecology.
The detection and exploitation of hidden structure in signature data therefore provide a potential opportunity for investigators to improve diet estimates and gain insights into aspects of ecology linked to diets. Partitioning a library of prey signatures into clusters with a finer resolution than the original prey types, as warranted by the data, can produce prey types that are more homogeneous and distinct and thereby improve diet estimation through reduced prey confounding. Further, the discovery of unexpected structure within either prey or predator signatures can provide new insights into their ecology, or at least generate hypotheses about ecological mechanisms that might underlie such structure (e.g., Marcus et al. 2016).
Multivariate methods routinely used to explore signature data for structure, or more commonly to visualize the structure represented by designated prey types, may not be ideally suited for diet estimation with compositional data. For example, ordination methods such as principal component analysis, linear discriminant analysis, multidimensional scaling, and correspondence analysis are common choices (e.g., Meynier et al. 2010, Wold et al. 2011, Couturier et al. 2013, Galicia et al. 2015, Tartu et al. 2016). Many of these methods are based on an assumption of multivariate normality, even though compositional data are inherently non-normal. Data transformations to either achieve approximate normality or remove the linear redundancy inherent in compositional data (signature proportions must sum to 1.0) are often applied (e.g., Budge et al. 2006, Wang et al. 2016, though rarely evaluated for suitability. Even though use of these methods may be appropriate, the lack of any linkage with methods of diet estimation may limit their effectiveness for exploration of signature data.
We develop and investigate the performance of a new clustering technique to detect and exploit hidden structure in fatty acid signature data. Our primary motivation is to identify clusters of similar signatures within designated prey types to minimize confounding within a prey library and thereby maximize the accuracy of diet estimation. However, we acknowledge that identifying structure within predator data can provide useful insights into their ecology and might be particularly beneficial in some investigations. Our method has three characteristics that we expect will convey benefits when used in combination with QFASA diet estimation. The method is based on the same distance measure that will subsequently be used for diet estimation. Using methods of data exploration that are closely linked to methods of diet estimation increases the internal consistency of an analysis and can be expected to produce clusters of predators or prey that are, in a sense, optimally structured for diet estimation. Basing the method on a measure of distance designed for compositional data avoids the need for data transformation. Finally, the method is nonparametric and therefore requires no subjective distributional assumptions.

Data inputs
We based our investigation on two prey libraries having quite different characteristics, both of which have previously been used to evaluate the performance of QFASA diet estimators (e.g., Bromaghin et al. 2016b, which cites data sources). The marine mammal library is comprised of 357 signatures of 67 fatty acids classified into seven prey types (Rode et al. 2014), and the Scotian Shelf fish library is made up of 954 signatures of 68 fatty acids grouped into 28 prey types (Budge et al. 2002). In addition to the larger number of prey types in the fish library, it is characterized by a higher degree of prey confounding (Bromaghin et al. 2016a). With the mammal library, we used a suite of 31 fatty acids and the marine-fed mink calibration coefficients that have previously been used with this library (Thiemann et al. 2008). With the fish library, we used the extended dietary suite of fatty acids (Iverson et al. 2004) and the calibration coefficients of Rosen and Tollit (2012), with the exceptions of values of 1 for 16:3n-1, 20:1n-5, and 24:1n-11 (no values published for these fatty acids) and 1.04 for 22:2n-6 and 0.23 for 24:1n-9 (Nordstrom et al. 2008).
Both prey libraries were prepared for analysis by replacing fatty acid proportions that were missing or equal to zero with a small positive value equal to 75% of the smallest non-zero proportion in the prey library. Replacement of signature proportions with value zero is common in QFASA because popular distance measures involve logarithms and are therefore not defined for signatures containing zeros (Bromaghin et al. 2016a). Values of the other proportions in each signature were scaled by a constant so that all summed to 1 using the multiplicative method (Mart ın-Fern andez et al. 2011). Signatures were then censored to the suite of fatty acids used with each library (described above) and augmented with an additional proportion so that all summed to 1, which can reduce bias in diet estimates under some circumstances (Bromaghin et al. 2016a). A calibration coefficient for the augmented proportion was obtained using the function cc_aug of the R package qfasar (Bromaghin, in press).

Divisive magnetic clustering
The divisive magnetic clustering (DIMAC) algorithm is a modification of the diana algorithm of the R package cluster (Maechler et al. 2016). The algorithm begins with all signatures in one cluster. The two signatures having the greatest distance between them are selected as initial "magnets," and each signature is drawn to its nearest magnet to form clusters. After the selection of the initial two magnets, the algorithm enters an iterative phase. In each iteration, the mean distance between the mean cluster signature and individual signatures within the cluster is computed for each cluster, and the cluster having the largest mean distance is designated as the "active" cluster. The two signatures within the active cluster having the greatest distance between them are found and selected as new magnets, one of which replaces the original magnet for the active cluster, so the total numbers of clusters increases by one during each iteration. All signatures are then assigned to their nearest magnet, without regard for cluster membership in the preceding iteration. The distance between individual signatures and their cluster mean is summed across clusters as a measure of distance reduction relative to the preceding iteration. The iterations continue in that fashion, with one additional cluster added each cycle, until each signature is a singleton cluster.
We applied the DIMAC algorithm separately to the data for each original prey type using the Aitchison distance measure (Aitchison 1986). We used the Aitchison distance because it is commonly used with compositional data and has been reported to have advantages over other distances commonly used with QFASA (Bromaghin et al. 2016b). Because the within-cluster distance will generally decrease from its maximum when all signatures are in one cluster to zero when each signature forms a singleton cluster, only unusually large reductions in total distance signal the discovery of meaningful structure. Consequently, we examined distance by iteration to find large reductions that potentially identified an appropriate number of clusters for each prey type and partitioned prey signatures into those clusters.

Leave-one-prey-out analyses
We performed leave-one-prey-out analyses (Bromaghin et al. 2016a) to determine whether partitioning the prey libraries successfully reduced prey confounding compared to the original libraries, and therefore might produce more accurate diet estimates, and also to potentially finetune a partition. An individual prey signature was temporarily removed from the library, the mean prey-type signature was recomputed, and the mixture of prey types best approximating the removed signature was estimated identically to diet estimation without calibration coefficients. This process was repeated for each signature in a library and the mixture estimates were averaged by prey type and across all signatures irrespective of prey type. The degree to which estimates were associated with the correct prey types measured the distinctiveness of prey types within a library, and patterns of estimation error measured confounding between prey types.
A leave-one-prey-out analysis was performed with the original prey library to provide a comparative baseline. A leave-one-prey-out analysis was then performed with the partitioned library, and the estimates were pooled back to the original prey types for comparison. If the partition produced only marginal improvement or caused increased confounding between some clusters, compared to the results obtained with the original library, it was reevaluated and potentially modified. This evaluation was performed iteratively until the overall increase in performance was maximized with the smallest number of clusters.

Example diets
The degree to which a partitioned library might lead to improved diet estimates was assessed by simulating diet estimation with a small number of example diets. We used the six "realistic" diets of gray seals (Halichoerus grypus, four combinations of sex and season) and polar bears (Ursus maritimus, adult females and males), predators suitable for these libraries, obtained from the literature and previously used as test cases to investigate the performance of QFASA diet estimators . In addition, we constructed best-and worst-case diets for each partitioned library by maximizing and minimizing, respectively, the correlation between dietary proportions and the differences between leave-one-prey-out estimates obtained with the partitioned and original libraries. Best-case diets were therefore concentrated on prey types whose leave-one-prey-out results most improved with the partitioned library, with the converse being true for the worst-case diets.
For each example diet described above, we compared the accuracy of diet estimation with the original and partitioned libraries by simulating the signatures of 5000 predators conditioned on each example diet and then estimating the diet each signature was based on. Predator signatures were simulated using methods previously described (e.g., Bromaghin et al. 2016b). Briefly, an independent bootstrap sample was drawn from the signatures of each prey type, with bootstrap sample sizes obtained using the method of Bromaghin (2015). Mean prey-type signatures were computed from the bootstrap samples and a predator signature was computed as a weighted average of the mean prey signatures, with the proportions of the example diet as weights. Calibration coefficients (referenced above) were then used to transform the simulated signatures from the prey space to the predator space, accounting for predator metabolism .
The diet composition used to generate each simulated signature was estimated in the predator space  with the Aitchison distance measure (Aitchison 1986) using both the original and partitioned libraries. Estimates obtained with the partitioned libraries were pooled back to the original prey types to permit comparison with the results obtained using the original libraries. The mean and standard error of the estimates for each diet were computed by prey type, and the mean was compared to the known diet mixture to estimate bias. Summary measures across all prey types were computed as the sum of the absolute values of bias and the square root of the sum of variances. All computations were made using R version 3.3.0 (R Core Team 2016). All QFASA-specific computations, including the DIMAC algorithm, leave-one-prey-out analyses, simulating predator signatures, and estimating diet composition, were made using version 1.0.0 of the R package qfasar (Bromaghin, in press).

DIMAC results
The DIMAC results from the fish library led to an initial partition of the 28 original prey types into a total of 54 clusters, with the number of clusters per prey type ranging from 1 to 4. Among the 17 prey types that were partitioned, the reduction in total distance between individual signatures and cluster mean signatures ranged from 17% to 70% and averaged 40%.
With the mammal library, DIMAC results led to an initial partition of the seven original prey types into 13 clusters, with one prey type not partitioned and the other six types partitioned into two clusters each. Among the six prey types that were partitioned, the decrease in total distance between individual signatures and cluster means ranged from 10% to 23% and averaged 15%.
An example of the large reduction in total distance between individual and mean signatures that occurs when hidden structure is discovered is provided by the gaspereau (Alosa pseudoharengus) prey type of the fish library (Fig. 1). Moving from one to two clusters resulted in a modest reduction in distance, but partitioning the data into three clusters produced a substantial reduction. As the data were partitioned into more than three clusters, the total distance declined rather gradually to zero. We interpreted this result as an indication that three clusters were likely appropriate for this prey type. The substantial distance reduction when gaspereau were partitioned into three clusters is representative of what we observed with most prey types that were partitioned, especially with the fish library.
As an aside, newly identified clusters in both prey libraries had varying degrees of correspondence with auxiliary information (results not shown). Nearly all of the fish library prey types that were partitioned showed moderate to perfect association with information such as the location or year of sample collection. Results with the mammal library were less striking. The partition of beluga whales (Delphinapterus leucas) into two clusters matched sampling location quite well, and the partition of bowhead whales (Balaena mysticetus) into two clusters was related to whale length. However, clusters of the other mammal prey types did not show any obvious relationships with auxiliary information.

Leave-one-prey-out analyses
The comparison of leave-one-prey-out results with the original and initial partitioned libraries led us to iteratively fine-tune the partitions of both libraries. Fine-tuning was motivated by the finding that initial partitions for some prey types only marginally improved leave-one-prey-out results or increased confounding between some clusters within other prey types. Consequently, we reassessed the DIMAC results in an attempt to improve our original partitions and again assessed the revised partition using an additional leave-one-prey-out analysis.
Three fine-tuning iterations with the fish library led to a final partition containing 53 clusters (Table 1), with the total reduction in distance for the 16 prey types that were partitioned ranging from 17% to 70% and averaging 40%. The proportion of prey associated with the correct prey type averaged across prey types improved by 32%, and the average across individual prey increased by 55% (Table 1).
For the mammal library, three iterations led to a final partition containing 10 clusters (Table 2), with the total reduction in distance for the three partitioned prey types ranging from 14% to 23% and averaging 18%. Consistent with the lower reduction in total distance obtained with this library, the proportion of prey associated with the correct prey type averaged across prey types improved by 2%, with only the spotted seal (Phoca largha) prey type improving modestly, and the average across individual prey increased by 1% (Table 2).

Example diets
The bootstrap sample size algorithm (Bromaghin 2015) indicated that samples of three herring (Clupea harengus) and one from each of the other prey types in the fish library produced simulated predator signatures with realistic Results based on the realistic diets confirmed our expectation that improvement in diet estimation was dependent on reduced prey confounding for those prey types contributing to diet. The gray seal diets had contributions from several prey types whose leave-one-prey-out results improved with partitioning, for example, herring and redfish (Sebastes fasciatus), and diet estimates with the partitioned library improved correspondingly (Tables 3-6). Among these four diets (combinations of sex and season), the sum of absolute bias across prey types decreased between 25% and 45% and the square root of the sum of the variances decreased between 19% and 32% with the partitioned library. The smallest improvement occurred with the diet of males sampled in spring (Table 4), which had a larger contribution from pollock (Pollachius pollachius), a highly confounded prey type whose leave-oneprey-out results slightly worsened with partitioning (Table 1). Estimates for the realistic polar bear diets obtained with the original library were superior to those obtained with the partitioned library. The sum of absolute bias across prey types increased 26% for the adult female diet and 31% for the adult male diet with the partitioned  library, and the square root of the sum of the variances across prey types increased 10% and 7%, respectively ( Table 7). Results of diet estimation with the best-and worst-case diets supported our hypothesis that diet estimation would improve in the best case and worsen in the worst case. With the partitioned fish library, the sum of absolute bias across prey types decreased 41% and the square root of the sum of the variances decreased 22% for the bestcase diet. Conversely, with the worst-case diet, the sum of absolute bias increased 128% and the square root of the sum of the variances decreased 14% (Table 8). With the partitioned mammal library, the sum of absolute bias increased for both the best-and worst-case diets, 98% and 38%, respectively, and the square roots of the sums of the variances decreased slightly with both diets, 13% and 4%, respectively (Table 9).

DISCUSSION
Our DIMAC approach has obvious value as a data exploration tool in QFASA investigations, detecting hidden structure in prey signature data that can potentially be used to refine prey-type designations and thereby improve diet estimation. The reassignment of each signature to the nearest magnet during each iteration of the algorithm, independently from its cluster assignment in the preceding iteration, gives the algorithm extreme flexibility to conform to whatever structure exists within signature data. The correspondence between clusters and auxiliary information that we observed, particularly with the fish library, is interesting and provides some evidence that the method detected true structure within the data, rather than random noise. At the same time, we do not wish to overemphasize the importance of such correspondence, as exploiting the existence of hidden structure in a prey library has the potential to improve diet estimation irrespective of whether or not available auxiliary data help explain that structure. The number of clusters to select for each prey type is, unavoidably, a subjective decision. Our recommendation is to apply the approach conservatively and only partition prey types when doing so leads to a substantial reduction in distance and does not increase confounding among prey types that may contribute importantly to diet. Based on our results with these two prey libraries, meaningful improvement in leave-oneprey-out estimates tended to be associated with partitions providing a reduction in total distance of at least 20%, with reductions exceeding 30% performing even better. However, determining whether a threshold of this magnitude might be suitable for other libraries will require additional experience with the method. Until more universal guidance emerges from additional experience, investigators who wish to experiment with this technique may need to conduct an evaluation with their specific prey library following our examples.
If the number of clusters within a partitioned library exceeds the number of fatty acids in (potentially augmented) signatures, as was the case here with the fish library, investigators may need to balance the competing problems of bias vs. non-uniqueness in diet estimates. Using fewer than the optimal number of clusters, to avoid non-uniqueness caused by the number of clusters exceeding the number of fatty acids, may increase prey confounding and thereby reduce estimation accuracy. Conversely, if the number of clusters exceeds the number of fatty acids, diet estimates may not be unique. To the extent that any lack of uniqueness occurs between partitioned clusters of original prey types, the problem can be mitigated by pooling estimates back to the original prey types (e.g., Meynier et al. 2010), and pooling to an even smaller number of "reporting groups" can further reduce bias and variance (Bromaghin 2008). However, there is no guarantee that pooling estimates will completely remedy nonuniqueness. An intermediate approach in which the most distinct clusters are partitioned first and  the total number of clusters is constrained to not exceed the number of fatty acids may be worth consideration. However, the most effective strategy in any particular investigation seems likely to depend on the structure of the prey library and typical predator diet composition, and a small simulation study may be necessary to inform the choice.
The DIMAC approach seems likely to be useful for other types of data exploration in QFASA studies. The algorithm could easily be applied to all prey or predator signature data. In such cases, one would be interested in how well clusters align with auxiliary information associated with the signature data. The results might support expectations, or perhaps reveal unexpected relationships that lead to the generation of new hypotheses regarding ecological mechanisms underlying observed patterns. For example, our finding that bowhead whale cluster membership was associated with whale length was unexpected and raises questions regarding ecological mechanisms that we cannot yet answer, but suggests that their diet or metabolism may change with size or age. Investigators sometimes search for patterns in diet estimates (e.g., Bromaghin et al. 2013, Galicia et al. 2015 and the DIMAC method could be applied to vectors of diet estimates as easily as signatures. However, because diet estimates incorporate estimation variance and potentially the influence of calibration coefficient inaccuracies (Bromaghin et al. 2016b), it might be preferable to first use the method to look for patterns in predator signatures rather than estimated diets.
The results we obtained with the fish and mammal libraries seem likely to exemplify the range of results that might be obtained with other libraries. With the fish library, we found partitions for many prey types that substantially reduced total distance between signatures and the cluster mean. Exploiting this structure reduced prey confounding, as measured by leave-one-prey-out analyses, and improved diet estimates. Conversely, none of the prey types within the mammal library appeared to be comprised of distinct clusters and, consequently, partitioning that prey library provided no benefits. The lack of improvement with the mammal library was not completely unexpected. This library has a simple structure, being comprised of only seven species, several of which have distinctive mean signatures (Table 2). Prey confounding within the mammal library primarily occurs among the three small ice seal species (ribbon [Histriophoca fasciata], ringed, and spotted seals; Bromaghin, in press), but partitioning those prey types did not meaningfully improve discrimination between them.
One potential consequence of partitioning signature data into clusters that we did not fully anticipate is the potential to increase confounding among some prey types. We generally partitioned prey types when doing so led to a substantial reduction in the total distance between individual and mean signatures, so leave-one-prey-out estimates for those prey types tended to improve substantially. However, partitioning sometimes produced increased confounding between the new clusters and other prey types. The discovery of this effect led us to implement an iterative approach in which we attempted to balance improved performance for some prey types and elevated levels of confounding among others.
Our results support previous conclusions that the performance of QFASA diet estimators depends strongly on the interaction between prey library characteristics and predator diet composition (Bromaghin et al. 2016a, b). Partitioning the fish library improved its structure with respect to several species of fish that were meaningful components of the realistic gray seal diets, so use of the partitioned library produced superior estimates. Similarly, the results for the best-and worst-case diets for the fish library clearly illustrate the importance of the diet-library interaction for diet estimation. When diet composition was concentrated on prey types of the fish library whose structure improved with partitioning, diet estimates improved; the converse was also true.

CONCLUSIONS
We propose the DIMAC method as a valuable new tool to aid in the exploration of fatty acid signature data. Basing data exploration and diet estimation on identical methods creates an internal consistency within an analysis, noticeably lacking in prior applications of QFASA, which can be expected to enhance the effectiveness of data exploration to inform decisions necessary for diet estimation. The discovery of hidden structure within a prey library can be exploited to produce substantially improved diet estimates if the structure of the partitioned prey library reduces confounding among prey consumed by predators. In addition, the method may reveal unexpected patterns in signatures that generate hypotheses regarding ecological mechanisms to explain those patterns and ultimately lead to a refined understanding of either predator or prey ecology. The DIMAC algorithm is available for use by other investigators in the R package qfasar (Bromaghin, in press).