Differences among Major Taxa in the Extent of Ecological Knowledge across Four Major Ecosystems

Existing knowledge shapes our understanding of ecosystems and is critical for ecosystem-based management of the world's natural resources. Typically this knowledge is biased among taxa, with some taxa far better studied than others, but the extent of this bias is poorly known. In conjunction with the publicly available World Registry of Marine Species database (WoRMS) and one of the world's premier electronic scientific literature databases (Web of Science®), a text mining approach is used to examine the distribution of existing ecological knowledge among taxa in coral reef, mangrove, seagrass and kelp bed ecosystems. We found that for each of these ecosystems, most research has been limited to a few groups of organisms. While this bias clearly reflects the perceived importance of some taxa as commercially or ecologically valuable, the relative lack of research on other taxonomic groups highlights the problem that some key taxa and the associated ecosystem processes they affect may be poorly understood or completely ignored. The approach outlined here could be applied to any type of ecosystem for analyzing previous research effort and identifying knowledge gaps in order to improve ecosystem-based conservation and management.


Introduction
Existing knowledge shapes our understanding of ecosystems and determines our ability to identify what drives ecosystem function and promotes ecosystem resilience, and to understand the nature and role of keystone species. Such information is critical to the successful conservation of the world's biodiversity and increasingly underpins management, particularly the broad approach referred to as ecosystem-based management (EBM). However, while a large and growing body of ecological knowledge is stored in the scientific literature, representing a broad range of the world's ecosystems, this existing knowledge may not adequately represent the range of taxa present in these ecosystems. An understanding of this potential bias is becoming increasingly urgent as biodiversity is lost [1], ecosystems are degraded [2] and the vitally important goods and services that they provide are threatened [3,4].
By concentrating on four major marine ecosystems, we examine the taxonomic distribution of existing ecological knowledge and the extent to which various taxonomic groups may be under- or over-represented in our knowledge of these systems. We analyzed the literature for coral reefs, seagrass beds, mangroves and kelp beds because these ecosystems provide important ecosystem goods and services both individually and via functional linkages [4,5,6,7,8,9,10]. In addition, each ecosystem is relatively discrete and therefore easy to delineate and is defined by its dominant habitat-forming organisms. Also, these ecosystems are at considerable risk from both direct and indirect anthropogenic pressures such as pollution, development, overfishing and now global warming and ocean acidification [11,12]. Indeed, few if any areas remain where these ecosystems have not been impacted to some extent [11]. Therefore, now more than ever, it is important to assess our current knowledge of these ecosystems and consider how future research efforts may be best allocated to maximize our chance of achieving sustainable management. Using text mining (following [13]) of papers contained within the Web of Science® (WoS), we examined how existing ecological knowledge and associated research efforts are distributed among different taxonomic groups.

Sampling the scientific literature
Web of Science® (WoS) is one of the world's largest literature databases and includes much of the published information relevant to marine ecology. WoS was searched for the years 1957-2009 using the following keywords: "coral reef/s", "mangrove forest/s", "kelp forest/s", "seagrass bed/s" and "seagrass meadow/s". "Coral", "mangrove", "kelp" and "seagrass" were not used on their own as search terms. This was done to ensure that returns were relevant to the ecosystems of interest, rather than simply including all possible studies of these particular organisms. The resulting 13,229 papers were exported in EndNote® format and transferred to Microsoft Access®. Structured Query Language (SQL) was then used to further limit the resulting data to those papers containing these same search terms in the title, keywords (author keywords only [13]) or abstract fields. The filter resulted in a set of 9303 papers, fewer than the number produced by the WoS search, because it removed papers that refer to these search terms only in the KeyWords Plus® field [14,15] or in other WoS fields not used here.
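The field-restricted filter described above can be illustrated with a minimal Python sketch. This is not the SQL used in the study; the record structure and the field names (`title`, `author_keywords`, `abstract`, `keywords_plus`) are hypothetical, and the regular expressions are one possible reading of the "/s" singular-or-plural search terms.

```python
import re

# Ecosystem search terms from the text; "reef/s" is read here as singular or plural.
# Seagrass beds and seagrass meadows are pooled into one ecosystem.
ECOSYSTEM_PATTERNS = {
    "coral reef": re.compile(r"\bcoral reefs?\b", re.IGNORECASE),
    "mangrove forest": re.compile(r"\bmangrove forests?\b", re.IGNORECASE),
    "kelp forest": re.compile(r"\bkelp forests?\b", re.IGNORECASE),
    "seagrass bed": re.compile(r"\bseagrass (?:beds?|meadows?)\b", re.IGNORECASE),
}

# Hypothetical field names: only these three fields are searched, so a term
# that appears solely in another field (e.g. "keywords_plus") is ignored.
SEARCHED_FIELDS = ("title", "author_keywords", "abstract")


def ecosystems_for(record):
    """Return the ecosystems whose search term occurs in the searched fields."""
    text = " ".join(record.get(field) or "" for field in SEARCHED_FIELDS)
    return {name for name, pattern in ECOSYSTEM_PATTERNS.items() if pattern.search(text)}


def filter_records(records):
    """Keep only records that match at least one ecosystem search term."""
    return [record for record in records if ecosystems_for(record)]
```

Under these assumptions, a record that mentions "coral reef" only in its `keywords_plus` field is dropped, which mirrors the reason given above for the reduction from 13,229 to 9303 papers.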

Taxonomic assignment
Text from the title, keywords and abstracts was matched against scientific names contained within the World Registry of Marine Species (WoRMS [16]). To achieve this, the open source statistical programming language R [17] was used to generate a vector of all unique single and double word sequences from the text of the title, abstract and keywords, which was then matched against WoRMS. Research papers were limited to those that could also be assigned to a taxonomic group at the phylum level or better. For simplicity of interpretation, results were limited to the Animalia, Plantae and Chromista kingdoms; within the WoRMS database these kingdoms encompass all animals, plants (including red and green algae) and brown algae, respectively. Because taxonomic assignment was based entirely on the WoRMS database, the patterns observed here depend on the named taxa occurring therein. Valid species named in WoRMS are 87% checked by taxonomic editors and represent 87% of the estimated named marine species. WoRMS contains synonyms as well as valid taxonomic names. Papers containing a match to a taxonomic name listed as a synonym in WoRMS were assigned to the valid taxonomic name for that synonym.
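As an illustration of the matching step, the sketch below builds the set of unique one- and two-word sequences from a text and looks them up in a name table that maps every listed name (valid or synonym) to its valid name. It is written in Python rather than the R used in the study, and the table layout and the species names `Exemplum novum` and `Exemplum vetus` are invented placeholders, not WoRMS entries.

```python
import re


def word_sequences(text):
    """Return all unique one- and two-word sequences in the text, lowercased."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    singles = set(words)
    doubles = {f"{first} {second}" for first, second in zip(words, words[1:])}
    return singles | doubles


def match_taxa(text, name_table):
    """Return the valid names matched in the text, resolving synonyms.

    name_table is a hypothetical dict: lowercased listed name -> valid name.
    """
    return {name_table[seq] for seq in word_sequences(text) if seq in name_table}


# Placeholder name table: the second entry is a synonym of the first.
NAME_TABLE = {
    "exemplum novum": "Exemplum novum",
    "exemplum vetus": "Exemplum novum",
    "actinopterygii": "Actinopterygii",
}
```

For example, `match_taxa("Grazing by Exemplum vetus and other Actinopterygii.", NAME_TABLE)` returns the valid name for the synonym together with the class name. Single-word sequences catch higher taxa such as classes and phyla, while two-word sequences catch binomial species names.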
Taxonomic names were searched only in the title, keywords and abstracts. Therefore, not all relevant literature may have been captured, particularly for ecological research where functional roles were the focus of titles and abstracts, with specific taxa mentioned only in the text. Only some literature is available as full text in a searchable electronic format, thus making it difficult to expand the search beyond the fields we searched. In addition, full text will in many cases refer to taxa names that are not the focal species of a study, but are instead discussed to provide context for the results being reported. Rather than retrieving every publication that referred to a particular species, our goal was to develop a relative index of research effort. The taxonomic patterns in research effort reported here thus assume that the ratio of literature containing specific taxonomic names (in the titles, keywords and abstracts) to literature that does not include such specific information in these fields is equivalent across major taxonomic groups. Furthermore, it is likely that ecological studies that do not include specific taxonomic information focus on better-known taxa; thus, any patterns of bias in research effort would be reinforced if this literature were also included.

Analyses
The number of papers, classes and species occurring in the literature was calculated for each of these four ecosystems and within each phylum. Shannon's evenness index [18] at both the species and class levels was calculated as $-\sum_i P_i \ln(P_i) / \ln S$, where $S$ is the number of species and $P_i$ is the proportion of total abundance of the $i$th species. Chao's [19] estimates of species richness and taxonomic distinctness [20] were also calculated using the vegan [21] package in R. Taxonomic distinctness is a measure of the average distance between all pairs of species in a taxonomic tree, which captures phenotypic differences and functional richness [20]. Taxonomic distinctness was calculated across the whole data set, as well as for three separate periods (prior to 2000, 2000-2006 and 2006-2009). This selection of periods divided the literature into roughly equal-sized sample bins, allowing us to examine how the taxonomic breadth of research effort has changed through time. Individual-based rarefaction [22] was used to graphically examine species richness with increasing sample effort (number of papers) among the four ecosystems.
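The indices named above can be sketched in a few lines of Python. This is an illustration only, not the vegan implementation used in the study: the Chao estimator is assumed to be the bias-corrected Chao1 form, the taxonomic distinctness function assumes equal step lengths between ranks and scales distances to 0-100, and the lineage tuples are a hypothetical input format.

```python
import math
from itertools import combinations


def shannon_evenness(counts):
    """Evenness: -sum(P_i ln P_i) / ln S over taxa with non-zero counts."""
    counts = [c for c in counts if c > 0]
    s = len(counts)
    if s < 2:
        return float("nan")  # undefined for fewer than two taxa
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts)
    return h / math.log(s)


def chao1(counts):
    """Bias-corrected Chao1 (assumed form): S_obs + f1 (f1 - 1) / (2 (f2 + 1))."""
    counts = [c for c in counts if c > 0]
    f1 = sum(1 for c in counts if c == 1)  # taxa seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # taxa seen exactly twice
    return len(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def taxonomic_distinctness(lineages):
    """Mean pairwise taxonomic distance (0-100), equal step lengths assumed.

    Each lineage is a tuple ordered from highest rank to species,
    e.g. (phylum, class, species); all tuples must have the same length.
    """
    pairs = list(combinations(lineages, 2))
    if not pairs:
        return float("nan")
    depth = len(lineages[0])

    def distance(a, b):
        shared = 0
        for rank_a, rank_b in zip(a, b):
            if rank_a != rank_b:
                break
            shared += 1
        return 100.0 * (depth - shared) / depth

    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```

Here `counts` would be the number of papers per species (or per class) in one ecosystem's literature. A literature dominated by many species from a few classes gives a low distinctness value even when species richness is high, which is the contrast the two measures are meant to expose.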
The numbers of papers within each phylum, class and species were calculated and frequency histograms were used to examine patterns in the number of papers within different classes across the four ecosystems. The probability of occurrence within the literature was estimated for each class by fitting binomial models using the function glm of the package stats [23] in R and equations detailed in [24]. To explore the relationship between research effort and global known richness of named species, we plotted the total number of research papers as a function of the total number of valid species contained within WoRMS. Trends in this relationship were analyzed using Generalized Additive Mixed Models (GAMM) [24] and were fitted using the function gamm in the mgcv [25] package in R. Both the number of papers and number of species were log10-transformed to remove "trumpeting" of variances. To remove some taxonomic non-independence, phylum was included as a random effect. Deviations (residuals) from these GAMMs were used as an estimate of research effort corrected for species richness for each class. To determine the most well-studied taxa, different classes were ranked based on the probability of occurrence in the literature across all four ecosystems, as well as on research effort corrected for species richness as described above.
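The logic of the two rankings can be shown with a deliberately simplified Python sketch. It is not the analysis performed in the study: the per-class occurrence probability is taken as the raw proportion of papers (with its logit, the scale on which a binomial model works), and an ordinary least-squares straight line on log10-log10 axes stands in for the GAMM, with no smooth term and no random effect for phylum. All function names are hypothetical.

```python
import math


def occurrence_probability(papers_with_class, total_papers):
    """Proportion of papers naming a class, plus its logit (nan at 0 or 1)."""
    p = papers_with_class / total_papers
    logit = math.log(p / (1 - p)) if 0 < p < 1 else float("nan")
    return p, logit


def log_log_residuals(n_species, n_papers):
    """Residuals of log10(papers) from a straight-line fit on log10(species).

    Simplification: ordinary least squares replaces the GAMM described in
    the text. Positive residuals mean more papers than expected for a class
    of that species richness; negative residuals mean fewer.
    """
    x = [math.log10(s) for s in n_species]
    y = [math.log10(p) for p in n_papers]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

Ranking classes by these residuals separates groups that are heavily studied simply because they are species-rich from groups that attract disproportionate attention relative to their known richness.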

Results
A total of 2380 unique species from 78 taxonomic classes of marine organisms of the kingdoms Animalia, Chromista and Plantae as defined by WoRMS were detected in the ISI-indexed literature for these four major marine ecosystems (Table 1). This total represents 57% of the valid classes from these kingdoms contained in WoRMS (Table 2). Coral reefs dominated in terms of the diversity of taxa studied, with at least one paper found for 1580 species from 66 classes (Table 1). Coral reefs were followed in diversity by 597 species in the seagrass bed literature (50 classes), 201 species in the mangrove forest literature (38 classes) and 131 species (22 classes) in the kelp forest literature. Chao's estimators of species diversity followed a similar pattern to raw species richness, with greatest diversity found in the coral reef literature, followed by literature on seagrass beds, mangrove forests and kelp forests (Table 1, Fig. 1). Patterns in taxonomic distinctness (a measure of the average distance between all pairs of species in the taxonomic tree) differed from species richness, with greatest distinctness occurring in the mangrove and kelp forest literatures, followed by the seagrass bed literature and finally with the coral reef literature being the least taxonomically distinct, indicating that a smaller range of taxonomic groups are well represented (Table 1, Fig. 2). Within each ecosystem, taxonomic distinctness was greatest for literature dating prior to 2000 and was less distinct or similarly distinct for the two more recent time periods (2000-2006 and 2006-2009). Across all ecosystems and time periods, the smallest value for taxonomic distinctness was found for the most recent research on coral reefs.
The number of papers for different classes in the four different ecosystems indicated that research has been highly uneven with respect to the taxa investigated, with a very small number of classes having been the subject of the bulk of the research effort to date (Table 1). The number of research papers within each class for all four ecosystems was positively related to the total number of species recorded in the World Registry of Marine Species (WoRMS) [16] (Fig. 3). However, this relationship was relatively weak (R² values ranged from 0.28 to 0.52), with some classes showing considerably greater research effort relative to their known species richness and others much lower (Fig. 3). Summed across all ecosystems, the Actinopterygii (fishes) were the most frequently studied class, with some 1559 papers (Table 2). Within all four ecosystems, fish research was also a large component. For coral reefs, fishes were the subject of 30.6% of research papers (Table 2). For all three of the other ecosystems, fishes were also one of the most studied groups by actual numbers of papers published (Table 2), showing a high probability of occurrence in the literature, as well as when research effort (numbers of papers) was corrected for species richness (Fig. 4).
Research in all four ecosystems also focused largely on their respective habitat-forming classes (e.g. 24.3%, 41.2%, 49.3% and 34.6% of the research papers for coral reefs (Anthozoa), kelp forests (Phaeophyceae), mangrove forests (Magnoliopsida) and seagrass beds (Liliopsida), respectively; Table 2). This dominance of research on habitat-forming classes is reflected in their high probability of occurrence in the respective literature, and remains after correcting research effort for species richness (Fig. 4).
Along with fishes and the habitat-forming taxa, several other classes were well studied across a range of ecosystems. The Echinoidea, which occurred frequently in the coral reef, kelp forest and seagrass literature, were also studied more than expected given their species richness (Fig. 4). While the Malacostraca and Gastropoda contributed substantially in terms of total numbers of papers (Table 2), these groups were apparently studied relatively less than expected given their species richness (Fig. 4). In contrast, there are several classes (e.g. Mammalia, Reptilia and Ulvophyceae) that, while not contributing much to the literature in terms of total numbers of papers, have clearly been studied relatively more than expected given their species richness (Fig. 4).
A wide range of classes also appears to have been studied relatively less than expected given their species richness (Fig. 4). Many of these belong to the phylum Arthropoda (Arachnida, Cephalocarida, Maxillopoda and Pycnogonida), but also included here were a group of nematode worms (Adenophorea), brittle stars (Ophiuroidea) and glass sponges (Hexactinellida). For many taxonomic classes, few (<10) or no research papers were found for any of these ecosystems (Table 3).

Discussion
In the four ecosystems studied, a majority of the research has concentrated on only a few groups of organisms. Although there was a positive relationship between (named) global marine species richness and research effort among different taxonomic classes, some groups were greatly overrepresented in the scientific literature relative to their named species richness while others were greatly underrepresented. To some extent reflective of the economic or perceived ecological significance of some taxa over others, this imbalance suggests that key taxa and ecological processes may be poorly understood. Given that known diversity must also depend to some extent on previous research effort, some of the groups reported here as being understudied are likely to be more diverse than currently recognized. Indeed, undiscovered species of fishes (Pisces) are estimated to be 20-30% of the known fauna, whereas less studied groups such as sponges and platyhelminthes are in the order of 200-300% and nematodes more than an order of magnitude more [26]. If less-studied groups contain more undiscovered species, the extent of the bias we report may be underestimated. Further, if ecological papers (which likely do not list taxonomic names in the text fields considered here: keywords, title and abstract) are biased towards better studied taxa, the disparity between well-studied and poorly-studied taxa may be even more pronounced.

Variation among taxa in research effort
Among the better-studied groups in all four ecosystems were the dominant habitat-forming organisms: corals, kelps, mangroves and seagrasses. These species provide the physical structure that allows them to host associated species as well as provide other ecosystem goods and services [6,7,8,27,28,29,30,31]. Given the importance of these taxa to the functioning of these ecosystems, it is expected and appropriate they have been well studied.
Other groups of well-studied taxa were those that are commercially important, large and conspicuous, or which perform other key functional roles in some ecosystems. For all ecosystems, fishes were one of the best-studied taxa. This was true even when the species richness of this group was taken into account. Again, this emphasis on fishes is not surprising, as they are the most widely distributed and diverse vertebrates on earth [32]. Fish are also of great economic value as food and because of their aesthetic value to tourists. Fishes also contribute to critical processes in ecosystem function, with some considered keystone species (e.g. [33]). Aside from fishes, other potentially commercially important taxa that have been frequently studied, including gastropods, bivalves, malacostracan crustaceans and echinoids, are important herbivores in a range of ecosystems [34,35,36]. Other well-studied taxa (especially relative to their overall diversity) are other large and conspicuous groups, such as mammals and reptiles. These groups also tend to have high conservation value, often being endangered or threatened or playing key functional roles [37], and high economic value in tourism and artisanal fisheries.
Greater than average research effort afforded to some taxonomic groups may be appropriate given their economic and ecological importance. Indeed, even for the most studied taxa, the fishes, some 21% of species across all habitat types remain to be described globally and at fine spatial scales (350 km² spatial resolution) only a tiny fraction of the world's oceans have their fish fauna more than 80% described [38]. Therefore, it seems likely that even in well-studied classes (such as fishes) much of our knowledge is sparse and unevenly distributed among their constituent species.
Across all four ecosystems, a large number of classes were not represented in the literature or have received very little research attention relative to their known diversity. The extent to which these groups are truly understudied depends largely on their actual prevalence in these ecosystems. We do not have information on species richness and abundance for all potentially important taxonomic groups for any of these ecosystems and thus our analysis is based necessarily on named global marine taxa (as currently recorded by WoRMS). There is no doubt that some of these groups remain understudied in some ecosystems because they are not a dominant feature there, and/or the bulk of their diversity is found elsewhere. For example, one of the least-studied groups among all four ecosystems was a class of Porifera known as the glass sponges (Hexactinellida) which, while relatively numerous, are most common in deepwater and the Antarctic [39] and are largely lacking in the ecosystems studied here. Some other groups that remain poorly studied in these shallow water ecosystems may also be largely absent. Using information available online, we attempted to provide an indication of whether each class is likely to be represented in any one of the ecosystems considered here (Fig. 4, Table S1). However, reliable information on habitat affiliations of marine taxa is still largely unavailable for many relatively understudied taxa. A detailed examination of the geographic and ecological distribution of each group would help to elucidate the extent to which understudied groups are in fact underrepresented in these ecosystems relative to their potential importance.
Table 3. Classes of marine Phyla (or Division) occurring in the World Registry of Marine Species (WoRMS) with less than 10 occurrences in the Web of Science® indexed literature for any of the four ecosystems.
While some taxa may be justifiably ignored in these four ecosystems (e.g. if they do not commonly occur there), some highly speciose groups are underrepresented in the literature and may be very important in these ecosystems. Compared to their described diversity, several classes of Arthropoda have been poorly studied in all four ecosystems, and are likely to be prevalent in some (Fig. 4, Table S1). In terrestrial ecosystems, arthropods are highly diverse [40] playing many functional roles [41]. Similar patterns and breadth of ecological function are likely to occur in marine environments. In addition to some of the Arthropoda, several other groups of benthic invertebrates were also understudied with respect to their described diversity. Benthic invertebrates more generally are likely to play an important role in many ecosystems as they span all trophic levels, are important food sources at higher trophic levels and perform crucial roles in bioturbation, oxygenation, nutrient cycling and transport and processing of pollutants [42].

Variation among ecosystems in taxonomic diversity of research
Considerable differences were evident among these four ecosystems in terms of the total (and expected) species richness represented in their respective literatures and their taxonomic distinctness. The coral reef literature reported on more species than other literatures but also had the lowest level of taxonomic distinctness. Taxonomic distinctness is a measure of the average distance between all pairs of species in the taxonomic tree and low values suggest that the bulk of research is on a limited range of taxonomic groups. Without complete community inventories for these ecosystems, it is impossible to know if the patterns represented by the coral reef literature accurately reflect their community structure, or are a result of particularly biased efforts in research on coral reefs (e.g. a bias favouring corals and fishes, because other groups are much harder to enumerate and identify, or because of a bias in research funding). If the patterns observed reflect greater research bias on coral reefs compared to other ecosystems, this suggests that our capacity to understand and model these complex ecosystems is less than in others. Further, the bias towards research on a limited subset of coral reef taxa is greatest in recent literature, suggesting that the situation is getting worse. This is likely in part because the earliest period that we examined was considerably longer (>40 years) than the other two, and as such involved several generations of scientists, potentially with more varied expertise. However, despite their differences in length, these categories were defined by having similar numbers of publications. The progressive shortening of these periods and the decrease in taxonomic distinctness thus indicate that increased research effort is more focused on corals and fishes.
The large taxonomic biases in research effort observed here are likely to be exacerbated, in part, by the dynamics of the current research funding culture. As more research is done on a particular group (e.g. corals and fishes), these groups begin to assume the status of model systems, whereby future research can be leveraged off previous advances in knowledge. While the use of model systems in this way can find favour with reviewers of grant applications and funding agencies, and can have some advantages in terms of building specialist knowledge of particular parts of ecosystems, given finite resources more general knowledge of these systems must be traded off. Such trade-offs may be acceptable where the knowledge gained is applicable to other components of the ecosystem of interest. Such equivalency, however, is not always safe to assume [43], nor easy to test, where data on key species and/or functional groups do not exist. Biases in research effort are also likely to arise when taxonomic expertise is limited and focused on particular taxa. It is well documented that taxonomic effort does not tend to reflect true biological diversity [44] and certain groups are more likely to get identified (and are thereby studied more readily) than others, simply due to there being more taxonomists working on that group.

Future Allocation of Research Effort Among Taxa
Our results indicate an imbalance in research effort among major taxonomic groups for the four marine ecosystems examined. However, it remains difficult to assess the best way to allocate limited research capacity towards future efforts. Research programs driven solely by the immediate needs of management risk overlooking new insights and opportunities [45]. Conversely, research focused beyond these immediate concerns risks being perceived as irrelevant [45]. Conservation status (or success) is often measured by monitoring target taxa thought to act as indicators of ecosystem health and/or function or biodiversity as a whole. Several criteria are important for selecting indicator taxa [46], but to apply these criteria effectively, considerable ecological knowledge is required, thus limiting the choice of possible indicators to a small range of taxa that may or may not prove adequate for monitoring the health of ecosystems. Likewise, biological surrogates (typically well known and easy to survey groups) are often used as a means of assessing biodiversity patterns without having to resort to exhaustive surveys [47]. However, cross-taxon surrogates are rarely effective, and research focused on only a few select taxa is unlikely to provide good predictors of the wider taxonomic diversity or functioning of an ecosystem [10,43,48]. While research should, and will, continue on many well-studied groups, in our opinion, if we are to improve the effectiveness of ecosystem-based management and conservation, more effort needs to be directed towards understanding a broader range of taxa and their interactions.

Supporting Information
Table S1 Complete list of taxonomic classes for which there was at least 1 occurrence in the literature indexed in Web of Science® for any of the four ecosystems. (DOC)