The utility of the ‘Arable Weeds and Management in Europe’ database: Challenges and opportunities of combining weed survey data at a European scale

Abstract Over the last 30 years, many studies have surveyed weed vegetation on arable land. The ‘Arable Weeds and Management in Europe’ (AWME) database is a collection of 36 of these surveys and the associated management data. Here, we review the challenges associated with combining disparate datasets and explore some of the opportunities for future research that present themselves thanks to the AWME database. We present three case studies repeating previously published national scale analyses with data from a larger spatial extent. The case studies, originally done in France, Germany and the UK, explore various aspects of weed ecology (community composition, management and environmental effects and within‐field distributions) and use a range of statistical techniques (canonical correspondence analysis, redundancy analysis and generalised linear mixed models) to demonstrate the utility and versatility of the AWME database. We demonstrate that (i) the standardisation of abundance data to a common measure, before the analysis of the combined dataset, has little impact on the outcome of the analyses, (ii) the increased extent of environmental or management gradients allows for greater confidence in conclusions and (iii) the main conclusions of analyses done at different spatial scales remain consistent. These case studies demonstrate the utility of a Europe‐wide weed survey database for clarifying or extending results obtained from studies at smaller scales. This Europe‐wide data collection offers many more opportunities for analysis that could not be addressed in smaller datasets, including questions about the effects of climate change, macro‐ecological and biogeographical issues related to weed diversity, as well as the dominance or rarity of specific weeds in Europe.


| INTRODUCTION
Weed vegetation surveys are commonly used in weed science to assess the impacts of agricultural practices on weed flora (e.g., Pinke et al., 2012), determine causes of yield losses (e.g., Adeux et al., 2019), monitor conservation efforts (e.g., Kolářová et al., 2013) and map species distributions (e.g., Hanzlik & Gerowitt, 2012). Many surveys are designed for a particular purpose and so the collected data, and associated metadata, are specific to the particular research question. As such, the resulting local databases can contain different methodologies and data types (Bürger et al., 2022). Despite such discrepancies, there is potential for added benefits when surveys are analysed in combination, as demonstrated by the success of several plot-vegetation databases (e.g., European Vegetation Archive [EVA] [Chytrý et al., 2016], European Weed Vegetation Database [Küzmič et al., 2020]). The Arable Weeds and Management in Europe (AWME) database (Bürger et al., 2020) provides a new resource for combining arable weed survey data and complementary management information from across Europe.
The value of combining data or findings from multiple studies is clear, as illustrated by several meta-analyses aiming at improving our understanding of the response of weed communities to drivers of change (e.g., Gu et al., 2021; Richner et al., 2015). Meta-analyses use previously published results to understand the net effect of a specific driver on weed community response. However, the utility of this approach is limited by the accessibility of published statistical results and appropriate measures of confidence in published articles. There is, therefore, scope to provide more robust analyses by returning to the raw data, as demonstrated by the success of similar data collections in adjacent scientific disciplines (e.g., CESTES database [Jeliazkov et al., 2020], which is dedicated to analyses at the metacommunity level including species traits). However, the key challenges facing analysts using the AWME database (Table 1) have not been widely explored.
Here, we will explore the utility of the AWME data collection using three case studies. In each case study, we will focus on an analysis previously published using data from a national scale weed survey (each of which is a component dataset within the AWME database).
Each of the original studies posed a different question about weed communities and their ecology. Using a three-stage process we will determine whether (a) the data contained within AWME allow this question to be addressed at the European scale, (b) it is possible to overcome key challenges associated with analysing data gathered for different purposes and (c) the conclusions of the original publication and our reanalysis remain consistent across scales. Through this work, we aim to examine the utility of the AWME data collection for European scale analysis and guide future analysts in the best practice for using the AWME data collection to avoid key challenges associated with data collections of this kind.

| The AWME database
In 2019, the working group Weeds and Biodiversity of the European Weed Research Society set out to form a data collection of primary arable weed vegetation records. The resulting AWME database (Bürger et al., 2020) currently comprises 36 surveys of arable weed vegetation conducted between 1996 and 2018 across 12 countries (Tables S1 and S2) and contains >40 000 observations of weed vegetation. The unifying feature of these records is that each consists of a list of species found on a plot of a specific size from an arable field or its margin. Observations are complemented by metadata including the survey date and geographic coordinates. The distinguishing feature of the AWME database (from other collections such as EVA) is that observations are supplemented with information on the agricultural management of the survey site, making the data more useful to agronomists and weed scientists seeking to understand the impact of agricultural practices on weed communities. These agricultural [...]

TABLE 1 Key challenges associated with the analysis of data coming from multiple weed vegetation surveys and potential solutions that are incorporated within the Arable Weeds and Management in Europe (AWME) database and our analyses.

[...]
How it is addressed within AWME: Relevant information on timing of each survey is provided. Users are encouraged to focus on the timing of the survey relative to the crop phenology rather than the absolute date.
How we address it in our analysis: Considered in the analysis directly (FR) or indirectly via using the survey as a random effect (UK).

(v) Challenge: Balance of data and spatial sampling biases
Potential problems: Imbalance in the relative size of datasets can introduce biases. Some datasets may be small in sample number and spatial extent, whilst others cover a wider geographic or temporal extent.
How it is addressed within AWME: Not explicitly addressed within AWME, as it may or may not present a problem for a specific analysis.
How we address it in our analysis: Randomly selected one observation from fields with multiple records (FR). Trialled a spatial subsampling procedure (DE) and fitted variograms to the response variables to test for spatial autocorrelation (UK).

(vi) Challenge: Disparity in plot size between surveys
Potential problems: Plot sizes vary according to the purpose of the survey and available resources. Observations on larger plots will likely have higher species richness, and the co-occurrence of species may depend on plot size (Chytrý & Otýpková, 2003).
How it is addressed within AWME: Information about survey methodology and plot size is included within AWME so that the user may make an informed decision as to which approach best suits their needs.

| The case studies
We selected three case studies where analyses had previously been  Heard et al., 2003) to study the effect of landscape features, environment and management on weed diversity and abundance in arable fields.

| Analysis
To recreate the case studies, we kept the analytical methods as true to each of the original publications as possible.
We used canonical correspondence analyses (CCA; ter Braak, 1986) for the FR case study. Following Lososová et al. (2004), we tested for gross and net effects of each explanatory variable on weed species composition. The explanatory variables considered were latitude (°N), longitude (°E), mean annual temperature (°C), annual precipitation (mm), soil pH, crop type, previous crop, herbicide treatment (presence/absence), position in the field (core/edge), sampling season and year of sampling. Separate CCAs with a single explanatory variable were used to test gross effects.
The effect of a particular variable after partitioning out the effect shared with the other explanatory variables (i.e., net effect) was tested using partial CCAs (pCCA), each with a single explanatory variable and the other 10 (9) variables used as covariates (see Step 1 below for details of explanatory variables). Significance was tested using 1000 permutations. We used the ratio of a particular canonical eigenvalue over the sum of all eigenvalues (total inertia) as a measure of the proportion of variation explained by each factor. We used the cca() function from the vegan package in R (Oksanen et al., 2022) [...]
In this case, the gross effect was calculated by using all variables of a group and the net effect by using the variables of the other groups as covariates. Each of the groups was formed of different explanatory variables. For the variation partitioning, we used Hellinger transformation to avoid horseshoe effects and to reduce the weight of rare species (Legendre & Gallagher, 2001). We used 500 permutations of the analysis to test for significance. In the data from AWME, there were some correlations between environmental variables. However, to keep the analysis as similar as possible to the original study, we opted to retain all variables in our analyses. We used the vegan package in R (Oksanen et al., 2022) to calculate the RDA and the variance partitioning.
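The two quantities named above, the Hellinger transformation and the ratio of a canonical eigenvalue to total inertia, can be illustrated with a minimal sketch. This is not the authors' code (their analyses used the vegan package in R); it is a plain Python illustration, and the function names, the toy abundance matrix and the eigenvalue figures are all hypothetical. The Hellinger transformation is taken here as the square root of each species' relative abundance within a plot.

```python
import math


def hellinger(abundance):
    """Hellinger-transform a plot-by-species abundance matrix (list of rows).

    Each value becomes sqrt(value / row total); empty plots stay all zero.
    """
    transformed = []
    for row in abundance:
        total = sum(row)
        if total == 0:
            transformed.append([0.0] * len(row))
        else:
            transformed.append([math.sqrt(value / total) for value in row])
    return transformed


def proportion_explained(canonical_eigenvalue, total_inertia):
    """Share of variation attributed to one factor: eigenvalue / total inertia."""
    if total_inertia <= 0:
        raise ValueError("total inertia must be positive")
    return canonical_eigenvalue / total_inertia


# Hypothetical toy data: three plots (rows) by three species (columns).
plots = [[9, 0, 1], [0, 4, 0], [25, 25, 50]]
hellinger_plots = hellinger(plots)

# Hypothetical eigenvalue and total inertia, chosen only for illustration.
share = proportion_explained(0.36, 5.0)
```

After the transformation, the squared values in each non-empty row sum to one, so a plot's total abundance no longer drives its distance to other plots, and rare species carry less weight than they would under chi-square-based distances.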
In contrast to the multivariate analyses in the FR and DE case studies, we used generalised linear mixed effects models (GLMMs) in the UK case study to investigate the effect of crop, herbicide treatment (presence/absence) and position in field (core/edge) on weed species richness and abundance. Species richness and weed abundance (obtained using count data) were assumed to follow a Poisson distribution and the rescaled data (see Table 1) were assumed to follow a normal distribution. We used the canonical link function (natural logarithm for Poisson responses, identity for normal responses). We estimated the dispersion parameter to account for over- and underdispersion. We considered the following terms in the fixed effects model: position in field (core/edge), crop type and herbicide treatment (presence/absence). We also included the second- and third-order interactions between these fixed effects. Terms were selected using backwards elimination according to the largest p-value given by an approximate F-test when that term was dropped (Kenward & Roger, 1997). The final predictive model was chosen when all remaining terms gave significant values (p ≤ 0.05) for an F-test when dropped from the model. All statistical analyses were done using R (R Core Team, 2022), except for the GLMMs, which were fitted using Genstat (Payne, 2013) to correspond with the original publication.
To address our aims objectively, we took a three-step approach to separate the effect of any data transformations and changes in scale.

| Step 1: Reframing the scope
To address some of the challenges associated with combining disparate datasets, several of the variables used in the original studies have been consolidated or simplified within AWME (Table 1). As such, it was necessary to redefine the scope of each of our case studies and repeat the analysis from the original publications using only the original data in the form in which it is contained within AWME. This gave us a baseline result within the original spatial extent and before any additional data transformations. In Step 1 [...]
In Step 3, we again repeated each analysis with the transformed response variable used in Step 2 but at a wider spatial extent, incorporating all relevant data from AWME from across Europe (Figure 1).
This final stage allowed us to assess whether the changes introduced in the previous two steps leave sufficient remaining records to make data analysis possible at a continental scale and to better understand whether results obtained from analyses at a national scale are truly representative of universal concepts within weed ecology.
We avoided the issue of spatial biases in the FR case study by randomly selecting one observation from fields with multiple records across all three steps of our analysis. To account for potential spatial bias (Table 1) in the European scale analysis of the other two case studies, we additionally trialled a spatial subsampling procedure (DE, see Figure S1) and fitted variograms to the response variable to test for spatial autocorrelation (UK). However, these were found to be unnecessary and so the results are not presented here.
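The random selection of one observation per field, as used in the FR case study, could be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation; the record layout, the `field_id` key and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def one_observation_per_field(observations, seed=0):
    """Keep one randomly chosen observation for each field.

    `observations` is a list of dicts with a hypothetical "field_id" key.
    A seeded generator keeps the selection reproducible between runs.
    """
    rng = random.Random(seed)
    by_field = defaultdict(list)
    for obs in observations:
        by_field[obs["field_id"]].append(obs)
    # One draw per field, regardless of how many records the field has.
    return [rng.choice(group) for group in by_field.values()]


# Hypothetical records: field F1 was surveyed three times, F3 twice.
observations = [
    {"field_id": "F1", "plot": 1},
    {"field_id": "F1", "plot": 2},
    {"field_id": "F1", "plot": 3},
    {"field_id": "F2", "plot": 1},
    {"field_id": "F3", "plot": 1},
    {"field_id": "F3", "plot": 2},
]
selected = one_observation_per_field(observations, seed=42)
```

Drawing once per field gives every field equal weight in the ordination, so heavily sampled fields do not dominate the result. Applying the same selection in all three steps keeps the comparison between steps consistent.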

| RESULTS
In our FR case study, crop type was the top-ranked explanatory variable in all three steps of our analysis (Table 2). [...] (Table 3). Focusing only on environmental variables (Figure 3), the combination of climate and geographical position explained most of the variation, whilst the net effects of climate, geographical position and soil were relatively small.
In our UK case study, both species richness and abundance were consistently higher at the field edge than in the field core (Figure 4). We also consistently found a strong effect of crop on both species richness and abundance [...]
[...] the analysis yielded similar results to those seen in the original publications. The transformation of response variables in Step 2 of our analysis caused little change to the results observed. For example, switching from an analysis based on abundance to one based on presence/absence data in the FR case study reduced the explained inertia from 7.2% to 6.6%, but the relative importance of the explanatory variables was almost identical [...]
[Figure panel titles: Step 1. Germany (in count); Step 2. Germany (in Barralis scale); Step 3]

| DISCUSSION
The consistency in results between the original studies and those we obtained in Step 1 of our analyses indicates that the consolidation of several environmental and management variables, and the associated reduction in the resolution of the data, has little impact on the ability to answer important ecological questions using the data stored within AWME. In all three case studies, we were able to alter the scope of the question slightly to allow the inclusion of additional datasets from across Europe.
[Figure panel titles: Step 1. UK (counts), N = 24 432; Step 2. UK (rescaled to zero mean and unit variance), N = 24 432; Step 3. Europe]
We identified several key challenges associated with combining data from multiple surveys. Some of these challenges are addressed within the AWME database itself, but others had to be addressed in our analysis (Table 1). Through our three case studies, we highlighted some example solutions to the challenges of combining disparate datasets. There is inherent data loss when analysing data from multiple sources as not all information is available from each source. In the case studies, we found that the resolution of explanatory variables was often coarser than in the original studies. There was also information loss from the exclusion of incomparable records or in the transformation of response variables (e.g., from counts to presence/absence). We addressed the challenge of different species abundance metrics by transforming the data to a common scale, which is known to influence the outcome of ordination analyses (Otýpková & Chytrý, 2006). However, we found that this transformation had little influence in our analysis. In case studies FR and DE, there was minimal impact on the percentage of explained inertia, and in case study UK, the variables identified as significant in the GLMM remained similar in each step of the analysis. This may be explained by the fact that the original datasets from France, Germany and the UK were collected on a national scale, representing a long gradient of 'heterogeneous' data with almost unique species composition in each plot. In this case, presence-absence data are sufficient to describe between-site variation. By contrast, in more homogeneous data at the scale of a small region, where many species are distributed in most sites, species abundance becomes the main source of between-site variation and switching to presence-absence data will have a more serious impact (Austin & Greig-Smith, 1968).
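The contrast drawn above between heterogeneous and homogeneous data can be made concrete with a small sketch. It is written in Python for illustration only and is not part of the original analyses; the function name and the two toy matrices are hypothetical.

```python
def to_presence_absence(abundance):
    """Convert a plot-by-species abundance matrix to presence (1) / absence (0)."""
    return [[1 if value > 0 else 0 for value in row] for row in abundance]


# Hypothetical heterogeneous data: the two plots share no species.
heterogeneous = [[12, 0, 0, 3], [0, 7, 5, 0]]

# Hypothetical homogeneous data: both plots hold every species and differ
# only in how abundant each one is.
homogeneous = [[12, 1, 2, 3], [1, 7, 5, 9]]

pa_heterogeneous = to_presence_absence(heterogeneous)
pa_homogeneous = to_presence_absence(homogeneous)

# True: the plots still differ, because their species lists differ.
plots_still_differ = pa_heterogeneous[0] != pa_heterogeneous[1]

# True: the plots become indistinguishable once abundances are dropped.
plots_collapse = pa_homogeneous[0] == pa_homogeneous[1]
```

In the first matrix the between-plot difference survives the conversion, whereas in the second it disappears entirely. This is the situation in which switching to presence-absence data would be expected to have a more serious impact.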
These findings suggest that future monitoring approaches aiming at analysis of species composition could focus on achieving a large sample size and reduce the sampling effort on each plot by estimating species abundance rather than counting individuals. This could allow a broader range of environments and/or management practices to be considered for the same sampling effort.
It is interesting to note that whilst the trends in the results observed for the UK dataset and the European dataset were similar in our UK case study, the magnitude of the abundance and species richness metrics diverged between the datasets. For both metrics, the absolute values of predictions for herbicide-treated and untreated plots in the UK were very similar, whilst at the European scale values were high in the untreated plots and much lower in the plots which had received herbicide. This exemplifies some of the key challenges described in Table 1. For the UK data, the herbicide-treated and untreated data come from the same plots, which were sampled before and after herbicide treatment, and so we would expect the weed communities to remain similar with a loss of some individuals and species following treatment. However, in the rest of the AWME database the treated and untreated plots may come from vastly different surveys with different plot sizes and/or different methodologies. It is also likely that the choice of the data collector to conduct a survey with or without herbicide treatment reflects the typical agronomy of the field, farm or region being surveyed. As such, survey data from sites with low or no herbicide use could be expected to have higher weed species richness and abundance than those where herbicide use is common (Hyvönen & Salonen, 2002). The choice to rescale the data within each dataset led to information loss on the absolute size of effect; however, it allowed us to consider all datasets with information on our variables of interest, and the step-wise analysis confirmed that this technique was effective in allowing us to understand the relative effects of our explanatory variables.
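Rescaling within each dataset, here taken to mean centring to zero mean and scaling to unit variance separately for every survey, can be sketched as below. This is an assumption-laden Python illustration, not the procedure as implemented by the authors: the `(dataset_id, value)` record layout is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def rescale_within_dataset(records):
    """Rescale values to zero mean and unit variance within each dataset.

    `records` is a list of (dataset_id, value) pairs (hypothetical layout).
    Assumes the sample standard deviation; datasets with fewer than two
    values, or with no variation, cannot be rescaled and raise an error.
    """
    by_dataset = defaultdict(list)
    for dataset_id, value in records:
        by_dataset[dataset_id].append(value)

    stats = {}
    for dataset_id, values in by_dataset.items():
        if len(values) < 2:
            raise ValueError(f"dataset {dataset_id!r} has fewer than two values")
        spread = stdev(values)
        if spread == 0:
            raise ValueError(f"dataset {dataset_id!r} has no variation")
        stats[dataset_id] = (mean(values), spread)

    return [
        (dataset_id, (value - stats[dataset_id][0]) / stats[dataset_id][1])
        for dataset_id, value in records
    ]


# Hypothetical counts from two surveys recorded on very different scales.
records = [("A", 2), ("A", 4), ("A", 6), ("B", 100), ("B", 300)]
rescaled = rescale_within_dataset(records)
```

After rescaling, survey A's values become -1, 0 and 1 even though its raw counts are far smaller than survey B's. This shows both the benefit (surveys with different methods become comparable) and the cost noted above (the absolute size of an effect is no longer recoverable).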
In all three case studies, we found some consistency in the conclusions that could be drawn at the national scale and at the European scale, particularly in terms of the role of management practices. In fact, in all three cases, we saw an increase in the certainty of predictions or in the explanatory power of the fitted model. This is primarily due to the additional data incorporated into the analysis, but it also indicates that the patterns observed at the national scale are largely supported by the additional datasets, as any contradictory data would likely weaken the strength of the results.
Where we observed differences in the results between the national scale and the continental scale was largely in the role of the environ- [...] (Fried et al., 2008). Soil pH is recognised as one of the most structuring factors clearly differentiating acidophilic and basophilic weed assemblages (Hüppe & Hofmeister, 1990; Pinke et al., 2010). That the soil effect seems less visible at the European scale may be due to a stronger differentiation of weed communities along longitudinal and temperature gradients.
Interestingly, in the FR case study, sampling season was the second most important gross effect at both the national and European scale. Therefore, the importance of the crop type variable is partly supported by warm continental countries (Pannonian plain of Hungary) or Mediterranean countries (Italy, Southern France) where the growing season is long enough to grow summer crops (Čarni et al., 2011). In northern Europe, the growing season is shorter and the differences between weed communities of winter and spring crops are less important. This difference could explain previous discrepancies between studies when reporting the relative importance of management (Hallgren et al., 1999) versus environmental factors (Lososová et al., 2004).
Beyond the case studies presented here, the large geographical extent covered by AWME makes it a particularly valuable resource to address questions where (i) a sufficiently large environmental gradient may not be present in national or regional surveys, like macroecological patterns or climate change effect prediction (space-for-time substitution) or (ii) an effect may be hypothesised to change with scale. Potential future studies using AWME could (a) test the abundant-centre hypothesis (Sagarin & Gaines, 2002), which assumes that a species becomes more abundant at the centre of its range, where the environmental conditions are most favourable, (b) test the theory of species assembly (Booth & Swanton, 2002) according to hierarchical filters starting from a true regional pool, (c) predict responses to climate change and/or extreme events using a space-for-time substitution, (d) predict the spread of invasive or troublesome weeds by identifying combinations of management practices, soil and climates suitable for these species to establish, (e) disentangle the effects of management systems and single management measures and (f) examine relationships between weed management, weed abundance/weed pressure and weed diversity.
We have demonstrated the utility of gathering weed survey datasets into a European-scale database and explored the challenges and opportunities that such a database presents. We demonstrated that the European scale allows us to confirm and enrich previous works.
Our case study results should encourage us to use existing datasets to tackle more ambitious issues which require the perspective of a larger geographical area or a range of spatial scales as these can be combined with minimal loss of information despite different methodologies. The AWME database is a growing collection, and we welcome new data contributions and requests for data for analysis.