An updated LSU database and pipeline for environmental DNA identification of arbuscular mycorrhizal fungi

Recent work established a backbone reference tree and phylogenetic placement pipeline for identification of arbuscular mycorrhizal fungal (AMF) large subunit (LSU) rDNA environmental sequences. Our previously published pipeline allowed any environmental sequence to be identified as putative AMF or within one of the major families. Despite this contribution, difficulties in implementation of the pipeline remain. Here, we present an updated database and pipeline with (1) an expanded backbone tree to include four newly described genera and (2) several changes to improve ease and consistency of implementation. In particular, packages required for the pipeline are now installed as a single folder (conda environment) and the pipeline has been tested across three university computing clusters. This updated backbone tree and pipeline will enable broadened adoption by the community, advancing our understanding of these ubiquitous and ecologically important fungi. Supplementary Information The online version contains supplementary material available at 10.1007/s00572-024-01159-3.


Introduction
The use of environmental DNA and amplicon sequencing has become a fundamental tool for fungal ecologists (Tedersoo et al. 2022), and particularly for those studying mycorrhizal fungi.These approaches enable researchers to obtain a relatively unbiased snapshot of the community using a very small amount of starting material (e.g., soil or roots).In theory, this should generate large amounts of data to understand the biogeography and ecology of this important group of fungi.However, the most common type of mycorrhizal fungi -the arbuscular mycorrhizal fungi (AMF, phylum Glomeromycota) -are not well represented when using the "universal" fungal primers targeting the internal transcribed spacer of the rDNA gene (Schoch et al. 2012;Lekberg et al. 2018;Tedersoo et al. 2022).Furthermore, AMF sequence databases are geographically biased and incomplete (Bidartondo et al., Öpik et al. 2010;Stockinger et al. 2010).Therefore, the full potential of this technology for AMF requires use of other gene regions and phylogenetic placement approaches.
Although there is no consensus on the ideal region to target or sequencing platforms to use for AMF, the large subunit (LSU) amplicon (short read) sequencing is a good choice for next generation sequencing (Illumina;Stockinger et al. 2010;Delavaux et al. 2021, Delavaux et al. 2022).This region allows phylogenetic placement and subsequent discovery of novel taxa, while showing good taxonomic resolution at the species level (Krüger et al. 2009;Hart et al. 2015;House et al. 2016).Previous work established a curated reference tree and developed a pipeline for phylogenetic placement of AMF for the LSU region (Delavaux et al. 2021(Delavaux et al. , 2022)).Specifically, this work enabled phylogenetic placement of any environmental sequence into the backbone reference tree.Using these tools, environmental sequences can be placed into the Glomeromycota phylum -describing putative AMF -and further into the major 11 families.Importantly, the concerted effort packaging this process into a formal pipeline improved accessibility of this phylogenetic approach for interested scientists.
Despite the benefit to the research community brought by the new pipeline, users unfamiliar with command line interfaces and with low computing cluster support struggled with implementation, resulting in a substantial hurdle to the systematic adoption of this pipeline for those using the LSU.The main issues with implementation are trouble installing packages as required by the pipeline (specific versions and packages that require a conda environment, Anaconda 2024) and cluster specific assumptions in bash job scripts.
Here, we update the database and pipeline by (1) expanding the backbone tree to include newly described genera and (2) further improve usability of this pipeline, including creating a conda requirements package hosting all required packages and testing the pipeline across three university clusters.This updated backbone reference tree and user-friendly pipeline will contribute to broadened adoption of this tool, ultimately improving the scientific community's understanding of the ecology of these important fungi (Table 1).

One installation for the entire pipeline
Although previous formalizing of the database and pipeline helped improve accessibility to phylogenetic placement for the LSU region, even this streamlined version may be intimidating to those who are unfamiliar with SLURM (Simple Linux Utility for Research Management) cluster environments and the linux command line.Therefore, here, we provide a new simple installation mechanism (conda requirements file, https://github.com/conda/conda)that requires only one installation per user.This substantially reduces the installation burden as compared to the previous pipeline, which required five program and six R package installations.To allow for this single installation containing all required programs and packages, we begin from a stable base of qiime2 2024.2n(Bolyen et al. 2019) to which we add a single R package (treetools; Smith 2019).Not only does this approach require a single installation, but the required programs and packages remain unchanged, with versions pre-set.This reduces any unforeseen version incompatibility within the pipeline.In addition, we also replaced a dependency external to qiime2 (fastqc; Andrews 2010) for read quality visualization.Instead of relying on fastqc, we now use qiime2's built in visualization commands paired with qiime's online visualization tool (view.qiime2.org).

Major improvement Explanation 1. Backbone tree updated
Four new genera were added to the tree to reflect taxonomic discoveries.

Condensed ReadMe
The ReadMe has been condensed to concisely, but clearly, outline steps in the pipeline.

Simplified conda installation A conda requirements file is installed once and includes all required
programs and R packages in specified versions.

Cross computing cluster compatibility
Several changes have been made to the scripts to be compatible with the three SLURM clusters environments tested.

Public repository tracking changes
Our Github repository has all implemented changes tracked; we will continue to update any changes here.

Table 1 Improvements to the pipeline
Put simply, this approach allows users to install one 'meta package' or folder (the conda environment) containing all packages in the exact versions required for the pipeline to run; when running the pipeline, the user simply loads this environment and can be confident that all packages and commands are exactly as intended.This new single installation will not only make it easier for all users, but especially allow for less experienced users with low cluster support to access and benefit from the pipeline.

Ground truthing ease of implementation across three computing clusters
Previous development and testing of the pipeline was conducted on only one computing cluster at the University of Kansas, raising concerns of transferability of the pipeline across different SLURM environments.Importantly, small differences in cluster setup can lead to script failure, presenting an unnecessary difficulty for those implementing the pipeline for their own research.Therefore, this current pipeline was tested and adapted to work consistently across three different university computing clusters.Specifically, we have tested the pipeline using ten samples of data from tallgrass prairie soil samples on the (1) University of Kansas, (2) University of Colorado Boulder (Alpine), and (3) ETH Zurich (Euler) computing clusters.The ETH Zurich cluster does not support or install conda environments for users at this time, making it the ideal test for the newly integrated conda environment.The successful implementation of the pipeline on a cluster with no conda support gives us confidence that it should function smoothly at other clusters.We retain the ability to choose ASV or OTU outputs -via two versions of the pipeline -and ensure both pipelines have consistent naming and updates.Another change we implemented was explicitly directing temporary files, as this was causing issues on certain clusters (due to cluster level default differences).All temporary files should now consistently write to the main project folder, and be removed upon completion, independent of cluster configuration.Together, this testing across two continents and three

Conclusions
We present an updated user-friendly database and pipeline for environmental placement of arbuscular mycorrhizal fungi using the LSU rDNA gene region.Our major improvements include (1) providing an expanded backbone reference tree to reflect recently described taxa (2) streamlining the requirements for users to implement the pipeline on a cluster relying on the most common workload manager (SLURM) and (3) minimizing likelihood of cluster specific issues by testing the pipeline with the same dataset across three computing clusters.These improvements may further increase the utility of the LSU region for identification of AMF and AMF families of known and previously undescribed AMF.