Data analysis of polygalacturonase inhibiting proteins (PGIPs) from agriculturally important proteomes

The plant cell wall structure can be altered by pathogen-secreted polygalacturonases (PGs) that cleave the α-(1→4) linkages occurring between D-galacturonic acid residues in homogalacturonan. The activity of the PGs leads to cell wall maceration, facilitating infection. Plant PG inhibiting proteins (PGIPs) impede pathogen PGs, impairing infection and allowing the plant to resist it. Analyses show the Glycine max PGIP11 (GmPGIP11) is expressed within a root cell that is parasitized by the pathogenic nematode Heterodera glycines, the soybean cyst nematode (SCN), while the cell undergoes a defence response that leads to the nematode's demise. Transgenic experiments show GmPGIP11 overexpression leads to a successful defence response, while the overexpression of a related G. max PGIP, GmPGIP1, does not, indicating a level of specificity. The analyses presented here have identified PGIPs from 51 additional proteomes, many of agricultural importance. The analyses include the computational identification of signal peptides and their cleavage sites, and of O- and N-glycosylation. Artificial intelligence analyses predict where the processed proteins localize. The identified PGIPs are presented as a tool base from which functional transgenic studies can be performed to determine whether they have a role in plant-pathogen interactions.


Keywords: Plant interactions; Polygalacturonase inhibiting protein (PGIP); Soybean; Heterodera glycines; Beta vulgaris; Sugar beet
Published by Elsevier Inc.

Signal peptide prediction
Signal peptide prediction is done using SignalP 6.0. The default parameters are used.

O-glycosylation determination
O-glycosylation is determined using NetOGlyc-4.0. The parameters are set to default.

N-glycosylation determination
N-glycosylation is determined using NetNGlyc-1.0. The parameters are set to default.

Protein alignment
Protein alignment is performed using CLUSTAL Omega, CLUSTAL O(1.2.4) multiple sequence alignment. The analysis is performed using default parameters.

Artificial intelligence
Prediction of eukaryotic protein subcellular localization using deep learning is done using DeepLoc-1.0 in default settings.

Value of the Data
• Why are these data valuable?
Plants have a 2-tiered defence platform allowing them to defend themselves from pathogens [1]. The plant recognizes epitopes produced directly or indirectly as a consequence of the plant-pathogen interaction [1]. The epitopes are collectively called pathogen-associated molecular patterns (PAMPs), which act within a 2-tiered defence system involving PAMP (pattern) triggered immunity (PTI) and effector triggered immunity (ETI) [1].
Plant cell walls are an important barrier to pathogen infection. Up to 60% of the cell wall pectic moieties of dicot and non-graminaceous monocot primary cell walls are homogalacturonans (HGs), the major component of the middle lamella [2]. Pathogen polygalacturonases (PGs) are effective in facilitating pathogenicity because they break down cell wall polymers, permitting infection [3]. The study presented here is valuable to those interested in understanding plant defence, the evolutionary processes behind defence, cell signalling, and basic cellular processes.
• Who can benefit from these data?
In order to impede pathogen PGs, plants secrete polygalacturonase inhibiting proteins (PGIPs). PGIPs have a bimodal function. Firstly, PGIPs directly inhibit PGs. Secondly, PG activity leads to oligogalacturonide (OG) accumulation, eliciting a defence response [4]. Therefore, PGIPs deactivate the pathogen effector while also leading to the production and amplification of a signalling cascade. This signalling cascade further impairs the pathogens, leading to their demise. For example, a Beta vulgaris (sugar beet) PGIP, when expressed in Nicotiana benthamiana, limits the pathogenicity of Rhizoctonia solani, Fusarium solani, and Botrytis cinerea, whose pathogenicity is normally driven by their PGs [5]. Previous work has functionally examined the G. max PGIPs (GmPGIPs) [6], benefitting stakeholders interested in the development of pathogen-resistant crops, including Beta vulgaris ssp. vulgaris (sugar beet). Novel signalling events can also be determined through the study presented here.
• How can these data be reused by other researchers?
Using 11 G. max PGIP protein sequences, the analysis presented here extracts the PGIPs that exist in 51 additional genomes of other important crops and other flowering plants. Analyses determine whether the 469 proteins have a signal sequence and a cleavage site, consistent with their being secreted proteins, and whether they are O- and/or N-glycosylated. Artificial intelligence analyses show in which cellular locale the proteins can be expected to exist, complementing recent transgenic studies of GmPGIP11. The provided analysis and accompanying data can be re-used to study basic aspects of plant cell biology and to generate pathogen resistance in a wide spectrum of agriculturally important crops. The evolution of defence and signalling processes can also be examined.

Data Description
A total of 469 proteins obtained from Phytozome, not including those of G. max, are analysed, spanning 51 proteomes (Table 1, Supplemental Data File 1) [7]. Proteins are treated as probable PGIPs when their length falls between 300 and 399 AAs, within the range of known PGIPs. Among the 469 proteins, 394 putative PGIPs (84%) are between 300 and 399 AAs. Across the 51 proteomes, 45 proteins (9.6%) are shorter than 300 AAs. Furthermore, 30 proteins (6.4%) annotated as PGIPs are 400 AAs or larger. LRRs have a low overall homology based on the LRR composition. For example, BvPGIP6 (EL10Ac4g07809.1) is annotated as being 1,383 AAs. When a BLASTP analysis is run, it is shown to be homologous to the 1,249 AA GASSHO1 (GSO1) (OAO97463.1) as well as the 332 AA polygalacturonase inhibiting protein 1 (AAM65836.1). A re-annotation of the PGIPs is beyond the scope of the study.
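The length classes above can be tallied with a few lines of Python. This is a minimal sketch, not the workflow used for the study: the function names are hypothetical, and the demo list contains placeholder lengths chosen only so that the class counts match those reported in the text (45, 394, and 30 proteins).

```python
from collections import Counter

def length_class(n_residues: int) -> str:
    """Assign a protein to one of the three length classes used in the text."""
    if n_residues < 300:
        return "<300"
    if n_residues <= 399:
        return "300-399"
    return ">=400"

def summarize(lengths):
    """Return {class: (count, percent of total)} for a list of protein lengths."""
    counts = Counter(length_class(n) for n in lengths)
    total = sum(counts.values())
    return {
        label: (counts[label], round(100 * counts[label] / total, 1))
        for label in ("<300", "300-399", ">=400")
    }

# Placeholder lengths (not real data): only the class counts mirror the text.
demo_lengths = [250] * 45 + [332] * 394 + [1383] * 30
summary = summarize(demo_lengths)
print(summary)
```

In practice the lengths would be read from the protein FASTA records in Supplemental Data File 1 rather than typed in by hand.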

Signal peptide prediction
Signal peptide prediction is performed to determine whether the 469 identified proteins exhibiting homology to PGIP have characteristics of secreted proteins (Supplemental Data File 2). The protein sequences are imported into SignalP 6.0 [8, 9]. The number of putative PGIPs with predicted signal peptides is reported (Table 2; Supplemental Data File 3).

Comparison of O- and N-glycosylation of GmPGIPs
A companion analysis demonstrates that GmPGIP11, but not GmPGIP1, functions in the defence response that G. max has toward H. glycines parasitism. A comparative analysis of G. max PGIPs is undertaken to determine whether O- and/or N-glycosylation could be correlated with these differences. The O-glycosylation analysis demonstrates that while GmPGIP1 is O-glycosylated, GmPGIP11 is not (Table 3; Supplemental Data File 4).
In contrast to the above findings, both GmPGIP1 and GmPGIP11 are predicted to be N-glycosylated. However, some of their predicted N-glycosylation sites are not at homologous aa positions (Fig. 1; Supplemental Data File 5). For example, the NPTT site found in GmPGIP1 and starting at aa position 41 is not identified in GmPGIP11 (Fig. ). In contrast, an NLSG site found in GmPGIP11 and starting at aa position 101 is not found in GmPGIP1. Similarly, an NLSG predicted N-glycosylation site found in GmPGIP11 and starting at aa position 174 is not found in GmPGIP1 (Fig. 1). Furthermore, an NKTT predicted N-glycosylation site found in GmPGIP11 and starting at aa position 258 is not found in GmPGIP1 (Fig. 1). However, N-glycosylation sites that are in homologous positions between GmPGIP1 and GmPGIP11 do exist (Fig. 1). GmPGIP1 has an NVSG predicted N-glycosylation site starting at aa position 132, while GmPGIP11 has an NVSG predicted N-glycosylation site starting at aa position 150 (Fig. 1). Consequently, while experimentation has not proven that these sites are important to the functional differences occurring between GmPGIP1 and GmPGIP11, they are different and provide a basis for future experimentation.
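The site names used above (NPTT, NLSG, NKTT, NVSG) follow the canonical N-X-S/T sequon plus the following residue. The sketch below is a plain motif scan that lists such candidates with 1-based start positions. It is not the NetNGlyc-1.0 predictor, which scores sequons with neural networks, and the toy sequence is invented for illustration. A proline at the X position is generally considered unfavourable for N-glycosylation, so the sketch flags it instead of discarding the site.

```python
import re

# N-X-S/T sequon plus the next residue when present.
# The lookahead lets overlapping candidates be reported.
SEQUON = re.compile(r"(?=(N[A-Z][ST][A-Z]?))")

def find_sequons(seq: str):
    """Return (1-based start, motif, proline_at_X) for each candidate sequon."""
    sites = []
    for match in SEQUON.finditer(seq.upper()):
        motif = match.group(1)
        sites.append((match.start() + 1, motif, motif[1] == "P"))
    return sites

# Invented toy sequence, not a real PGIP.
toy = "MANPTTAANLSGNNVSG"
toy_sites = find_sequons(toy)
for start, motif, has_proline in toy_sites:
    note = " (proline at X)" if has_proline else ""
    print(f"{motif} starting at aa position {start}{note}")
```

Running the same scan on two aligned sequences and comparing alignment columns is one way to check whether candidate sites fall at homologous positions.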

Artificial intelligence
The 469 identified PGIP proteins spanning the 51 genomes are assessed by artificial intelligence analyses to produce a sequence position file (Supplemental Data File 6). A second file maps the cellular destination where each protein is predicted to function (Table 3; Supplemental Data File 7). An example for Beta vulgaris BvPGIP4, shown to function in defence to various pathogens in N. tabacum, is presented (Fig. 2) [5].
The secretion of plant proteins is an important cellular property used for a variety of processes including development and disease resistance [10]. The data presented here provide computational support that PGIPs belonging to taxa positioned at the base of angiosperm evolution are predicted to have signal peptides, have O- and/or N-glycosylation, and undergo secretion into the apoplast. Further assessment identifies PGIPs from both monocot and dicot lineages with predicted signal peptides and the subcellular or supracellular compartment to which they are targeted [5, 11].

Analysed proteomes
The study analyses the proteomes of 51 plants, not including G. max, many important to agriculture. The 51 proteomes include one at the base of angiosperm evolution, A. trichopoda, a member of a monotypic genus of Amborellaceae and the only member of the Amborellales, which has 2 predicted PGIP proteins [12]. Each of these PGIPs is predicted to have a signal peptide, experience O- and N-glycosylation, and undergo secretion into the apoplast. The monocots are represented by A. comosus, D. alata, M. acuminata, H. vulgare, O. sativa, T. aestivum, B. distachyon, M. sinensis, S. bicolor, Z. mays and P. hallii, with the remaining plants belonging to the eudicots. All of the studied species have at least one putative PGIP that is predicted to have a signal peptide, have O- and/or N-glycosylation, and be secreted into the apoplast.
Local duplication of plant genes, including PGIPs, results in the generation of genes whose protein products perform an important function in defence [6]. The PGIP proteins identified here also appear to be products of localized gene duplications. Consequently, the identified genes may relate to the birth-and-death model proposed for PGIPs [6]. Possible localized gene duplication is identified from the analysis of the 51 proteomes. Based on the annotations, the analysis identifies direct tandem duplications of at least one PGIP gene in 29 of the 51 proteomes, including A. hypochondriacus, B. vulgaris, C. quinoa, C. arabica, D. carota, M. guttatus, O. europaea, E. grandis, C. arietinum, M. domestica, M. truncatula, P. vulgaris, P. persica, …

… show: 100, allowing for gaps and filter query, in order that they appear on the BLAST program. Through these analyses it is possible to extract the genomic DNA, transcript, cDNA, and protein accessions, their sequences, and gene family members. The analyses also permit the extraction of protein homologs and splice variants from the selected agricultural crops of international importance, those with importance in the U.S., and those important biologically according to [8].
The identified PGIP proteins are compiled using a Bitscore of 140 as a cutoff. To identify the PGIP proteins, each of the 11 G. max PGIP protein sequences is queried against the studied proteomes. The results of the individual queries for GmPGIP1 through GmPGIP11 are stored in individual tabs in Excel. Then, the PGIPs that have Bitscores of 140 or higher are compiled across all of the queries for the individual GmPGIPs. The duplicate PGIPs are then removed in Excel. The analysis results in a list of PGIP proteins that includes the products of alternate splicing, so in some cases the number of proteins is higher than the number of genes in a genome.
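The compile-and-deduplicate step described above was carried out in Excel. As a rough equivalent, the sketch below applies the Bitscore cutoff of 140 and removes duplicate subjects across queries. The function name, the input layout (a dictionary of query name to hit list), and the demo identifiers and scores are all hypothetical. Recording the best-scoring query for each retained protein is an added convenience, not something the text states.

```python
BITSCORE_CUTOFF = 140

def compile_hits(hits_by_query, cutoff=BITSCORE_CUTOFF):
    """Pool hits from all queries, keep those at or above the cutoff,
    and keep each subject protein once (with its best-scoring query)."""
    best = {}
    for query, hits in hits_by_query.items():
        for subject, bitscore in hits:
            if bitscore < cutoff:
                continue  # below the Bitscore cutoff
            if subject not in best or bitscore > best[subject][1]:
                best[subject] = (query, bitscore)
    return best

# Invented demo hits: identifiers and scores are placeholders.
demo = {
    "GmPGIP1": [("protA.1", 310.0), ("protB.1", 139.9)],
    "GmPGIP11": [("protA.1", 295.5), ("protA.2", 140.0), ("protC.1", 88.2)],
}
compiled = compile_hits(demo)
for subject in sorted(compiled):
    query, bitscore = compiled[subject]
    print(subject, query, bitscore)
```

Splice variants such as the placeholder `protA.1` and `protA.2` are kept as separate entries, which matches the note that protein counts can exceed gene counts.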

Signal peptide prediction
Signal peptide prediction is done using SignalP 6.0 [9]. SignalP 6.0 is based on protein language models (LMs). The models use information from millions of unannotated protein sequences from across all domains of life. LMs create meaningful protein representations capturing their biological structure and properties. SignalP 6.0 thus predicts additional SP types not possible in earlier iterations of SignalP (e.g., SignalP 5.0) and better extrapolates to proteins distantly related to those used to create the model and to metagenomic data of unknown origin. SignalP 6.0 also identifies SP subregions. The default parameters are used.

O-glycosylation determination
O-glycosylation is determined using NetOGlyc-4.0 [17]. The parameters are set to default. The output is imported into Excel.

N-glycosylation determination
N-glycosylation is determined using NetNGlyc-1.0 [18]. The parameters are set to default. The output is imported into Excel.

Protein alignment
Protein alignment is performed using CLUSTAL Omega, CLUSTAL O(1.2.4) multiple sequence alignment [19]. The analysis is performed using default parameters. The output file is imported into MS Word.

Artificial intelligence
Prediction of eukaryotic protein subcellular localization using deep learning is done using DeepLoc-1.0 [20] with default settings. The DeepLoc-1.0 analysis determines the importance of each amino acid along a protein chain for the prediction of its subcellular location (attention). DeepLoc-1.0 then predicts the subcellular localization of eukaryotic proteins, differentiating between 10 localizations: the nucleus, cytoplasm, extracellular, mitochondrion, cell membrane, endoplasmic reticulum (ER), chloroplast, Golgi apparatus, lysosome/vacuole, and peroxisome. The output of the analysis is presented as a graphic that shows the relative importance of each AA along the polypeptide chain as well as a hierarchical tree that shows where the protein is expected to be located within a cell [20].

Fig. 2. Predicted O- and N-glycosylation sites of G. max PGIP proteins. Cyan, predicted O-glycosylation site. Magenta, N-glycosylation site. Yellow, an aa that overlaps between two predicted N-glycosylation sites. Blue, an aa that overlaps between an O- and N-glycosylation site. Gray, possible mis-annotated N-terminal sequence.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

The 11 G. max PGIP protein sequences are used in Basic Local Alignment Search Tool (BLAST) searches of the proteomes (BLASTP) using the default parameters at Phytozome (http://www.phytozome.net/).

Table 1
The proteomes under study.

Table 3
The signal peptide prediction, O- and N-glycosylation prediction, and cellular location prediction.