Protocol to analyze the bacterial pangenome using PAN2HGENE software

Summary
PAN2HGENE is a computational tool that enables two main analyses. First, it can identify gene products absent from the original prokaryotic genome sequence. Second, it enables automated comparative analysis of both complete and draft genomes. All analyses are performed through a simple, intuitive graphical user interface, without the need for long and complex command lines. For complete details on the use and execution of this protocol, please refer to Silva de Oliveira (2021).


MATERIALS AND EQUIPMENT
This protocol was created using a desktop with a fourth-generation Core i5 processor, 16 GB of RAM, and the Ubuntu 20.04 operating system. PAN2HGENE has also been successfully tested on the Debian and Mint operating systems. To reproduce the protocol on Debian, log in as root and remove the sudo prefix from each protocol command.

STEP-BY-STEP METHOD DETAILS
Step 1: install dependencies
Timing: 1 h
The complete installation of PAN2HGENE starts with the installation of the system dependencies and then continues with the installation of the other programs that are part of the pipeline.
1. To start the installation, run the commands in Box 1 below to update the system.
CRITICAL: If you face any problem with the commands in Box 1, we suggest updating your system with the commands 'sudo apt-get update' and 'sudo apt-get upgrade'. After this, try to run Box 1 again.
2. Then install some packages and programs with the commands below.
Box 1
sudo apt-get install make
sudo apt-get install build-essential
sudo apt-get install curl
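As a quick sanity check after Box 1, the presence of the installed tools on the PATH can be confirmed from a short script. The following is a minimal sketch, not part of PAN2HGENE: the list of executables is an assumption (gcc is used here as a stand-in for the build-essential package), and the helper name missing_tools is hypothetical.

```python
import shutil

# Executables assumed to be provided by the Box 1 packages.
# "gcc" is an assumed stand-in for the build-essential package.
REQUIRED_TOOLS = ["make", "gcc", "curl"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If any tool is reported as missing, rerunning the corresponding line of Box 1 is the first thing to try.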
a. At the end of the installation, the user must move the SPAdes folder to the /opt directory.
b. To validate the installation, run the command in Box 5; the expected output is shown in Figure 1.
a. The next step is to configure the PGAP script (/opt/PGAP/PGAP.pl). This can be done using gedit or any other text editor.
b. Modify the software paths so that the lines of the file match the uncommented lines (those that do not start with the # character). The execution path of each program (e.g., /usr/bin/formatdb) must be adjusted to the location of that software on your operating system. In Box 8 below, the lines that do not start with the # character show the configuration used on the Ubuntu operating system.
system("perl ./multiparanoid.pl -species ".join(".pep+",@species).".pep -unique 1");
system("perl ./Blast_Filter.pl All.blastp All.pep $coverage $identity $score | $mcl --abc -I 2.0 -o All.cluster");
pep");
Box 10
system("perl /opt/PGAP/multiparanoid.pl -species ".join(".pep+",@species).".pep -unique 1");
system("perl /opt/PGAP/Blast_Filter.pl All.blastp All.pep $coverage $identity $score | $mcl --abc -I 2.0 -o All.cluster");
a. To verify that Prokka was installed correctly, run the command below; the result should be similar to Figure 3.
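The difference between the relative-path lines of PGAP.pl and the /opt/PGAP lines of Box 10 is a plain text substitution, so it can be illustrated with a short script. This is a minimal sketch under the assumption that PGAP is installed in /opt/PGAP; the function name use_absolute_paths is hypothetical, and the protocol itself makes this edit by hand in a text editor.

```python
import re

# Matches "perl ./<script>.pl", the relative form used in the unmodified lines.
RELATIVE_SCRIPT = re.compile(r"perl \./(\w+\.pl)")


def use_absolute_paths(line, install_dir="/opt/PGAP"):
    """Rewrite 'perl ./script.pl' as 'perl <install_dir>/script.pl'.

    Lines that already use an absolute path are returned unchanged.
    """
    return RELATIVE_SCRIPT.sub(
        lambda match: f"perl {install_dir}/{match.group(1)}", line
    )


before = 'system("perl ./multiparanoid.pl -species ".join(".pep+",@species).".pep -unique 1");'
after = use_absolute_paths(before)
print(after)
```

Applied to the two relative-path lines quoted above, the function yields the two lines of Box 10. Keep a backup copy of PGAP.pl before changing it by any automated means.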
6. R Installation. The R software is installed by following the commands below.
a. After installation, run R and install the libraries.
a. To test that everything is correct, run the command below; the result should be similar to Figure 4.
8. MySQL Installation. To install MySQL server run the following commands.
a. MySQL will ask you to create a password for the root user. Enter the password and answer Y when asked (Figure 5).

Timing: 1 h 27 min
After installing all dependencies, the user must download PAN2HGENE. To start using it, follow the steps below.
9. PAN2HGENE Download. The PAN2HGENE jar package is available at https://sourceforge.net/projects/pan2hgene-software/. Download the pan2hgenev2.0.jar and lib_v2.tar.xz files and keep both in the same directory. See the example below. The PAN2HGENE pipeline can be executed in three different ways, each one performing a specific analysis.
10. PAN2HGENE Product Identification analysis.
Note: In this example, product identification analysis is performed using a Bifidobacterium breve DSM20213 genome and paired reads from the Illumina HiSeq 2000. To start the Product Identification analysis, place the FASTA genome and the FASTQ reads in the same folder, then follow the steps below.
a. If this is your first use, enter the root user and password and press the Create DB button. Otherwise, enter the root user and root password in the indicated fields and click the Connect button. Then click the Next button (Figure 6).
b. On the following screen, enter the project name and select the type of analysis to be performed. In this example, the name "Test1" and the Product Identification option were used. After that, press the New button (Figure 7).
c. Data input is done in the following window. Press the Browse button to select the FASTA file (remember that the FASTA genome and the FASTQ reads must be in the same folder). The reads files will be displayed below. Select the appropriate reads for the organism, indicate the type of reads and, if they are paired, indicate the order and orientation (Figure 8).
d. Press the Add Read button; a confirmation message will be displayed (Figure 9).
e. Note that the reads are now marked as used. Repeat the same process if there are more genomes, then click Next (Figure 10).
f. On the next screen, it is possible to modify the PAN2HGENE parameters for Bowtie, Comparative Analysis, and the Annotation process. In this case, we will use the default parameters, so just click the Save Data button and then click Next (Figure 11).
g. To start the analysis, click the Perform analysis button on the next screen. The Logs field shows the analysis steps as they are performed. When the analysis is finished, the message Complete Analysis will appear in the Logs field. Now close PAN2HGENE and go to the folder where the data was analyzed (Figure 12).
h. The folder will contain several files in addition to the FASTA genome and the FASTQ reads used in the analysis. The results of the Product Identification analysis are the three marked files: GenomeNameBlastResult_Products.fasta, GenomeNameBlastResult_report.pdf, and GenomeNameBlastResult_Report.txt (Figure 13).
11. PAN2HGENE Comparative analysis. In this example, three FASTA genomes were used: Bifidobacterium breve DSM20213 (complete), Bifidobacterium breve NCTC11815 (complete), and Bifidobacterium breve PRL2020 (draft with six contigs). To start the Comparative Analysis, place all FASTA genomes in the same folder as shown below.
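Before launching the Comparative Analysis, the contents of the input folder can be inspected from a script. The sketch below lists the FASTA files in a folder, counts the contigs in each one (one contig per header line starting with '>'), and checks the minimum of three genomes mentioned in the troubleshooting section. The accepted file extensions and all function names are assumptions of this sketch and are not defined by PAN2HGENE.

```python
from pathlib import Path

# File extensions treated as FASTA here are an assumption of this sketch.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}
MIN_GENOMES = 3  # minimum for comparative analysis stated in the troubleshooting section


def count_contigs(lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def summarize_genomes(folder):
    """Map each FASTA file name in `folder` to its number of contigs."""
    summary = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            with open(path) as handle:
                summary[path.name] = count_contigs(handle)
    return summary


def enough_genomes(summary, minimum=MIN_GENOMES):
    """Return True when at least `minimum` genomes were found."""
    return len(summary) >= minimum
```

For the example above, a complete genome would typically be reported with one contig per replicon, and the PRL2020 draft with six.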
CRITICAL: The initial steps are the same as shown in Figures 6-12, with the exception of the input data window that changes for this specific analysis.
a. Once the directory of the FASTA files is provided, the files will be listed as shown in Figure 14.
b. The analysis results are organized in the pgfiles directory. The results of the Comparative Analysis are the files whose names start with the numbers 1, 2, 3, 4, and 5, together with the figures in PNG format (Figure 15).
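The result files described in step b can also be gathered programmatically, for example to copy them elsewhere. The following is a minimal sketch that assumes only what is stated above: results live directly inside the pgfiles directory, result tables have names starting with a digit from 1 to 5, and figures have a .png extension. The function name comparative_results is hypothetical.

```python
from pathlib import Path


def comparative_results(pgfiles_dir):
    """Return (numbered_files, png_figures) found directly in `pgfiles_dir`.

    numbered_files: non-PNG files whose names start with 1, 2, 3, 4 or 5.
    png_figures:    all files with a .png extension.
    """
    entries = sorted(p for p in Path(pgfiles_dir).iterdir() if p.is_file())
    figures = [p.name for p in entries if p.suffix.lower() == ".png"]
    numbered = [
        p.name
        for p in entries
        if p.name[0] in "12345" and p.suffix.lower() != ".png"
    ]
    return numbered, figures


if __name__ == "__main__":
    if Path("pgfiles").is_dir():
        tables, plots = comparative_results("pgfiles")
        print("Result files:", tables)
        print("Figures:", plots)
```

An empty list of figures after a run that reported Complete Analysis would suggest checking the R installation from step 6.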
12. PAN2HGENE Full pipeline. The Full Pipeline analysis performs Product Identification analysis and Comparative Analysis automatically and sequentially. Thus, the new gene products identified in the Product Identification step will be used in the Comparative Analysis step.
Note: Now follow the steps described previously in item '10. PAN2HGENE Product Identification analysis', selecting Full Pipeline analysis instead of Product Identification (Figure 16).
13. Main graphical results. The graphs are produced by running the comparative analysis, so their creation is included in the processing time (Figures 17 and 18).

Timing: 2 min
PATRIC has been integrated into the pipeline as an alternative automatic annotation tool. Thus, the user is free to choose between PATRIC and Prokka for the annotation.
14. PAN2HGENE is now ready to use. If you do not want to use PATRIC for the annotation, it is not necessary to perform the following steps.
Note: PAN2HGENE offers the option to perform all annotation analyses through PATRIC instead of Prokka (which is the default option). If you want to use PATRIC in the annotation process, follow the steps below. If you do not already have a PATRIC account, you will have to register at https://patricbrc.org/.
a. Install the PATRIC Command Line Interface.
b. If you prefer, you can also install PATRIC using the gdebi tool.
c. Configuration: copy the file "p3-login.pl", provided with the PAN2HGENE files, into the Patric-cli installation directory, replacing the existing file.
d. When performing any of the PAN2HGENE analyses, it is possible to choose the PATRIC annotation instead of the Prokka annotation. Before saving the parameters, click on the PATRIC button (Figures 19 and 20).

EXPECTED OUTCOMES
Although there are several tools for comparative analysis, PAN2HGENE stands out by offering a simple graphical interface that facilitates the analysis, instead of complex command lines. The tool can identify possible new gene products in a genome and, unlike other tools, can also perform the comparative analysis using both complete and draft genomes. Finally, it is important to note that both analyses are performed automatically, and that the input data are FASTA genomes and their reads in FASTQ format, without the need for the user to create standardized inputs or manipulate input files (PAN2HGENE already does this automatically too).

LIMITATIONS
PAN2HGENE can perform comparative analysis using complete and draft genomes as input. It performs automatic annotation through Prokka or PATRIC. It is important to emphasize that the annotation does not depend on PAN2HGENE, as the pipeline only uses the annotation generated by Prokka or PATRIC.

Problem 1
The figures generated as a result do not present information on the pangenome distribution.

Potential solution
This usually happens when few genomes are used in the comparative analysis. PAN2HGENE can perform comparative analysis with a minimum of 3 genomes. In general, when only 3 or 4 genomes are used, the pangenome result has practically no distribution information. Thus, the solution is to add more genomes to the analysis, remembering that PAN2HGENE does not impose a maximum number of genomes.

Problem 2
PAN2HGENE never reaches the end of the analysis; it just keeps processing.

Potential solution
The computational cost of comparative analysis grows steeply with input size: the more genomes are included, the greater the cost of performing it. The computer used may simply not be able to process the analysis. In this case, the solution is to first run an analysis with fewer genomes, to confirm that PAN2HGENE is working correctly, and then use a more powerful computer for the analysis with all the selected genomes. For example, on a desktop with a fourth-generation Core i5 processor and 16 GB of RAM, the complete comparative analysis of 10 genomes took approximately one and a half hours. On the same computer, the complete run for 20 genomes took approximately 4 h and 20 min.
Note: Problems 1 and 2 are related to the number of genomes used and to the hardware capacity, respectively, and not to the PAN2HGENE software.
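To get a rough idea of whether a planned run is feasible, the two runtimes reported for Problem 2 (about 90 min for 10 genomes and about 260 min for 20 genomes) can be extrapolated. The sketch below assumes, purely for illustration, that runtime follows a power law t = a * n^b. This model is an assumption fitted through only two points on one specific computer, so its output is an order-of-magnitude guide and not a measured result.

```python
import math

# Runtimes in minutes reported in the text for the reference desktop.
OBSERVED = {10: 90.0, 20: 260.0}


def fit_power_law(points):
    """Fit t = a * n**b through exactly two (n_genomes, minutes) observations."""
    (n1, t1), (n2, t2) = sorted(points.items())
    b = math.log(t2 / t1) / math.log(n2 / n1)
    a = t1 / n1 ** b
    return a, b


def estimate_minutes(n_genomes, a, b):
    """Estimated runtime in minutes under the assumed power-law model."""
    return a * n_genomes ** b


a, b = fit_power_law(OBSERVED)
print(f"fitted exponent b = {b:.2f}")
print(f"estimated runtime for 40 genomes: {estimate_minutes(40, a, b) / 60:.1f} h")
```

With these two observations, doubling the number of genomes multiplies the estimated runtime by 260/90, roughly 2.9. Actual runtimes will also depend on genome size, number of contigs, and available memory.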

Lead contact
Further information and requests should be directed to the lead contact, Allan Veras (allanveras@ufpa.br).

Data and code availability
The published article includes the code generated. All datasets used are available at NCBI and are described in the key resources table.

DECLARATION OF INTERESTS
The authors declare no competing interests.