pygen-structures: A Python package to generate 3D molecular structures for simulations using the CHARMM forcefield

Forcefields used in molecular dynamics (MD) or Monte Carlo simulations (such as the CHARMM forcefield, Huang & MacKerell (2013)) require a great deal of information in order to ensure that the correct harmonic force constants are chosen for particular bonds or angles. As such, protein structure files (PSF files), containing information about connectivity and relevant parameters, are often used by simulation packages to contain this information. These files are typically generated using psfgen, (Ribeiro et al., 2019) a Tcl package bundled with VMD (Humphrey, Dalke, & Schulten, 1996) which requires a fully annotated Protein Data Bank (PDB) format file (wwPDB, 2012) with correct names for atoms and residues, as defined in the forcefield. This annotation is challenging to do in an automated fashion. For structures from the Protein Data Bank itself, this is expected to some degree, as experimentally resolved structures are often missing the coordinates of certain residues (Brünger & Nilges, 1993) and can contain bound ligands which may not be parameterised in the forcefield. At present, the normal solution to this problem is to download a structure from the database, manually fix residue and atom names, and generate a PSF file using psfgen. In other words, obtaining a 3D structure, annotating the structure and generating an input file in the correct format for a simulation package are treated as separate tasks.


Statement of Need
Forcefields used in molecular dynamics (MD) or Monte Carlo simulations (such as the CHARMM forcefield, Huang & MacKerell (2013)) require a great deal of information in order to ensure that the correct harmonic force constants are chosen for particular bonds or angles. As such, protein structure files (PSF files), containing information about connectivity and relevant parameters, are often used by simulation packages to contain this information. These files are typically generated using psfgen, (Ribeiro et al., 2019) a Tcl package bundled with VMD (Humphrey, Dalke, & Schulten, 1996) which requires a fully annotated Protein Data Bank (PDB) format file (wwPDB, 2012) with correct names for atoms and residues, as defined in the forcefield. This annotation is challenging to do in an automated fashion. For structures from the Protein Data Bank itself, this is expected to some degree, as experimentally resolved structures are often missing the coordinates of certain residues (Brünger & Nilges, 1993) and can contain bound ligands which may not be parameterised in the forcefield. At present, the normal solution to this problem is to download a structure from the database, manually fix residue and atom names, and generate a PSF file using psfgen. In other words, obtaining a 3D structure, annotating the structure and generating an input file in the correct format for a simulation package are treated as separate tasks.
For simulations involving combinatorial searches of smaller sequences, however, the level of human intervention required can be far greater. There may be no experimentally determined structures for a given sequence, so these must be generated. Existing structure generation tools must be applicable to more general chemistry than molecular dynamics forcefields, and as a result are far less specific about the identities of particular atoms. As such, these packages often leave residue and atom names labelled as unknowns or element symbols. Additionally, if structures are generated based only on connectivity or drawn by the user, it can be challenging to ensure that tetrahedral chirality (the handedness of carbon centres with four different substituents) is properly respected. A structure generation method for small molecules which is aware of the forcefield is necessary in order to automate this process. The workflow which pygen-structures hopes to automate is shown in Figure 1. As with all structure generation methods, applicability is limited to cases where secondary structure is unimportant. Generation of 3D coordinates is currently limited to structures with around 15 protein residues.

Summary
pygen-structures is an open source (3-clause BSD license) Python library to generate 3 dimensional structures for molecules from the CHARMM forcefield. Coordinates are generated using an embedding method based on empirical data (Riniker & Landrum (2015), as implemented in the RDKit, The RDKit Contributors (n.d.)), and are written out as standard PSF and PDB files. The package contains convenience functions for generating molecules from a collection of CHARMM residues and patches, or from one letter amino acid codes (usable by calling pygen-structures as a command line application) and a series of classes and functions for representing and manipulating CHARMM data. The chirality of tetrahedral centres is set using internal coordinate data from the residue topology file.
Classes provided by pygen-structures include representations of CHARMM residue topol-ogy files, which contain information about residues, their atoms and their connectivity; and parameter files, containing the relevant force constants involving sets of atoms. There are also classes which describe residues and patches from topology files, and representations of atoms and molecules. Residues from the forcefield (and molecules created from those residues) can be represented as RDKit Mols, enabling pattern matching of residues to molecules.

Demonstrating the Benefits of pygen-structures
To illustrate the current necessity of human involvement in annotation of small-molecule structures, it is useful to examine structures which are created by other software packages. The following is a PDB file generated by drawing the structure of phenylalanine in MarvinSketch 20.3 and cleaning the structure in 3D. This is representative of many chemical drawing packages, which aim to work with as many areas of general chemistry as possible. Similar behaviour is observed in PerkinElmer's ChemOffice suite (version 19), (PerkinElmer Informatics, 2019) which does not write out residue names upon PDB export. Avogadro 1.2.0 (Hanwell et al., 2012) and PyMol 2.3.0 (Schrödinger LLC, 2019) cope better with biological molecules, and annotate the residue and atom names for protein residues correctly according to the PDB standard, but fail to differentiate between D and L isomers. The level of annotation is greater, but hydrogen atom positions are not added (were these to be added using Chem.AddHs, they would be incorrectly annotated) and the atom names do not match the CHARMM atom names exactly (though they do follow the standard PDB naming conventions, so it would be trivial to correct them). psfgen is capable of assigning the positions of hydrogen atoms and setting the correct charge states, so this approach works reasonably well for simple L-amino or D-amino sequences. It is, however, nontrivial to generate mixtures of D and L amino acids, as acid chirality is set using a flag argument to Chem.MolFromSequence. As with Avogadro and PyMol, D-amino acids are labelled in the same way as the corresponding L-amino acid.
Hesketh, T., (2020). pygen-structures: A Python package to generate 3D molecular structures for simulations using the CHARMM forcefield. Journal of Open Source Software, 5(48), 2157. https://doi.org/10.21105/joss.02157 With pygen-structures, it is easily possible to generate the structures of free amino acids, peptides, or more complex structures such as glycans and glycopeptides, in an automated fashion. A significant advantage is that it is no longer necessary to use psfgen to generate the PSF file. This greatly simplifies combinatorial searches of small molecule sequences. A downside to this approach is that knowledge of the available CHARMM residues and the required patches is necessary, but this is true of psfgen as well. Some initial work is always required to discover whether CHARMM can represent a molecule of interest. This could be improved through use of automated pattern matching from provided structures, but is challenging due to the incomplete nature of X-ray crystal data.

Future Work
pygen-structures is not yet a fully featured replacement for psfgen. In particular, the present structure generation method fails to fill coordinates for missing residues or missing atoms of large structures (making it impractical to use with protein structures from the PDB).
In future, the aim is to support all of psfgen's current functionality. Further advantageous features would include the generation of simulation boxes containing water, the ability to handle multiple molecules in a single system, and an automatic bead-generating method for coarse-grained simulations.
Further documentation of the polymeric structures (and relevant linkages) which are already parameterised in the CHARMM forcefield would be advantageous to encourage further combinatorial work.
Unit and integration tests (using pytest, Krekel et al. (2004)) are supplied to test the functionality provided by the package. Tests can be run using pytest --pyargs pygen_struc tures. Tests rely upon OpenMM (Eastman et al., 2017) as an additional dependency. Python 3.6 and 3.7 are supported, and pip can be used for installation: pip install pyge n-structures.