Sdfconf: A Novel, Flexible, and Robust Molecular Data Management Tool

Projects in chemo- and bioinformatics often consist of scattered data in various types and are difficult to access in a meaningful way for efficient data analysis. Data is usually too diverse to be even manipulated effectively. Sdfconf is data manipulation and analysis software to address this problem in a logical and robust manner. Other software commonly used for such tasks are either not designed with molecular and/or conformational data in mind or provide only a narrow set of tasks to be accomplished. Furthermore, many tools are only available within commercial software packages. Sdfconf is a flexible, robust, and free-of-charge tool for linking data from various sources for meaningful and efficient manipulation and analysis of molecule data sets. Sdfconf packages molecular structures and metadata into a complete ensemble, from which one can access both the whole data set and individual molecules and/or conformations. In this software note, we offer some practical examples of the utilization of sdfconf.

Introduction
sdfconf is used to do things that are otherwise difficult to do. The motivation for sdfconf was the absence of software tools that could store and analyze multiconformational data for small molecules. Accordingly, the choice of the biologically relevant ligand conformation within its binding site in a protein was based on single data, or command line argumentation.
sdfconf is a software package written in python. Its purpose is to enable an easy, flexible and versatile way to handle, manipulate and analyze conformational data of molecules. It mainly works with SDfiles (.sdf), but some operations also work with .mol2 and .pdb files. In practice, any data stored in a mol2 file can be imported: atom types, partial charges, etc. The main feature of sdfconf is using mixed, molecule- or conformation-wise data from mixed sources to organize and select molecules. It can present results as an SDfile, two flavours of .csv or simple counts of remaining molecules. It can also plot histograms in 1D and 2D (heatmap) or scatter plots.

How to use sdfconf
In general, sdfconf is rather simple and straightforward to use when considering the variety of possible operations. Generally, complex results require complex commands. When using sdfconf, the default help offers useful explanations for different options. All functions of sdfconf can be divided into a few important categories. The categories and their respective options are:
• input*: SDfile to manipulate/analyze with the software
  ○ input
• augmentation: Addition of data from a number of sources
  ○ combine, allcombine, addcsv, addatomiccsv, getmol2
• calculation: Calculation of information from existing data
  ○ closestatoms, closeratoms, makenewmeta, addescape, addinside
• conformations: Manage conformation numbers
  ○ conftometa, conftoname, removeconfname, removeconfmeta
• manipulation: Changing and removing of data
  ○ nametometa, metatoname, changemeta, stripbutmeta, sortorder, removemeta, pickmeta
• chopping: Using the extract option to remove undesired conformations
  ○ extract, allcut, cut
• plotting: Plotting of histograms
  ○ histogram, scatter
• output type*: What to output? sdf, csv, conformation counts, metanames, nothing?
  ○ getcsv, getatomcsv, metalist, counts, donotprint
• output*: Where to output? Overwrite, file, screen?
  ○ output, overwrite
• output options: Make folders or multiple files?
  ○ split, makefolder
• options: Various options on the run, like verbose, ignores and proportion.
  ○ verbose, ignores, proportion, config
• injection: Inject data to external files
  ○ putmol2
These are just arbitrary groups of actions that aren't explicitly present in the program, but make the software more approachable. Note that not all runs have all of these; only those marked with * are present in every run in some form. Important and complex options are discussed in more depth in section Examples.

Installation
Currently, sdfconf is packaged as an rpm for linux and as an executable installer for windows. A universal python wheel distribution and a source distribution are also available. To use the rpm distribution, use the package manager available for your linux distribution. Installing the wheel distribution or the source distribution requires python setuptools. The wheel may be installed using the python package installer pip. To install from source, extract the tar, change directory to the package root and run python setup.py install. Note that there is no uninstaller for the source installation. Using the windows installer should be rather straightforward. After installing sdfconf, the command "sdfconf" should be available. sdfconf may also be used without installing by running <package_folder>/bin/sdfconf. sdfconf requires the following packages to be installed: numpy (1.7.1 or later) and matplotlib (1.4.2 or later).

Metadata
sdfconf can store, manipulate and utilize molecule-, conformation- and atom-wise data in a form referred to as metadata. All stored metadata has a name and some sort of value. It can never be without a value; if it is, it does not exist at all. Temporary metadata may exist without a name.

Datatype
sdfconf supports three basic types of data: string, integer and float.

String
Metadata may include strings. They can basically include anything. Internal python type is "str"

Integer
Integer numbers. Whole numbers, positive and negative. Internal python type is "int".

Float
Floating point numbers. All real numbers, positive and negative. Internal python type is "float".

Datastructure
In plural datastructures all entries must have the same datatype. If a list includes [1, 2, 3.1], datatype is float. If a list includes [1, 3, oxygen], type is string. Casting to different types is done automatically.
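
The automatic casting described above can be pictured with a minimal python sketch. It is not sdfconf's own code: the function names common_datatype and cast_all are hypothetical, and the sketch assumes the raw entries arrive as text, as they would when read from a file.

```python
def common_datatype(values):
    """Return the narrowest of int, float and str that can hold every entry."""
    order = [int, float, str]  # from most to least restrictive

    def kind(text):
        # Try the strictest numeric type first, then fall back to str.
        for caster in (int, float):
            try:
                caster(text)
                return caster
            except ValueError:
                continue
        return str

    return max((kind(str(v)) for v in values), key=order.index)


def cast_all(values):
    """Cast every entry to the common datatype of the whole structure."""
    target = common_datatype(values)
    return [target(str(v)) for v in values]
```

With this model, cast_all(["1", "2", "3.1"]) gives a list of floats, while cast_all(["1", "3", "oxygen"]) keeps every entry as a string.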

Single
Singular data. One number or string. It is effectively a list with one item.

List
List of data. A number of data entries under one name. For example list of integers might express certain desired or undesired atom numbers.

Dictionary
Dictionary is a list of paired data. Every entry has a key and a value. For example atom number and partial charge of said atom. Internal python type is "collections.OrderedDict".

Metastatements
Many functions of sdfconf are based on so called metastatements. They are logic/arithmetic statements that are used to define properties for molecules and conformations. These are in many ways the point of sdfconf and are worth learning, despite being quite difficult at first. Experience in scripting languages like python or numeric calculation software like matlab may help.

Calling stored meta
One may refer to any metadata stored in a molecule simply by its name. For example, if an SDfile includes the lines "> <ID>" and "benzene", one would refer to said meta by typing "ID", and it would return the singular string "benzene" for said conformation. For other molecules it would return the corresponding value.
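
To make the link between SDfile data items and metanames concrete, here is a small python sketch that collects the data items of one SD record into a dictionary keyed by name. It is only an illustration of the file layout (a "> <NAME>" header line, value lines, then a blank line); read_data_items is a hypothetical helper and not part of sdfconf.

```python
import re

# A data header starts with ">" and carries the field name in angle brackets.
HEADER = re.compile(r"^>.*<([^>]+)>")


def read_data_items(lines):
    """Collect the '> <NAME>' data items of one SD record into a dict."""
    items = {}
    name = None
    for raw in lines:
        line = raw.rstrip("\n")
        header = HEADER.match(line)
        if header:
            name = header.group(1)
            items[name] = []
        elif line.strip() == "" or line.startswith("$$$$"):
            name = None  # a blank line or the record delimiter ends the item
        elif name is not None:
            items[name].append(line)
    return {key: "\n".join(value) for key, value in items.items()}
```

For a record containing "> <ID>" followed by "benzene", the lookup read_data_items(record)["ID"] returns "benzene", which mirrors referring to the meta by its name.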

Reserved names
All function names specified later in Functions section are reserved. Creation of such metadata is possible, but it's impossible to refer to them in metastatements. There are also certain names that are created by some routines of sdfconf, but they are just conflicting metanames.

(Partly )nonexistent names
If a query uses a metaname that doesn't exist on any or some conformations, nothing is returned for these conformations. Usually this leads to warnings about non-created metadata. If used in a logical comparison, such values give a logical false result.

Names with operators
If a metaname includes an arithmetic operator, sdfconf tries to interpret it. For example, if a conformation has metas with names "a", "b" and "a-b", the query "a-b" would actually result in "a"-"b". Operators may be escaped with a preceding backslash, so in the previous example one would write "a\-b".

Literal values
Numbers in statements generate a metadata including that singular number. Strings may be generated by enclosing a string in quotes. More on quotes in section "Quotation marks".

Parenthesis
In following sections let's assume following metas: • meta1

Parentheses ()
Parentheses have four meanings in sdfconf. The first is to prioritize operations, the second is to slice selected parts of dictionary type meta, the third is to make logical type slices from any meta and the fourth is to enclose function arguments.

Dictionary slice
It is possible to pick selected parts of dictionary by slicing it with explicit keys or meta of list type.

Argument enclosure
There are a number of built-in functions that may be used with the syntax "func(meta)", for example "sum(meta2)" would yield 9, while "max(meta1)" would yield 5.5. More in section Functions.

Curly brackets {}
Curly brackets can only be used with dictionary style meta. They have two meanings.

Slicing dictionary keys
The first use for curly brackets is slicing keys from dictionaries. They work just like square brackets in slicing, but return dictionary keys instead of values.

Logical slicing by keys
Secondly, curly brackets can be used in logical slicing of dictionaries by keys. They work like parentheses in logical slicing, but comparisons are made against dictionary keys. For more about logical metaslicing, see section "Comparison, metaslicing".

Quotation marks '' ""
Queries inside quotation marks will always generate a new meta including a literal string inside the quotes.

Operators
Operators can be used to make combinations of different meta. With list type meta, it is only possible to make operations with other lists of the same length or with singular meta. Operations between applicable lists are performed between single items of the lists, in the order of appearance. If one of the two meta is singular, it affects all elements of the list. When doing operations with dictionary type meta, only operations with singular data or other dictionary type meta are possible. With two dictionaries, operations are performed between elements of each dictionary with the same keys. Keys without a pair in the other dictionary are omitted. Keys of the resulting meta are in the order of the first dictionary. If one of the meta is singular, it affects all values of the dictionary. Operations between singular meta are straightforward operations between two values. In all of the cases, singular meta may be a single type meta or a list with one value.
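
The pairing rules above can be modelled with a short python sketch. It is a hypothetical illustration rather than sdfconf's implementation: the function combine is an assumed name, list type meta is modelled as a python list, singular meta as a list with one item and dictionary type meta as an OrderedDict.

```python
import operator
from collections import OrderedDict


def combine(left, right, op=operator.add):
    """Apply op between two meta modelled as lists or OrderedDicts."""
    left_is_dict = isinstance(left, dict)
    right_is_dict = isinstance(right, dict)
    if left_is_dict and right_is_dict:
        # Only keys present in both survive, in the order of the first dictionary.
        return OrderedDict(
            (key, op(value, right[key])) for key, value in left.items() if key in right
        )
    if left_is_dict or right_is_dict:
        other = right if left_is_dict else left
        if len(other) != 1:
            raise ValueError("a dictionary combines only with a dictionary or singular meta")
        if left_is_dict:
            return OrderedDict((key, op(value, other[0])) for key, value in left.items())
        return OrderedDict((key, op(other[0], value)) for key, value in right.items())
    if len(left) == len(right):
        # Item by item, in the order of appearance.
        return [op(a, b) for a, b in zip(left, right)]
    if len(right) == 1:
        return [op(a, right[0]) for a in left]
    if len(left) == 1:
        return [op(left[0], b) for b in right]
    raise ValueError("lists must have equal length unless one of them is singular")
```

For example, combine([1, 2, 3], [2], operator.mul) multiplies every list element by the singular value, and two dictionaries are combined only over their shared keys.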

Joining and cutting ++ --
Joining and cutting are used to combine two metadata into one. With list and singular type meta, joining just concatenates the lists together. Cutting in lists, on the other hand, removes all items from the first list that are present in the last one.
In the case of dictionary type meta, joining adds those keys from the last meta to the first that are not already present. A cut between dictionary type meta removes keys from the first that are present in the last.
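
A minimal python sketch of the joining and cutting semantics, under the same modelling assumptions (lists for list type meta, OrderedDict for dictionary type meta). The names join and cut are hypothetical and only mirror the ++ and -- operators described in this section.

```python
from collections import OrderedDict


def join(first, last):
    """Model of '++': concatenate lists, or add missing keys to a dictionary."""
    if isinstance(first, dict):
        merged = OrderedDict(first)
        for key, value in last.items():
            merged.setdefault(key, value)  # keys already present keep their value
        return merged
    return list(first) + list(last)


def cut(first, last):
    """Model of '--': drop items (or keys) of first that are present in last."""
    if isinstance(first, dict):
        return OrderedDict((k, v) for k, v in first.items() if k not in last)
    return [item for item in first if item not in last]
```

Note that in this model joining lists keeps duplicates, since the text describes it as plain concatenation.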

Functions
Functions are called by "func(metastatement)". They return a meta themselves, so they are also metastatements.

Sorting
Sorting functions change the order of elements in meta. Sorting of lists is very straightforward; with dictionaries, keys are sorted based on the values.
asc
asc sorts meta to ascending order.
des
des sorts meta to descending order.
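
As an illustration of sorting dictionary keys by their values, the following python sketch models asc and des with one hypothetical function, sort_meta. It assumes list type meta is a python list and dictionary type meta an OrderedDict.

```python
from collections import OrderedDict


def sort_meta(meta, descending=False):
    """Sort a list by value, or reorder dictionary entries by their values."""
    if isinstance(meta, dict):
        pairs = sorted(meta.items(), key=lambda pair: pair[1], reverse=descending)
        return OrderedDict(pairs)
    return sorted(meta, reverse=descending)
```

Sorting a dictionary of atom numbers and distances in ascending order therefore puts the atom with the smallest distance first.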

Statistical
There are two kinds of statistical functions. max, min and avg return the desired result of a single meta, whereas mmax, mmin and mavg return the result for the desired meta from the whole set of conformations with the same name.
rdup
rdup returns the given metastatement, but removes duplicates from the output.
confcol
confcol returns a requested column of the conformation atom block for each conformation. The returned meta is a dictionary type meta, whose key is the atom number and whose value is the requested column for said atom in the atom block. If the argument is an integer, one may obtain numeric coordinates, etc. If the argument is a list, one gets a string for each atom combining the requested columns. Usually the motivation to use confcol is to obtain atom types, chiralities, etc.
getmeta
getmeta allows the user to call a stored meta with the same name as the given string type meta. This enables the user to call "dynamic" metanames.

Metaslicing
Metastatements may include traditional (python-style) index-based slicing of lists and dictionaries. One may also perform logical slicing of meta based on the values of the meta. In metastatements the slicing values are also cast into meta, and they may be simple singular, explicit numeric values. In this case all values in the meta are compared to the slicing value, and only those fulfilling the comparison condition are selected to the new, sliced meta.
If slicing values are not singular, in the case of list meta, the slicing value must be of the same length as the meta to be sliced. Then the values of the two lists are compared to each other in the order of appearance. In the case of dictionaries, if the slicing meta is not singular, it must be of dictionary type and must contain all the keys present in the meta to be sliced. Then the values of the two dictionaries are compared key-wise.
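
The rules for logical slicing can be sketched in python as follows. This is a hypothetical model, not sdfconf's code: logical_slice is an assumed name, singular slicing values are modelled as lists with one item, and the sketch assumes that the value of the sliced meta stands on the left-hand side of the comparison.

```python
import operator
from collections import OrderedDict


def logical_slice(meta, op, selector):
    """Keep the entries of meta whose values fulfil 'value op selector value'."""
    if isinstance(meta, dict):
        if isinstance(selector, dict):
            # Key-wise comparison; the selector must contain every key of meta.
            return OrderedDict((k, v) for k, v in meta.items() if op(v, selector[k]))
        if len(selector) != 1:
            raise ValueError("a dictionary is sliced by a dictionary or a singular value")
        return OrderedDict((k, v) for k, v in meta.items() if op(v, selector[0]))
    if len(selector) == 1:
        # A singular slicing value is compared to every element.
        return [v for v in meta if op(v, selector[0])]
    if len(selector) != len(meta):
        raise ValueError("a list selector must have the same length as the sliced meta")
    # Two lists are compared in the order of appearance.
    return [v for v, s in zip(meta, selector) if op(v, s)]
```

For instance, logical_slice([1, 5, 9], operator.lt, [6]) keeps the values smaller than 6.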

(Non/In-)Equality == != >= <= < >
The comparisons are standard equality, non-equality and inequality. They are all treated the same way, except that equality and non-equality can perform string comparisons with regular expressions, which are described in section Regular expressions.

Metacomparison
Metacomparisons are not part of metastatements per se; rather, they are comparisons that contain two metastatements separated by a comparison operator. As opposed to metaslicing comparisons are always related to certain meta generated by a metastatement.
The purpose of metacomparisons is to select desired conformations from a group of molecules and conformations.
Quite unorthodoxly, metacomparisons on non-singular meta are NOT symmetric. Equality and non-equality comparisons are symmetric, but inequalities are not. Symmetry is described further in the specific sections on equalities. In general it is impossible to compare dictionaries with non-singular, list type meta. Metacomparisons are used for example by the --extract and --proportion options. It is possible to invert the positive and negative results of a comparison by adding "-" to the beginning of the comparison string. It is possible to explicitly use the default behaviour by adding "+" to the beginning of the comparison, but it is not required.

(Non-)Equality == !=
When comparing meta for equality or non-equality, a positive result is triggered if even a single pair in two meta fulfils the condition. Therefore, not all elements of meta are required to be (non-)equal.

Inequality >= <= < >
When comparing meta with inequalities, one must note that comparisons are not symmetric. Therefore a<b might yield a different result than b>a. When meta are non-singular, the logic for a positive result is the following: "At least one element on the right hand side must fulfil the condition for all elements on the left hand side." For example, 5>[3,5,7] is positive because 3 on the right side is smaller than all elements on the left side.
On the other hand, [3,5,7]<5 yields a negative result, because not all elements on the left side are smaller than at least one on the right side. Again, [3,5,7]<[5,10] yields a positive result, because all elements on the left side are smaller than one (10) on the right side.
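
The quoted rule translates directly into a nested any/all test. The python sketch below is an illustration of that rule only; metacompare is a hypothetical name and both sides are modelled as lists, with singular meta as a list with one item.

```python
import operator


def metacompare(left, op, right):
    """Positive if some right-hand element fulfils op for every left-hand element."""
    return any(all(op(l, r) for l in left) for r in right)


# The three cases discussed in the text:
five_greater = metacompare([5], operator.gt, [3, 5, 7])      # 5 > [3,5,7]
list_less_five = metacompare([3, 5, 7], operator.lt, [5])    # [3,5,7] < 5
list_less_list = metacompare([3, 5, 7], operator.lt, [5, 10])  # [3,5,7] < [5,10]
```

Here five_greater and list_less_list are true while list_less_five is false, which reproduces the asymmetry between 5>[3,5,7] and [3,5,7]<5.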

Regular expressions
One may create strings that are treated as regex in comparisons. Such a meta value should be of the form "REGEX:yourregex", where yourregex is the desired regular expression. REGEX must be in capital letters. Regex comparisons are only interpreted in equality and non-equality comparisons.
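
A possible way to picture the "REGEX:" prefix is the python sketch below. It is an assumption-laden illustration: strings_equal is a hypothetical helper, and whether sdfconf searches for the pattern anywhere in the value or requires a full match is not stated in the text, so the sketch simply uses re.search.

```python
import re

PREFIX = "REGEX:"  # must be in capital letters


def strings_equal(value, pattern):
    """Compare a string to a literal, or to a regex when pattern starts with REGEX:."""
    if pattern.startswith(PREFIX):
        # Assumption: the pattern may match anywhere unless it is anchored.
        return re.search(pattern[len(PREFIX):], value) is not None
    return value == pattern
```

With this model, strings_equal("benzene", "REGEX:^benz") is positive, whereas a lowercase "regex:" prefix is treated as part of an ordinary string.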

Conformation numbers
sdfconf always assigns conformation numbers to all molecules. A single conformation is identified by its name and conformation number; therefore conformations with different names may have the same conformation numbers. sdfconf may store conformation numbers explicitly in two ways. The first way is in the name field of the molecule, with the conformation number enclosed in curly and square brackets, for example "phenytoin{[01]}". If an SDfile loaded into sdfconf has these kinds of names, the given conformation numbers are used. The second way to store conformation numbers is in a metafield "confnum". If conformations have a metafield with this name and it includes a singular value, it is used as the conformation number. If a molecule has both values, sdfconf raises a warning message and the conformation number in the name is used.
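
The naming convention and the precedence rule can be sketched in python as follows. Both functions are hypothetical helpers written for illustration; only the "name{[NN]}" form, the "confnum" metafield and the rule that the number in the name wins come from the text.

```python
import re
import warnings

# Name followed by a conformation number enclosed in curly and square brackets.
CONF_NAME = re.compile(r"^(?P<name>.*)\{\[(?P<confnum>\d+)\]\}$")


def split_conformation_name(name):
    """Return (plain name, conformation number or None)."""
    match = CONF_NAME.match(name)
    if match is None:
        return name, None
    return match.group("name"), int(match.group("confnum"))


def resolve_confnum(name, meta):
    """Pick the conformation number; the one in the name wins over 'confnum'."""
    plain, from_name = split_conformation_name(name)
    from_meta = meta.get("confnum")
    if from_name is not None and from_meta is not None:
        warnings.warn("conformation number given in both name and confnum")
    number = from_name if from_name is not None else from_meta
    return plain, number
```

For example, split_conformation_name("phenytoin{[01]}") separates the name "phenytoin" from conformation number 1.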

--extract metachop
Extract is used to select or drop conformations from an SDfile. metachop can have a few different forms:
• metacomparison: metastatement logic metastatement
  ○ my_docking_score >= 100
    ■ metafield "my_docking_score" has a value equal to or greater than 100
  ○ my_docking_score >= 0.5*mmax(my_docking_score)
    ■ metafield "my_docking_score" has a value at least half of the greatest value of "my_docking_score" occurring in conformations with the same name
  ○ 5.0 < poi_distance(atoms_of_interest)
    ■ "poi_distance" is created with the "closestatoms" option. "atoms_of_interest" is a list of atom numbers that are somehow interesting. Conformations that have at least one interesting atom with distance less than 5.0 to the point of interest are picked. (Note that if the comparison was the other way around, conformations whose all interesting atoms were closer than 5.0 to the poi would be picked.)
• part: mima(metastatement, amount)
  ○ MAX(my_docking_score, 3)
    ■ Three conformations with the largest values of "my_docking_score" are picked
• unique: !metastatement
  ○ !best_atom_from_test
    ■ "best_atom_from_test" is a single atom number. Molecules with a duplicate value of this atom number are dropped.
  ○ !asc(poi_distance(atoms_of_interest)){0}
    ■ conformations with a unique closest atom number to the poi are picked
With metacomparison one defines two metastatements and compares them with the selected logical operator. Those that yield a positive result are picked. With part, one uses one of two functions, MAX or MIN, to pick conformations with either the largest or the smallest values of the given metastatement. The amount is either an absolute number of conformations or a percentage of conformations. Unique removes molecules that have duplicate values of the given metastatement. The first occurrence of a certain value gets picked. Note that you may give multiple, separate, logical statements, resulting in "logical and"-like behaviour. You may also combine multiple logical statements with the boolean operations "and" and "or", written as "&" and "|" respectively.
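
The part and unique forms can be illustrated with a short python sketch. It is a simplified model under several assumptions: conformations are plain dictionaries, pick_part and pick_unique are hypothetical names, the amount is an absolute number (percentages are left out), the selection runs over one flat list, and picked conformations are returned in their original order.

```python
def pick_part(conformations, key, amount, largest=True):
    """Model of MAX/MIN: keep the 'amount' conformations with extreme key values."""
    ranked = sorted(
        range(len(conformations)),
        key=lambda index: key(conformations[index]),
        reverse=largest,
    )[:amount]
    return [conformations[index] for index in sorted(ranked)]


def pick_unique(conformations, key):
    """Model of '!': keep only the first occurrence of each key value."""
    seen = set()
    picked = []
    for conformation in conformations:
        value = key(conformation)
        if value not in seen:
            seen.add(value)
            picked.append(conformation)
    return picked
```

Calling pick_part with amount=3 and a key that reads "my_docking_score" corresponds to MAX(my_docking_score, 3) in this model.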

Import Export
sdfconf can import new metadata into an SDfile from another SDfile (combine) or a comma- or tab-separated value file (csv) (import). In either case only data with matching names in both files will be imported. By default, names should be in the leftmost column. Names may also include conformation numbers, which are discussed in section [Features/Conformation numbers]. Conformation numbers may also be given in a specified column. If conformation numbers are not provided in the imported data, data will be applied to molecules with the same name, whereas if they are provided, data will be applied to molecules with the same name and conformation number.
When importing from a normal csv-file, single values may appear as they are. In the case of lists and dictionaries, data must be in quotes (" or '), with values separated with a comma or semicolon (, or ;). Dictionaries include a key and a keyed value in place of a single value. The key and keyed value are separated with a colon (:).
In the case of atomic csv-files, one column must specify atom numbers, which by default is the column named "atom_number". Atomic csv-files are used to import atom-wise data. A single line should specify the molecule name, a possible conformation number and the atom number. All atom numbers for a single conformation are collected and a dictionary type meta is generated for the conformation.
Metadata may also be exported to a csv-file. The format is similar to that of imported csv-files. One should provide a string containing a list of metanames (separated by ','). Note that these cannot be metastatements. In the list of metanames '?' means 'all but'. For example '?,mydockingscore' would return a csv with all metadata except 'mydockingscore'. It is also possible to export an atomic csv file, which has a line for each atom of each conformation. The atom number is specified as the first column and information in dictionaries is applied to the lines atom number wise. Other meta is applied to all lines.
Currently it is not possible to import such .csv-files.
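
The cell format for lists and dictionaries described above can be sketched in python. The helper parse_cell is hypothetical and handles one cell at a time; it leaves every value as text, so casting to integer or float would be a separate step.

```python
import re
from collections import OrderedDict


def parse_cell(cell):
    """Turn one csv cell into a single value, a list or an OrderedDict."""
    text = cell.strip()
    quoted = len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'"
    if not quoted:
        return text  # single values may appear as they are
    # Values inside the quotes are separated with a comma or a semicolon.
    parts = [part.strip() for part in re.split(r"[,;]", text[1:-1]) if part.strip()]
    if parts and all(":" in part for part in parts):
        # key:value pairs make a dictionary, kept in the order of appearance.
        pairs = (part.split(":", 1) for part in parts)
        return OrderedDict((key.strip(), value.strip()) for key, value in pairs)
    return parts
```

A cell such as "7:3.45, 6:4.812" (in quotes) becomes a dictionary with atom numbers as keys, while '1;2;3' becomes a list.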

Cut
--cut path_to_file
--allcut path_to_file
Cut tools remove molecules or conformations from an SDfile. The standard one requires matching names and conformation numbers, while the all variant requires only matching names. It accepts an SDfile or csv-file as argument. The csv-file might have only one column, i.e. it can be a namelist text file.

Distance calculation
--closestatoms (xx, yy, zz)
Distance calculation tools give information about the distance of atoms to points of interest. There are two separate tools. The first is closestatoms, which requires one point of interest (x,y,z) and then calculates distances from that point to all atoms of conformations and adds a meta with the given name. The meta is of dictionary type and includes distances as values with atom numbers as keys. If metastatement interests is specified, distances are calculated only for the atom numbers specified in it. If no name is given, the meta is named "Closest_atoms". The second method is closeratoms, which in addition to the point of interest requires a meta that defines a list of atom numbers. The method calculates which of the defined atoms is closest to the point of interest and how many atoms there are that are closer to the point of interest. It generates the metafields "Closer_atoms_than_meta" and "Closest_atom_from_meta".
Overlap tools require one given .mol2- or .sdf-file, a selected molecule (actually treated as a set of points) from said file and a range (sphere radius). This information is used to generate a volume consisting of spheres with the given range, centered on the given points. The algorithm calculates the atoms for each conformation that are either on the inside (addinside) or outside (addescape) of the generated volume. Finally, atom numbers matching the query are returned. If max is given, the algorithm will stop when the given number of atoms is found. The value of max defaults to zero, which means infinite. A positive value of max tracks the number of atoms in the desired category, while a negative number tracks the opposite category. This tool can be used for example to determine if a ligand conformation resulting from docking overlaps with the protein. It can also be used to obtain information for selecting conformations that lie in a desired active zone of the protein.
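
The two distance tools can be modelled with a few lines of python. This is a sketch under assumptions, not sdfconf's implementation: closest_atoms and closer_atoms are hypothetical names, atoms are given as a plain dictionary of atom number to (x, y, z), and the result is ordered by ascending distance.

```python
import math
from collections import OrderedDict


def closest_atoms(atoms, point, interests=None):
    """Distances from point to atoms as {atom number: distance}, ascending."""
    distances = {}
    for number, xyz in atoms.items():
        if interests is not None and number not in interests:
            continue  # only atoms of interest are measured
        distances[number] = math.sqrt(sum((a - b) ** 2 for a, b in zip(xyz, point)))
    return OrderedDict(sorted(distances.items(), key=lambda pair: pair[1]))


def closer_atoms(distances, candidates):
    """Return (closest candidate atom, number of atoms closer than it)."""
    closest = min(candidates, key=lambda number: distances[number])
    closer = sum(1 for value in distances.values() if value < distances[closest])
    return closest, closer
```

The first function corresponds to the dictionary produced by closestatoms, and the second to the two values that closeratoms stores in its metafields.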

--sortorder
Sortorder sorts the molecules in SDfile in ascending or descending order, based on values of a given metastatement. Add < (ascending) or > (descending) to the beginning of the metastatement to define the sorting order.

Config-files --config config_file.txt
Config tool enables one to use pre-created scripts to run complex procedures in one run. The syntax for one line of a config file is the following:
option_name :: argument1 ;; argument2
where
• option_name is the short or long form name of an sdfconf option
• :: separates option_name from arguments
• argument1 and argument2 are separate strings, each including one argument for the given option
• ;; separates arguments from each other
The number of arguments may be zero or more. If there are no arguments, no :: is needed. ;; is given between arguments. No ;; is needed after the last argument.
The config option is run after injection options, or in other words those that add information to the run. Inside the config, options are processed in the given order. Therefore one may write multiple outputs between operations, etc. Note that the number of parameters might be limited by different options. Note that in config files there is an extra option "sdf" that changes the output back to SDfile. This is needed in the case of multiple outputs when there are other kinds of outputs before the SDfile.
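
The line syntax is simple enough to split with string methods. The python sketch below shows one way to do it; parse_config_line is a hypothetical helper, and stripping the whitespace around names and arguments is an assumption.

```python
def parse_config_line(line):
    """Split 'option_name :: argument1 ;; argument2' into (name, [arguments])."""
    name, separator, rest = line.partition("::")
    if not separator:
        return name.strip(), []  # an option without arguments needs no '::'
    return name.strip(), [argument.strip() for argument in rest.split(";;")]
```

For example, the line "option_name :: argument1 ;; argument2" yields the option name and a list of two arguments, while "verbose" alone yields an empty argument list.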

--stripbutmeta metastatement
Stripbutmeta interprets the given metastatement and removes atoms from conformations that are not defined in the list generated by the statement. This may be used to generate files that include atoms of interest for visualization purposes, etc.

--getmol2 pathto.mol2,column,metaname
--putmol2 input.mol2,output.mol2, column, metastatement, default, precision
One may extract data from an external .mol2-file with the option getmol2. The specified .mol2-file must contain exactly the same molecules as your .sdf, in the same order. The function adds a metafield with the specified name, which includes atom-wise data from the specified column of the .mol2-file. Column indexing starts from 0. With putmol2 one may read input.mol2 and replace data in the selected column in an atom-wise manner with data from the metastatement. The default value is used for atoms that are not specified in the metastatement. All of the injected data is written with the specified precision (number of decimals) to output.mol2. Therefore the original file is not altered.

With histogram it is possible to plot 1D histograms or 2D heatmaps. Metastatements define the values for points in the x and y directions. Titles can be used to define the names shown on the figure. args takes other parameters into account. For both plots one may define the bins variable, which defines the number of bins in the histogram. A 1D plot takes "bins=n" while a 2D plot takes "bins=[n,m]". Scatter makes 2D scatter plots, in which you may group data points with a third metastatement. Whether or not data is grouped, one may also plot a trend for the data and calculate the corresponding R² value. Note that normal output is also made in addition to plots. You may omit normal output with the option --donotprint.

--verbose
Verbose prints output on your run. The output is separate from sdf-related output to be written into file. Verbose output includes information on phases of running sdfconf, running times, etc.

Examples
closestatoms, allcombine, getcsv, conftometa, makenewmeta
The command
sdfconf manual_test_sdf_1.sdf -aco 1A2_sub.sdf -ctf -ca '(49.51, 47.21, 57.36), poi' -mnm 'poi5=poi(:5)' -gc 'ID,confnum,poi5'
would yield the following output:
ID confnum poi5
"NDEA" "2" "7:3.45, 6:4.812, 1:5.2848, 3:5.5616, 2:5.7138"
"NDEA" "1" "7:3.7296, 6:4.9214, 3:5.2113, 1:5.477, 2:6.3938"
"phenytoin" "4"
According to the order of execution, presented in General information, Order of execution, all presented options are run in a certain order.
1. input: manual_test_sdf_1.sdf is read
2. conftometa: conformation numbers in molecule names are read and added to field "confnum"
3. allcombine: meta in 1A2_sub.sdf are read and added name-wise (without conformation numbers) to molecules in the current run
4. closestatoms: for every conformation the distances of all atoms (hydrogens ignored) to the given point of interest are calculated. A dictionary meta with atom numbers as keys and distances as values is added to the given metafield "poi". The dictionary is in ascending order by poi distance.
5. makenewmeta: metafield poi is read and the five first appearing key-value pairs are sliced from it. This slice is added as metafield "poi5"
6. getcsv: a csv (or tab separated value) is generated with columns ID, confnum, poi5. "ID" is combined from 1A2_sub.sdf and "confnum" is generated from the molecule name by the conftometa option. poi5 was made with makenewmeta.
7. As no output file is specified, results are printed to standard output.
Note that only "input" is a positional argument. Input-file(s) must always be the first argument. All other arguments are optional and are executed in a predefined order, so their order in the given command is irrelevant. Now we don't create poi5 and the csv-file, but we write all conformations with added conformation numbers, combined data and calculated distances in sdf-format to manual_test_2.sdf. This file is stored in appendix C.

Use as a library
After you have installed sdfconf, you may import the package to a python interpreter like ipython or to external script to help in parsing and handling data.

Internal structures
The package sdfconf includes a few modules. In general their internal functions are quite hierarchical. In the following sections the modules and their most important classes are discussed.

runner
runner.py is responsible for handling the options given to sdfconf. It includes the runner class and the command line UI.

runner
Runner is in essence an interface between the user interface and sdfconf itself. It defines how certain options are used and in what order. A single runner instance may include a single instance of Sdffile, to which all operations are made. It also stores general run information like the state of the verbosity or proportion options. It is also responsible for all the messages generated by the verbose option.

sdf

Sdffile
Sdffile is a class that handles operations at the level of complete SDfiles. It contains a python dictionary (._dictomoles), whose keys are the common names of molecules in the file. Each key corresponds to another dictionary which contains all conformations of molecules with the same name. Each conformation has a distinct conformation number (distinct for each molecule name). Each conformation is represented by a single instance of the Sdfmole class. Sdffile also includes a list (._orderlist) of molecule name-conformation number pairs, and the order of that list describes in which order the molecules are in a file.
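
The nested layout described for Sdffile can be sketched with a small stand-in class. ConformerStore and its methods are hypothetical and written only to show how a dictionary of dictionaries and an order list work together; only the attribute names _dictomoles and _orderlist are taken from the description above.

```python
class ConformerStore:
    """Hypothetical stand-in for the nested layout described for Sdffile."""

    def __init__(self):
        self._dictomoles = {}  # name -> {conformation number -> conformation}
        self._orderlist = []   # [name, conformation number] pairs in file order

    def add(self, name, confnum, conformation):
        self._dictomoles.setdefault(name, {})[confnum] = conformation
        self._orderlist.append([name, confnum])

    def remove(self, name, confnum):
        del self._dictomoles[name][confnum]
        if not self._dictomoles[name]:
            del self._dictomoles[name]  # drop names without conformations
        self._orderlist.remove([name, confnum])

    def __iter__(self):
        # Yield conformations in the order they appear in the file.
        for name, confnum in self._orderlist:
            yield self._dictomoles[name][confnum]
```

The dictionary gives direct access by name and conformation number, while the order list preserves the file order even when conformations of different molecules are interleaved.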

Sdfmole
Sdfmole is responsible for storing the data of a single conformation. It stores in its own structures the data included in a standard V2000 Molfile. That includes the header block, counts line, atom block, bond block and properties block. All molecular data is initially stored in a "dumb" format, i.e. it is just stored as a list of strings. It also has a list of metanames that are associated with that conformation. Further, it includes a dictionary that holds said metanames and an Sdfmeta instance for each name. The Sdfmole class also contains methods that are used to manipulate single conformations. It also harbours the dreaded molelogic (and further tabiter) method that is mostly responsible for performing the magic of metastatements. Sdfmole along with Mol2Mol has the method atomsGenerator that is used in all structure based queries.

Sdfmeta
A single instance of the Sdfmeta class includes the data associated with a single datablock of an SDfile. So every datablock in an SDfile is stored in a single instance of Sdfmeta. Sdfmeta stores information on its datastructure, datatype, possible delimiters, etc. As a class it also contains methods for handling operations within and between Sdfmeta.

Mol2File
Mol2File is analogous to Sdffile, although it has only a few methods. It includes a list of Mol2Mol instances.

Mol2Mol
Mol2Mol is analogous to Sdfmole, but it has only a few methods and is much less organized. Mol2Mol along with Sdfmole has the method atomsGenerator that is used in structure based queries.

functions
Includes various functions used in other modules.

Findable
Findable is a helper class that includes a set of coordinates and can be used to check if some of them are within range of some other point. It can be used for example in checking the overlap of two molecules. It is approximately at least 10 times faster than brute force point to point comparison with datasets of size >100.
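
The text does not say how Findable achieves its speed-up. One common technique for this kind of range query is a uniform grid (cell list), sketched below in python. GridIndex is a hypothetical class shown as an illustration of that general technique, not as the actual Findable implementation.

```python
import math
from collections import defaultdict
from itertools import product


class GridIndex:
    """Hypothetical cell-list index; not the actual Findable implementation."""

    def __init__(self, points, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for point in points:
            self.cells[self._cell(point)].append(point)

    def _cell(self, point):
        # Integer cell coordinates of a point.
        return tuple(math.floor(c / self.cell_size) for c in point)

    def any_within(self, point, radius):
        """True if any stored point lies within radius of the given point."""
        reach = math.ceil(radius / self.cell_size)
        cx, cy, cz = self._cell(point)
        # Only cells that can contain points within the radius are visited.
        for dx, dy, dz in product(range(-reach, reach + 1), repeat=3):
            for other in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                if sum((a - b) ** 2 for a, b in zip(point, other)) <= radius ** 2:
                    return True
        return False
```

Because each query inspects only the neighbouring cells instead of every stored point, the cost per query stays roughly constant for evenly spread coordinates, which is the usual reason such an index beats brute force comparison.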