Sherlock—A Free and Open-Source System for the Computer-Assisted Structure Elucidation of Organic Compounds from NMR Data

The structure elucidation of small organic molecules (<1500 Dalton) through 1D and 2D nuclear magnetic resonance (NMR) data analysis is a potentially challenging, combinatorial problem. This publication presents Sherlock, a free and open-source Computer-Assisted Structure Elucidation (CASE) software where the user controls the chain of elementary operations through a versatile graphical user interface, including spectral peak picking, addition of automatically or user-defined structure constraints, structure generation, ranking and display of the solutions. A set of forty-five compounds was selected in order to illustrate the new possibilities offered to organic chemists by Sherlock for improving the reliability and traceability of structure elucidation results.


Introduction
NMR spectroscopy is the most widely used analytical method for the thorough identification of organic chemical compounds. Prior to de novo structure elucidation by NMR, a database lookup for an already known compound (dereplication) or for structure fragments has been included in several existing CASE systems, using the spectral fingerprint of one-dimensional (1D) NMR experiments as a search key. Through-bond proximity relationships between atoms became available with the possibility of routinely recording two-dimensional (2D) NMR spectra. The typical collection of NMR data used for computer-assisted structure elucidation (CASE) might contain 1D spectra such as ¹H, ¹³C and DEPT as well as the 2D HSQC, HMBC and COSY spectra. The HSQC experiment indicates direct bonds between heavy (non-hydrogen) atoms and protons, whereas HMBC and COSY spectra provide hints about long-range chemical shift correlations between protons and heavy atoms (HMBC) or between protons (COSY). Together with a given molecular formula (MF), mainly determined by mass spectrometry (MS), the substructures and restrictions derived from 1D and 2D NMR spectra form the cornerstone of structure generation by CASE systems. These constraints and the ones resulting from fragment search shrink the chemical search space, which would otherwise be too wide for practical applications [1][2][3][4][5][6][7][8].
Until now and to the best of our knowledge, only commercial and closed-source software packages exist that include both a frontend offering spectrum processing routines to obtain structural information from NMR data and a backend providing a suite of CASE-related routines.
To close this gap, we developed Sherlock, a CASE expert system for the easy identification of known organic compounds and, when necessary, for their de novo structure elucidation.

The Test Dataset
The performance of Sherlock was evaluated with 45 test data sets which are available as freely accessible archives (see Section 5.2).
An overview of the structure search settings and results is given in Table 1. The system presented here was able to handle and solve problems with up to 40 heavy atoms (test case 19). More is likely possible but was not evaluated. With the default settings, the generated lists of solutions contained the expected molecule in 37 of 45 cases. The last five problems (41 to 45) could not be solved within an acceptable time; in these cases, the structure search was manually interrupted after three hours of computing time. This does not necessarily mean that the CASE system is not able to solve them. Further analysis by NMR experts might lead to results based on additional user-defined constraints.
The resolution parameters had to be adjusted in three examples (5, 17, and 25) to produce the expected structure. For example, in test case 17 the automatic decision on whether to allow hetero-hetero bonds set that property to false because the occurrence of hetero-hetero atom bonds was lower than the threshold of 1%. Thus, the bond between the two nitrogen atoms was never formed. After allowing it manually, the expected solution was generated. In another example, test case 25, the detection routine proposed an sp² hybridization and another carbon as a mandatory neighbour for the carbon with a chemical shift of 131.4 ppm. Since this carbon is supposed to be located between nitrogen and sulphur, a carbon as a direct neighbour was not appropriate here. Additionally, the list of possible hybridizations should include sp. Finally, to enable the generation of the desired molecule, the hybridization and set-neighbours thresholds had to be changed to 0.1% and 100%, respectively.
Table 1. Overview of the test set used for the validation of Sherlock with planar chemical structures, the number of results regarding the applied constraints and execution time (structure generation, filtering, ranking), the average deviation as well as the number of matching and total signals. Compound names without an asterisk indicate that the dereplication was successful by using the default settings. One asterisk means that parameter adjustments were needed to achieve dereplication, while entries with two asterisks were not found in the knowledge base.
After the three necessary parameter adjustments and the elucidation of each of the first 40 test examples, the rank of the expected compound was determined. In 38 of 40 candidate lists the desired structure was ranked first, whereas in the other two it was ranked #6 and #8. This indicates that the prediction and ranking modules of Sherlock are reliable and that the expected compound is often situated in the top ten of the ranked candidate list.
A full spectrum-to-spectrum assignment (Table 1) was produced in 31 of 40 cases. The overall mean of the average chemical shift deviations was 0.83 ppm.
The number of unassigned signals ranges from one to four in the 9 remaining cases. One reason why the prediction of a signal might fail is that there is no entry in the HOSE code library and thus no prediction value is available. This problem will be overcome when larger open-access collections of assigned NMR spectra become available. Another cause for a missing signal could be a difference larger than the given maximum shift tolerance or average deviation. Furthermore, diastereotopic carbons might not be distinguished by their 3D HOSE code, since the output of pyLSD does not provide stereochemical information and no method is implemented in Sherlock to provide it. In addition, the required stereochemical properties, as suggested [18,19], are not available for all compounds in Sherlock's structure-to-spectrum database, which leads to the same problem (mentioned above) during the HOSE code knowledge base creation and thus also affects the prediction capability.
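The matching of experimental and predicted signals and the resulting average deviation can be illustrated with a short sketch. It is only an illustration under stated assumptions: the greedy nearest-neighbour matching, the function names `match_signals` and `average_deviation`, and the shift values are hypothetical and are not taken from Sherlock's implementation.

```python
def match_signals(experimental, predicted, tolerance):
    """Greedily pair each experimental shift (ppm) with the closest unused
    predicted shift, provided the difference does not exceed the tolerance."""
    remaining = list(predicted)
    pairs = []
    for exp in sorted(experimental):
        # Closest predicted shift that has not been used yet (None if none is left).
        best = min(remaining, key=lambda p: abs(p - exp), default=None)
        if best is not None and abs(best - exp) <= tolerance:
            pairs.append((exp, best))
            remaining.remove(best)
    return pairs


def average_deviation(pairs):
    """Mean absolute difference over the matched pairs, or None if nothing matched."""
    if not pairs:
        return None
    return sum(abs(exp - pred) for exp, pred in pairs) / len(pairs)


# Hypothetical example: the signal at 170.0 ppm has no predicted counterpart,
# e.g. because no HOSE code entry was available for that atom.
experimental = [62.8, 64.3, 113.2, 170.0]
predicted = [63.0, 64.0, 114.0]
pairs = match_signals(experimental, predicted, tolerance=2.0)
print(len(pairs), "of", len(experimental), "signals matched")
print("average deviation (ppm):", average_deviation(pairs))
```

In this toy case one signal stays unassigned, which corresponds to the situation described above where the number of matching signals is lower than the total number of signals.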
Sherlock provides a fragment search for ¹³C query spectra. This was tested for the first 40 case studies as well, by incorporating the first entry of the ranked fragment list in the input file of the structure generator. In the vast majority of the cases (32), the number of candidates was equal to one and the solution was the expected one. Only six of the problems resulted in more than one solution, but mostly with a massive reduction in the size of the solution set. In two cases (14 and 17), no structure was produced due to an improper fragment proposal. Parameter adjustments for the fragment search in examples 9, 30 and 37 led to proper first fragment suggestions and the production of results, including the expected one. The impact of the inclusion of the first proposed fragment in solution structures is illustrated in Figure 1.
For example, test case 22 shows a drastic reduction of the solution set size. From 336 solutions without fragment data, only one structure was left. This was due to the discovery of a fragment covering most of the structure, i.e., the quaternary carbon atoms for which no correlations could be identified in the HMBC spectrum (Figure 2). In addition, the connections of all heteroatoms (oxygens) are provided, which would otherwise increase the search space enormously in the absence of any further structural information.

Test Case 15
This section provides a detailed demonstration of the structure elucidation workflow in Sherlock. First, the available NMR spectra (¹H, ¹³C, HSQC, HMBC, COSY) of compound 15 (Figure 3) were imported into NMRium (Spectra tab). Tetrahydrofuran (THF, see Figure 4) was used as a solvent. The panel in the upper right corner in Figure 4 shows that the spectra were loaded in the frontend (tabs in red frame). The summary panel in the lower right corner contains all the atoms added as placeholders to take into account the user-supplied MF (C12H12O5). The left side panel shows the ¹³C NMR spectrum.
The summary table was then updated according to the positions of the peaks detected in the 1D and 2D spectra. All twelve ¹³C signals were identified and the counter for carbon atoms in the summary panel (blue frame in Figure 5) appeared green. Solvent signals can be marked as such. The correlation data extraction routines do not consider signals whose kind has been changed from the default "signal" to another one (see Figure 5). In this example, solvent peaks were not picked.
The ¹H spectrum analysis identified ten chemical shift ranges, each with a single signal and a relative integration value (Figure 6). To set the number of expected protons and to calculate the relative integrals, the button to change the sum of all ranges was used (framed in red in Figure 6). Eight ranges had a relative integral value close to one and two of them close to two.
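The rescaling of the integrals to the expected number of protons can be sketched in a few lines. The function name `relative_integrals` and the raw integral values are hypothetical and chosen only to reproduce the pattern described above (eight ranges close to one, two close to two, twelve protons in total for C12H12O5); this is not NMRium's code.

```python
def relative_integrals(raw_integrals, expected_protons):
    """Scale raw integrals so that their sum equals the expected number of protons."""
    total = sum(raw_integrals)
    if total <= 0:
        raise ValueError("the sum of the raw integrals must be positive")
    scale = expected_protons / total
    return [value * scale for value in raw_integrals]


# Hypothetical raw integrals of the ten ranges (arbitrary units).
raw = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 10.1, 9.9]
relative = relative_integrals(raw, expected_protons=12)
proton_counts = [round(value) for value in relative]
print(proton_counts)
```

Rounding the rescaled values gives an integer proton count per range, which the user can then compare with the multiplicities derived from the HSQC spectrum.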
The HSQC spectrum contains multiplicity-sensitive information (Figure 7). The summary table was filled with "S+" (single bond, positive peak intensity) or "S-" (single bond, negative peak intensity) during the picking of peak zones. Based on this, the system presets the number of protons attached to the matching heavy atom. For a positive peak it is "1,3" (one or three), which the user has to set to one of those values, and for a negative peak it is 2. In total, 9 signals were picked, two of which belong to the diastereotopic proton pair bound to carbon C8. A proton signal (H10) was left unassigned. The multiplicity of all carbons correlating to a positive HSQC signal was set to 1 since no integration value close to 3 exists in the ¹H spectrum.
The HMBC spectrum and correlated chemical shifts are shown in Figure 8. The "M" in the summary table symbolises the multiple bond correlations established from HMBC signals.
In addition, Figure 9 shows the signals in the COSY spectrum.
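The sign-based presetting of proton counts described for the multiplicity-edited HSQC spectrum can be sketched as follows. The function names and the integral tolerance of 0.3 are assumptions made for illustration; only the mapping of "S+" to "1,3" and of "S-" to 2, and the argument about the missing integral close to 3, come from the text.

```python
def preset_proton_counts(label):
    """Map an HSQC summary-table label to the allowed numbers of attached protons."""
    if label == "S+":
        return (1, 3)  # positive peak: one or three protons, to be resolved by the user
    if label == "S-":
        return (2,)    # negative peak: two protons
    raise ValueError(f"unknown HSQC label: {label!r}")


def resolve_positive(relative_integrals, tolerance=0.3):
    """Return 1 if no 1H integral is close to 3, otherwise None (user decision needed)."""
    has_integral_near_three = any(
        abs(value - 3) <= tolerance for value in relative_integrals
    )
    return None if has_integral_near_three else 1


# Integrals as in test case 15: eight ranges close to one, two close to two.
integrals = [1.0] * 8 + [2.0] * 2
print(preset_proton_counts("S+"), "->", resolve_positive(integrals))
```

With the integrals of test case 15, every positive HSQC signal resolves to one attached proton, in line with the manual choice made above.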
H10 is still unassigned. Due to the absence of any other heavy atom to which it could bind, this proton can only be bound to one of the oxygens (O1 in this case). This was assigned by a right-click on the corresponding cell (Figure 10). The colour of the displayed number of assigned protons in the panel header changed to green, meaning that all hydrogen atoms were bound to a heavy atom. Note that H10 also presents HMBC correlations.
Subsequently, the CASE tab was selected to switch to CASE-related overviews and settings.
In the query panel on the right side, the Elucidation tab was selected. Clicking the Detect button (framed in red in Figure 11) started the analysis routines. The results are visible on the left side of the CASE tab. The output of the hybridization or neighbourhood detection is shown in the non-neighbour and neighbour columns in the summary table. In the case of atom C3 (113.19 ppm), two hybridization state proposals were present, while this information was unambiguous for the other atoms. This means that the frequency of each of these two hybridizations was at least 1% among all occurrences for the requested signal position, multiplicity and proposed MF. The spectral characteristics of atoms C1 (62.79 ppm) and C2 (64.25 ppm) lead to the statistics-based assumption of having at least one carbon and one oxygen as neighbours. In addition, oxygen is considered a forbidden neighbour for C4 to C8 (118.34 to 131.28 ppm).
The Elucidation button (at the bottom of the Elucidation tab, not visible here) started the structure generation process, which in this case completed in about one second. Four structures were suggested (Figure 12), with the expected one in first place in the list. The chemical shifts of all carbons were predicted and the average deviation from the experimental values was about 0.63 ppm.
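The threshold-based hybridization proposal mentioned above (a state is proposed when its frequency reaches at least 1%) can be illustrated with a minimal sketch. The function name `propose_hybridizations`, the labels and the occurrence counts are hypothetical; the sketch only shows the frequency filter, not how Sherlock collects the occurrences from its knowledge base.

```python
from collections import Counter


def propose_hybridizations(observations, threshold=0.01):
    """Return the hybridization states whose relative frequency among the
    observed occurrences is at least the given threshold."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(state for state, n in counts.items() if n / total >= threshold)


# Hypothetical occurrences for one query signal (shift range, multiplicity, MF).
observations = ["SP2"] * 950 + ["SP3"] * 45 + ["SP"] * 5

print(propose_hybridizations(observations))                   # default 1% threshold
print(propose_hybridizations(observations, threshold=0.001))  # lowered to 0.1%
```

Lowering the threshold admits rarer states, which is the kind of manual adjustment reported for test case 25, at the cost of a larger search space.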
No prior knowledge or user influence, such as a manually added or imported fragment, was used in this example, showing that no change of the CASE settings was needed to produce the correct structure. Additionally, the expected structure appeared in first place in the ranking list. To further narrow down the number of results, the fragment search was carried out and the most likely fragment was selected for inclusion in the solution structures. Starting the elucidation again left only one candidate in the result list (Figure 13). Thus, the most likely fragment was successfully incorporated in the unambiguous solution of test case 15.

Methods
The Sherlock CASE software consists of two parts, a frontend and a backend (Figure 14). The frontend acts as a graphical user interface (GUI). It serves for spectra and structure visualisation and for the adjustment of parameters related to NMR data processing and to CASE tasks. The backend runs services that enable database lookup for the identification of known compounds (dereplication) and for the proposal of structural constraints and molecular fragments useful to narrow the chemical search space during elucidation. Tools for ¹³C NMR chemical shift prediction as well as solution structure filtering and ranking complete the set of functions present in CASE programs [2,8]. In addition, the backend also manages the storage and retrieval of previously obtained elucidation results.
Sherlock assumes that 1D and 2D NMR data and the molecular formula of a pure compound are provided as input. In the end, it returns a list of ranked candidate structures that satisfy the constraints expressed by the input data.

Structure-and-Spectrum Database Design
Sherlock offers several CASE-related services which rely on its own knowledge base of structure and spectrum assignments [7,8,16,20]. This knowledge base is involved in the dereplication from a list of ¹³C NMR chemical shifts and is used for the construction of a molecular fragments library. The calculation of the probability for a given chemical shift to be related to a particular structural feature, as used for constrained structure generation, also depends on Sherlock's internal database.
The knowledge base consists of 892,841 records. Each record contains the full description of a molecular structure as a collection of atoms and bonds. Each ¹³C NMR chemical shift value is associated with a multiplicity (the number of attached protons) and an equivalence index. This index indicates the number of chemically equivalent carbon atoms existing in a molecule for a given chemical shift.
Each newly inserted spectrum is checked during database construction for the existence of signals with the same chemical shift and multiplicity. If identity occurs, then the equivalence descriptor of the concerned atoms is set to the number of such signals. Consequently, the sum of all equivalence indexes of a spectrum is equal to the number of carbons in the currently inserted molecule. Figure 15 and Table 2 contain an example where the equivalence index of two signals from a monosubstituted aromatic ring is different from 1. Atoms 6 and 7 as well as 9 and 10 are assigned to a single ¹³C NMR signal in each case.
Each ¹³C spectrum entry in NMRShiftDB [21] was stored in the format presented in Figure 15 and Table 2 to build the reference database. In total, 27,938 experimental and predicted spectra were extracted. In addition, the chemical shift verification tool in Advanced Chemistry Development, Inc. (ACD Labs) C+H Predictors and DB software was used to predict the ¹³C spectra of all molecules (864,903) in the natural product database COCONUT [22]. These structures and spectra were subsequently inserted into Sherlock's knowledge base.
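The equivalence index described in this section can be illustrated by counting identical (chemical shift, multiplicity) pairs. This is a sketch under assumptions: the function name `equivalence_indexes` is hypothetical and the shift values for a monosubstituted aromatic ring are illustrative only; they are not the entry shown in Figure 15 and Table 2.

```python
from collections import Counter


def equivalence_indexes(signals):
    """Count how many carbon atoms share each (shift in ppm, attached protons) pair."""
    return Counter(signals)


# One (shift, multiplicity) tuple per carbon atom of a hypothetical
# monosubstituted benzene with a methyl substituent.
signals = [
    (137.8, 0),
    (129.0, 1), (129.0, 1),  # two equivalent ring carbons
    (128.2, 1), (128.2, 1),  # two equivalent ring carbons
    (125.3, 1),
    (21.4, 3),
]
indexes = equivalence_indexes(signals)
print(indexes[(129.0, 1)], indexes[(128.2, 1)])
print(sum(indexes.values()) == len(signals))
```

The final check reflects the invariant stated above: the equivalence indexes of a spectrum sum to the number of carbon atoms in the molecule.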

Spectra Tab
The NMR spectra are displayed and analysed in Sherlock's frontend by means of the open-source NMR software component NMRium [23]. Spectrum analysis in this context consists of extracting the position of spectral peaks in 1D and 2D spectra in order to establish lists of chemical shift values and of coupling-mediated correlations between them. The full description of the ¹³C query spectrum (Table 2) is the minimum requirement for the dereplication. In the case of an elucidation procedure, a molecular formula is mandatory. The standard proton (¹H) spectrum is recommended to enhance the upcoming analysis. The multiplicity of the ¹³C NMR signals can be deduced automatically from the complementary DEPT-90 and DEPT-135 spectra. The ¹H,X-HMQC/HSQC, ¹H,X-HMBC (X = ¹³C or, less frequently, ¹⁵N), and ¹H,¹H-COSY spectra constitute the minimum set of 2D NMR data necessary for automatic structure generation. The recording of multiplicity-edited 2D HSQC (me-HSQC) spectra constitutes a good alternative to the time-consuming acquisition of DEPT spectra. Moreover, ¹³C signal multiplicity can be automatically deduced from me-HSQC spectra.
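The deduction of ¹³C multiplicities from DEPT-90 and DEPT-135 spectra follows the standard DEPT phase rules: DEPT-90 shows only CH carbons, DEPT-135 shows CH and CH3 with positive and CH2 with negative intensity, and quaternary carbons are absent from both. The sketch below encodes these rules; the function name and the input encoding are assumptions for illustration and do not describe Sherlock's or NMRium's internal routine.

```python
def multiplicity_from_dept(in_dept90, dept135_sign):
    """Return the number of attached protons of a 13C signal.

    in_dept90    -- True if the signal is present in the DEPT-90 spectrum
    dept135_sign -- +1 (positive), -1 (negative) or 0 (absent) in DEPT-135
    """
    if dept135_sign == 0:
        if in_dept90:
            raise ValueError("inconsistent input: present in DEPT-90 but not in DEPT-135")
        return 0  # quaternary carbon
    if dept135_sign < 0:
        return 2  # CH2
    return 1 if in_dept90 else 3  # CH if also present in DEPT-90, otherwise CH3


for observation in [(False, 0), (True, 1), (False, -1), (False, 1)]:
    print(observation, "->", multiplicity_from_dept(*observation))
```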
The analysis of 1D and 2D NMR spectra results in a correlation table (Figure 16). The frontend allows the user to check the validity of the spectrum analysis results and to edit them if the data were interpreted incorrectly, before submission to the backend algorithms for dereplication and structure generation.

CASE Tab
Figure 17 shows the available control panels in the CASE tab, which assist the user in the elaboration of a spectral data summary and the representation of the molecular connectivity diagram (MCD) [1,8,20,24] derived from 2D NMR correlations. Figure 17 also shows which neighbourhood restrictions for carbons or fragments were automatically detected (Sections 3.3.2 and 3.3.3) or manually added by the user. The display component on the right of Figure 17 shows a control panel that allows the user to set the parameter values needed for dereplication and elucidation. After one of these two tasks is completed, the view switches to the result panel, which shows the ranked candidate list. Meta information is provided close to the structure drawings in order to help the user assess the result. Each list of candidate structures produced by an elucidation procedure is stored in Sherlock's backend and is retrievable at any time.
Further details on how to operate the Sherlock system can be found in the user manual, which is embedded in the GUI and thus directly readable in the frontend on the Info tab. The user manual is available in the frontend repository (Section 5.1) as well.

Dereplication
Sherlock supports structural search dereplication, a lookup into a structure-to-spectrum database that prevents the re-elucidation of an already known chemical compound or retrieves very similar ones [2,4,6,7,8,25]. A structure-to-spectrum knowledge base is necessary for this purpose (see above). The user can select parameters that influence the database screening and the filtering of the result list, such as the tolerance value or the maximum allowed average deviation during the chemical shift matching between a 13 C NMR query spectrum and the ones stored in the database.
Spectrum comparison relies on a distance calculated for all valid signal pairs. A signal pair is considered valid if the chemical shift matching is successful. Complementary criteria take into account signal multiplicity or equivalence and can be enabled or disabled.
The closest valid signal pairs are then taken into account for the spectrum-to-spectrum matching. Depending on the number of signals which can be assigned, some of them may be left unpaired because they have no matching counterpart in the other spectrum. Only the matching signal pairs are considered for distance measurement between spectra.
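The matching procedure described above can be sketched in Python as follows. The text does not specify the exact pairing strategy or distance formula, so this sketch assumes a greedy closest-first pairing and an average absolute deviation over the matched pairs. The function names and the 2.0 ppm tolerance are hypothetical, and the optional multiplicity and equivalence criteria are left out.

```python
from typing import List, Optional, Tuple


def match_signals(query: List[float],
                  reference: List[float],
                  tol: float = 2.0) -> List[Tuple[int, int, float]]:
    """Pair signals of two 13C spectra, closest valid pairs first.

    A pair is valid if both shifts differ by at most `tol` ppm. Each signal
    is used at most once; signals without a counterpart stay unpaired.
    Returns (query index, reference index, deviation) tuples.
    """
    candidates = sorted(
        (abs(q - r), i, j)
        for i, q in enumerate(query)
        for j, r in enumerate(reference)
        if abs(q - r) <= tol
    )
    used_query, used_reference, pairs = set(), set(), []
    for deviation, i, j in candidates:
        if i not in used_query and j not in used_reference:
            used_query.add(i)
            used_reference.add(j)
            pairs.append((i, j, deviation))
    return pairs


def average_deviation(pairs: List[Tuple[int, int, float]]) -> Optional[float]:
    """Average absolute shift deviation over matched pairs only (None if no pair)."""
    if not pairs:
        return None
    return sum(deviation for _, _, deviation in pairs) / len(pairs)


query = [20.0, 55.0, 130.0]
reference = [21.0, 128.5, 170.0]
pairs = match_signals(query, reference)
print(pairs, average_deviation(pairs))
```

In this example the signals at 55.0 and 170.0 ppm remain unpaired and do not contribute to the distance. A result filter could then reject database entries whose average deviation exceeds the user-defined maximum.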

Fragment Library
A list of fragments which should be present in each solution, or goodlist [26,27], can be forwarded to Sherlock. Such substructural information can dramatically shrink the number of possible constitutional isomers after structure generation. Moreover, providing fragments helps to cope with the lack of atom proximity knowledge in problems where proton-poor parts of a molecule exist and for which important structural information is therefore unreachable by the commonly used heteronuclear 2D NMR experiments HSQC and HMBC. Fragment data can thus lead to a noticeable reduction of the candidate list size or give hints to the user, who might add complementary structural restrictions manually [4,5,7,8,24,28].
In order to build a fragment library, all entries in the structure-and-spectrum database were fragmented. Every fragment was created by spherical propagation following the instructions of Elyashberg et al. [8]. Starting at an atom in a molecule, specific conditions that preserve connections between heavy atoms are applied to keep important substructural characteristics, such as no bond removal between carbons and hetero atoms or within a ring system if one of the atoms is a starting point [8]. The fragment library consists of around 24.5 million records. Each record has a bit string representation that indicates whether a given chemical shift exists in the assigned subspectrum [7,8,28,29]. During a fragment search, the database is screened via a bit string comparison in which all set bits of a fragment have to be present in the query bitset. Afterwards, the fragments are ranked first by their number of heavy atoms and second by the same spectral matching procedure that is applied during dereplication (Section 3.3.1).
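The bit string screening amounts to a subset test: a fragment passes if every bit set in its subspectrum is also set in the query spectrum. The Python sketch below illustrates this with integers used as bitsets. The 1 ppm bin width, the function names and the example shifts are assumptions; the actual bit layout used in Sherlock is not described here.

```python
from typing import Iterable


def shifts_to_bits(shifts: Iterable[float], bin_width: float = 1.0) -> int:
    """Encode chemical shifts as a bitset with one bit per `bin_width` ppm bin.

    The bin width is an assumption of this sketch. Shifts below 0 ppm are
    clamped into bin 0 to keep the bit index non-negative.
    """
    bits = 0
    for shift in shifts:
        bits |= 1 << max(0, int(shift // bin_width))
    return bits


def fragment_matches(fragment_bits: int, query_bits: int) -> bool:
    """True if all set bits of the fragment are also set in the query."""
    return (fragment_bits & query_bits) == fragment_bits


query_bits = shifts_to_bits([14.1, 22.7, 60.4, 171.0])
ester_like = shifts_to_bits([14.9, 60.2])    # both bins occur in the query
aromatic = shifts_to_bits([14.9, 128.3])     # bin 128 is missing in the query
print(fragment_matches(ester_like, query_bits), fragment_matches(aromatic, query_bits))
```

Such a test needs only one bitwise AND and one comparison per record, which makes a fast pre-filter over millions of fragments plausible before the more expensive spectral matching is applied to the survivors. Fixed bins miss shifts that fall just across a bin boundary, so a real encoding would probably need some tolerance handling.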

Statistics-Driven Generation of Structural Constraints
The elucidation process is further supported by the statistical analysis of the structure and spectra database for the determination of complementary structural restrictions. Sherlock is able to detect the likely hybridization states of atoms as well as their forbidden and mandatory atom neighbourhoods.
The previously collected spectral database was used to count which hybridization states and which neighbouring atoms occur for carbon atoms. This information is linked to a tuple consisting of the 13 C chemical shift value, the multiplicity and the elemental composition (MF) of the molecule, so that for every 13 C signal in a query spectrum the probability of each hybridization state and of each neighbouring atom type can be extracted.
A chemical shift value, a multiplicity and a molecular formula need to be provided in order to request statistical information about hybridization states. In addition, a lower boundary (in percent) is expected to define a minimum occurrence rate for each detected hybridization state compared to all hybridizations stored for a given shift range in the underlying database. If the frequency of a specific hybridization does not reach that given threshold (1% by default), then it will be discarded.
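The threshold filter described above can be sketched in Python as follows. The counts are assumed to have been retrieved already for the requested shift range, multiplicity and molecular formula; the function name and the example numbers are hypothetical.

```python
from typing import Dict


def likely_hybridizations(counts: Dict[str, int], min_percent: float = 1.0) -> Dict[str, float]:
    """Keep the hybridization states whose occurrence rate reaches `min_percent`.

    `counts` maps a hybridization label to the number of database carbons
    found for the request. Returns the retained labels with their rates in percent.
    """
    total = sum(counts.values())
    if total == 0:
        return {}  # no statistics available for this request
    rates = {label: 100.0 * n / total for label, n in counts.items()}
    return {label: rate for label, rate in rates.items() if rate >= min_percent}


# Hypothetical counts for one 13C signal.
kept = likely_hybridizations({"SP3": 8, "SP2": 980, "SP": 12})
print(kept)
```

With these example counts, SP3 occurs in 0.8% of the cases and is discarded, whereas SP2 and SP pass the default 1% threshold.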
Similar to the hybridization search, a minimum occurrence (1% by default) of a neighbour atom type is required. Otherwise, such an atom type will be treated as a non-neighbour (forbidden) for the carbon atom(s) bearing that 13 C request signal. If an atom type appears at least as often as the upper boundary (95% by default), it will be considered a mandatory neighbour. All elements with an occurrence between those two boundaries might or might not be neighbours of an atom during the structure generation.
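The two boundaries lead to a three-way classification of each element, sketched below in Python. The sketch assumes that the occurrence of an element is the percentage of matching database carbons that have at least one neighbour of that element. The function name, the argument layout and the example counts are hypothetical.

```python
from typing import Dict, Iterable, List


def classify_neighbours(counts: Dict[str, int],
                        n_carbons: int,
                        elements: Iterable[str],
                        lower: float = 1.0,
                        upper: float = 95.0) -> Dict[str, List[str]]:
    """Sort candidate neighbour elements into forbidden, mandatory and optional.

    `counts[element]` is the number of matching database carbons with at
    least one neighbour of that element, `n_carbons` the number of matching
    carbons, and `elements` the elements of the molecular formula.
    """
    if n_carbons <= 0:
        raise ValueError("no statistics available for this signal")
    result: Dict[str, List[str]] = {"forbidden": [], "mandatory": [], "optional": []}
    for element in elements:
        percent = 100.0 * counts.get(element, 0) / n_carbons
        if percent < lower:
            result["forbidden"].append(element)
        elif percent >= upper:
            result["mandatory"].append(element)
        else:
            result["optional"].append(element)
    return result


neighbours = classify_neighbours({"C": 970, "O": 400, "N": 3}, 1000, ["C", "N", "O", "S"])
print(neighbours)
```

Only the forbidden and mandatory lists would be turned into constraints for the structure generator; optional elements leave the generator free to decide.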
Sherlock also checks the frequencies of connections between hetero atoms for a given MF. If the proportion of such connections reaches at least 1% of all connections, then hetero-hetero bonds (HHB) are allowed during structure generation.

Structure Generation
PyLSD [12,30], a powerful free and open-source CASE software, takes charge of the structure generation task in the Sherlock backend. It relies on the LSD [11,31] structure elucidation software and provides the ability to deal with atoms with incompletely defined multiplicity or hybridization state. User-defined and automatically detected constraints are passed to pyLSD. Its built-in mechanism of solution ranking was disabled in Sherlock and replaced by a more recent tool (Section 3.3.5).

Spectra Prediction and Ranking
PyLSD-generated candidate structures are ranked according to the similarity of predicted and measured spectra. Prediction relies on a HOSE code-based [32] approach commonly used in CASE systems [2,4,7,8,12,21]. The prediction tool in Sherlock makes use of stereo-enhanced HOSE codes [18].
The number of spheres involved in the creation of the HOSE code library ranges from one to six. During spectra prediction, the HOSE code for the highest number of spheres is created first. If there is no matching entry in the knowledge base, the number of spheres is decreased until matching becomes possible or the number of spheres reaches zero. In the latter case, no prediction is possible for the carbon atom since no reference values exist [21]. During the prediction, the number of HOSE code spheres in use, the number of entries and the chemical shift range are stored to enable a posteriori quality assessment. The final step of the structure elucidation process is to rank the candidate list according to the spectral similarity between the predicted spectra and the experimental one [20].
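The sphere fallback can be sketched in Python as follows. The sketch assumes that the HOSE codes of an atom have already been generated for each sphere count by a cheminformatics toolkit and that the library maps a HOSE code to the list of chemical shifts recorded for it. The use of the mean as predicted value, the function name and the placeholder code strings are assumptions, not Sherlock's actual implementation.

```python
from statistics import mean
from typing import Dict, List, Optional


def predict_shift(hose_by_sphere: Dict[int, str],
                  library: Dict[str, List[float]],
                  max_spheres: int = 6) -> Optional[dict]:
    """Predict one 13C shift, trying the largest HOSE sphere count first.

    Returns the prediction together with the quality indicators mentioned
    in the text (spheres used, number of entries, shift range), or None if
    no sphere count from `max_spheres` down to 1 has a library entry.
    """
    for spheres in range(max_spheres, 0, -1):
        shifts = library.get(hose_by_sphere.get(spheres))
        if shifts:
            return {
                "shift": mean(shifts),
                "spheres": spheres,
                "entries": len(shifts),
                "range": (min(shifts), max(shifts)),
            }
    return None  # zero spheres reached: no prediction possible


# Placeholder strings stand in for real HOSE codes.
library = {"code-2-spheres": [128.0, 129.0, 130.0]}
atom_codes = {3: "code-3-spheres", 2: "code-2-spheres", 1: "code-1-sphere"}
prediction = predict_shift(atom_codes, library)
print(prediction)
```

A prediction based on few spheres, few entries or a wide shift range is less reliable, which is why these values are worth keeping next to the predicted shift.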
The result (limited to 500 structures) and the CASE-related settings are stored in Sherlock's backend service to be retrieved at any time.

Conclusions
An unambiguous interpretation of NMR data can be challenging, due to the many combinatorial possibilities that arise in constitutional space. To address this problem and to support organic chemists in structure determination tasks we introduced Sherlock, a free and open-source software for computer-assisted structure elucidation (CASE). Sherlock's functionality covers the common steps of structure determination.
It provides the processing and visualisation of NMR data stored in common NMR file formats (Bruker, JEOL, JCAMP-DX). Here, the range of functionality extends from data processing and automatic peak picking to a summary of all correlations between the different 1D and 2D spectra.
Furthermore, a given molecular formula and the correlation information are used for the dereplication through a spectral knowledge base or for the de novo elucidation of an unknown compound. The latter includes a lookup for structural constraints for carbon atoms, derived from statistics or fragment search, which serve as input for the structure generator and make it possible to reduce the set of solutions significantly.
Finally, the user receives a list of structure proposals ranked according to the similarity of each predicted spectrum to the experimental one. The results are stored in the system and can be retrieved at any time. The system is able to handle and solve most of the 45 problems used for validation, even with a heavy atom number up to forty.

Implementation
The frontend and backend are fully separated software pieces and exchange data only. Hence, they work independently from each other, a feature which enables the general replacement of one of these two components if desired.
Frontend and backend are available from the internet as Docker containers and can either be run locally or deployed on any cloud system that supports Docker. No login functionality is implemented so far. In a publicly accessible server-based solution, any user can therefore access the results and data of others; the same applies to an offline computer shared by multiple users. A login feature will be implemented in the future to provide data confidentiality to the users.
The backend system supports the storage of atom environments and NMR chemical shift values in the knowledge base (Section 3.1) for a wide range of nuclei types, including 1 H or 15 N. Nevertheless, the presently implemented knowledge base contains 13 C information only and hence dereplication (Section 3.3.1), spectrum prediction and solution ranking (Section 3.3.5) rely solely on 13 C NMR. The incorporation of NMR data of other nuclei types (e.g., 1 H, 15 N) and further software developments will be necessary to expand the scope of these operations. Consequently, the CASE tab in the frontend will be extended in future work to enable multiple spectrum-to-spectrum comparisons and the display of their results.
The backend system uses CASEkit (https://github.com/michaelwenk/casekit, accessed on 22 November 2022), a computational library for computer-assisted structure elucidation which is based on Java and the Chemistry Development Kit [33,34].

Software and Test Data
For the processing and results presented in this manuscript, the Digital Object Identifiers (DOIs) of the freely accessible archived software and the complete test data, including the NMRium files used for the CASE purposes, are given in Tables 3 and 4. The aim is to follow the idea of Research Objects [35] and the FAIR data principles [36]. The structures of the test datasets are provided in the supplementary materials.

Table 3. Overview and DOIs belonging to software archives used in frontend and backend services of Sherlock.
Description DOI
As mentioned in the introduction and to the best of our knowledge, there is no other free and open-source CASE tool with a similar set of both spectral processing and CASE-related functionalities. Hence, a fair and comprehensive comparison between Sherlock and the existing CASE tools is not possible due to their commercial nature.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules28031448/s1, The forty-five structures of the test dataset are available as SMILES.