Chemical features mining provides new descriptive structure-odor relationships

An important goal in researching the biology of olfaction is to link the perception of smells to the chemistry of odorants. In other words, why do some odorants smell like fruits and others like flowers? While the so-called stimulus-percept issue was resolved in the field of color vision some time ago, the relationship between the chemistry and psycho-biology of odors remains unclear up to the present day. Although a series of investigations have demonstrated that this relationship exists, the descriptive and explicative aspects of the proposed models that are currently in use require greater sophistication. One reason for this is that the algorithms of current models do not consistently consider the possibility that multiple chemical rules can describe a single quality despite the fact that this is the case in reality, whereby two very different molecules can evoke a similar odor. Moreover, the available datasets are often large and heterogeneous, thus rendering the generation of multiple rules without any use of a computational approach overly complex. We considered these two issues in the present paper. First, we built a new database containing 1689 odorants characterized by physicochemical properties and olfactory qualities. Second, we developed a computational method based on a subgroup discovery algorithm that discriminated perceptual qualities of smells on the basis of physicochemical properties. Third, we ran a series of experiments on 74 distinct olfactory qualities and showed that the generation and validation of rules linking chemistry to odor perception was possible. Taken together, our findings provide significant new insights into the relationship between stimulus and percept in olfaction. 
In addition, by automatically extracting new knowledge linking chemistry of odorants and psychology of smells, our results provide a new computational framework of analysis enabling scientists in the field to test original hypotheses using descriptive or predictive modeling.


Introduction
Around the turn of the century, with its acknowledgement as an object of science by the Nobel society [1] the hidden sense associated with the perception of odorant chemicals, hitherto considered superfluous to cognition, became a focus of study in its own right. Odors are emitted by food, which is a source of pleasure [2]; they also influence our relations with others [3]. The olfactory percept encoded in odorant chemicals contributes to our emotional balance and wellbeing: olfactory impairment jeopardizes this equilibrium [4,5].
Neuroscientific studies have revealed that odor perception is the consequence of a complex phenomenon rooted in the chemical properties of a volatile molecule (described by multiple physicochemical descriptors) further detected by our olfactory receptors in the nasal cavity [6]. A neural signal is then transmitted to central olfactory brain structures [7]. At this stage, a complete neural representation, called "odor" is generated and then, it can be described semantically by various types of perceptual qualities (e.g., musky, fruity, floral, woody etc.). While it is generally agreed that the physicochemical characteristics of odorants affect the olfactory percept, no simple and/or universal rule governing this Structure Odor Relationship (SOR) has yet been identified. Why does one odorant smell of rose and another smell of lemon? Given the fact that the totality of the odorant message was encoded within the chemical structure, chemists have tried for a long time to identify relationships between chemical properties and odors. Topological descriptors, eventually associated with electronic properties or molecular flexibility, have been tentatively connected to odorant descriptors. For instance, molecules carrying a sulfur atom and/or having low molecular weight or low structural complexity are often rated as unpleasant [8][9][10]. In addition to the hedonic valence of odors, others have looked for predictive models describing odor perception and quality (see [11][12][13][14]). Indeed, this was the aim of a crowd-sourced challenge recently proposed by IBM Research and Sage called DREAM Olfaction Prediction Challenge. The challenge resulted in several models that were able to predict pleasantness and intensity as well as 8 out of 19 semantic descriptors (namely "garlic", "fish", "sweet", "fruit", "burnt", "spices", "flower" and "sour") with an average correlation of predictions across all models above 0.5 [15].
Although these investigations provided evidence that chemical features of odorants can be linked to odor perception, the stimulus-percept problem raises a number of issues. For instance, the stimulus-percept relationship is generally viewed as bijective, in that one physicochemical rule describes or predicts one quality. However, some cases suggest that more than a single rule is needed to relate chemistry and perception. Indeed, chemicals belonging to different families can trigger a "camphor" or a "musky" smell [16]. Conversely, a single chiral center can render a compound odorless or shift its perceived odor completely, as is the case for (+)- and (-)-carvone [17]. These examples strengthen the notion that the connections between the chemical space and the perceptual space are subtler than previously thought, with multiple physicochemical rules describing a given quality. At best, bijective SOR rules may be applicable only to a very small fraction of the chemical space, with the remaining part of the perceptual space being best described using a multiple-rules approach. Because the available databases are complex, including both thousands of chemical properties and a large heterogeneity in perceptual descriptions [18][19][20][21], the manual generation of multiple rules is not feasible. In other words, to better understand the stimulus-percept issue in olfaction, there is a clear need to extract knowledge automatically and in an intelligible manner. Such an approach is positioned upstream of predictive modeling since it extracts, from the data, descriptive rules that link subgroups belonging to both chemical and perceptual spaces. The main aim of our study was to develop such a computational framework to discover new descriptive structure-odor relationships.
To achieve this, we first set up a large database containing more than 1600 odorant molecules described by both physicochemical properties and olfactory qualities. We then developed an original methodology based on the discovery of physicochemical descriptions that distinguish a group of objects given a target or class label, namely odor qualities. This type of approach has been widely studied in Artificial Intelligence (AI), data mining and machine learning. Specifically, supervised descriptive rules were formalized through subgroup discovery and emerging pattern/contrast-sets mining [22]. In all cases, we face a set of objects associated with descriptions, and these objects are related to one or several class labels. This new pattern mining method, a variant of redescription mining [23], allows the discovery of pairs consisting of a description (of physicochemical properties) and a label (or subset of labels, i.e., olfactory qualities). The strength of the rule (SOR in our application) is evaluated through a new quality-control measure detailed in the Methods section.

Olfaction database
We designed and set up a database describing odorant molecules by both their perceptual and physicochemical properties. Here, data from different sources were extracted and grouped: (i) for odorant identification and olfactory qualities, we referred respectively to the PubChem website (https://pubchem.ncbi.nlm.nih.gov/) and the textbook by Arctander [24]; (ii) for physicochemical properties, we referred to the Dragon software package (http://www.talete.mi.it/index.htm).
Olfactory qualities were thus gathered from the book "Perfume and Flavor Chemicals", published in 1969 by Steffen Arctander. In this book, Arctander gives a complete description, including olfactory and trigeminal qualities as well as flavors, of 3102 odorants (detailed physicochemical properties of 1689 odorants among these 3102 odorants were retrieved, see below). These odorants were further identified by chemical name, molecular weight and corresponding olfactory qualities. Here, the 74 olfactory qualities selected by Chastrette and colleagues [25] were used as a reference list. These qualities were selected in a study of the whole of Arctander's book by excluding those that did not provide qualitative olfactory information and those that were the least frequent.
Note that before selecting this source, we ran a comparison with other existing atlases and websites used for research, teaching and applied purposes: specifically, the Dravnieks Atlas [26], the Boelens Atlas (see [27]), and the Flavornet website (http://www.flavornet.org). These sources (atlases, book and website) were compared along a series of parameters (the comparison took into account all odorants for which we collected CID numbers). The first parameter of interest was the number of molecules studied in the source, which was respectively 1689, 138, 263, and 660 for the Arctander, the Dravnieks, the Boelens and the Flavornet (here, only molecules for which we found a PubChem Compound Identification or CID are taken into account). The second parameter was the number of evaluators (and their expertise level) who smelled the compounds and provided the olfactory qualities: one trained evaluator for the Arctander; a large panel of evaluators for the Dravnieks (although there seems to be a large heterogeneity in the expert profile of these panelists, and little information as to the extent of training that panelists were given); six trained evaluators for the Boelens; and no information regarding the panelists for the Flavornet website. Third, when considering the way olfactory qualities were collected in the source, both the Arctander and the Flavornet used a binary format (presence/absence of quality), and both the Dravnieks and the Boelens used a scale of intensity or agreement. Fourth, we compared the number of olfactory qualities used in each atlas/book/website and observed the following distribution (the average number of qualities per molecule is in brackets): 74 (2.88) for the Arctander, 146 (29.99) for the Dravnieks, 30 (12.86) for the Boelens, and 197 (2.72) for the Flavornet.
Note also that the minimum (and the maximum) number of qualities for one molecule was: Arctander (min: 1; max: 10), Dravnieks (min: 5; max: 52), Boelens (min: 0; max: 22), Flavornet (min: 1; max: 5).
Thus, this analysis showed that whereas some sources are characterized by a large number of molecules (e.g. Arctander and Flavornet), others contain only a limited number of odorants (e.g. Boelens and Dravnieks). Moreover, there is great heterogeneity between these different sources with regards to the number and the degree of expertise of the evaluators. Some sources involve a large number of evaluators but with heterogeneous profiles (e.g. Dravnieks) and others involve a limited number of experts (e.g. Boelens and Arctander). Finally, whereas some sources have, on average, between 10 and 30 qualities per odorant (e.g. Boelens and Dravnieks), the average number is around three for others (e.g. Arctander and Flavornet). In view of these parameters, and because the descriptive approach used in this study requires a large database, we used the Arctander book because it contained the highest number of odorant molecules (1689) and a reasonable number of qualities per odorant (2.88 on average).
Odorant physicochemical properties were then obtained using Dragon, a software application that enables the calculation of 4885 molecular descriptors (Talete). Descriptors included in our dataset ranged from the simplest atom types, functional groups and fragment counts, to topological and geometrical descriptors. As Dragon requires 3D structure files, these were collected from the PubChem website (https://pubchem.ncbi.nlm.nih.gov) by using the compound identifier number of each odorant (CID). Individual odorant CIDs were obtained by using the CAS Registry Number and/or the chemical name of the odorant as an entry in the PubChem website. In total, 1689 CIDs were found for the 3102 odorants. In the following section, we study the set $M$ of odorant molecules that are described by $n$ physicochemical properties denoted $F$. Each property $f_i \in F$ is a function that associates a real value with a molecule: $f_i : M \rightarrow \mathrm{image}(f_i)$, with $\mathrm{image}(f_i)$ an interval of $\mathbb{R}$. The olfactory qualities are denoted by $O$, and class is a mapping that associates a subset of $O$ with a molecule: $\mathrm{class} : M \rightarrow 2^{O}$.

The developed algorithm
Here, we developed an original subgroup discovery approach to mine descriptive rules that specifically characterize subsets of olfactory qualities (O). The specificity of this approach is intended to be able to extract rules with several olfactory qualities as targets, and also to treat unbalanced classes robustly, i.e., the fact that some olfactory qualities are very rare (e.g. "musty") compared to others (e.g. "fruity"). Subgroup discovery is a generic data mining method aimed at discovering regions in the data that stand out with respect to a given target.
We instantiated this framework in order to identify the conditions on some odorant physicochemical properties that are strongly associated with olfactory qualities. The molecules whose values on physicochemical descriptors belong to the intervals of the description D are members of the coverage of D:

$$\mathrm{coverage}(D) = \{\, m \in M \mid f_i(m) \in I_i \text{ for every interval } I_i \text{ of } D \,\}$$

where $I_i$ denotes the interval that D imposes on the property $f_i$. We count the number of molecules in the coverage with support(D) = |coverage(D)|. The quality of a rule is evaluated with respect to the olfactory qualities of the molecules in its coverage. First, the precision measure gives the proportion of the molecules of the coverage of D that also have (part of) the olfactory qualities Q:

$$\mathrm{precision}(D \rightarrow Q) = \frac{|\{\, m \in \mathrm{coverage}(D) \mid m \text{ has the qualities } Q \,\}|}{|\mathrm{coverage}(D)|}$$

This is the percentage of times the rule is triggered for molecules whose qualities are in Q. On the other hand, it is also important to know if the rule covers all the molecules of quality Q. This is what the recall measure evaluates:

$$\mathrm{recall}(D \rightarrow Q) = \frac{|\{\, m \in \mathrm{coverage}(D) \mid m \text{ has the qualities } Q \,\}|}{|\{\, m \in M \mid m \text{ has the qualities } Q \,\}|}$$

These two measures tend to behave in opposite ways: when one increases, the other generally decreases. One way to globally evaluate a rule is to use the $F_1$ measure, the harmonic mean of the precision and recall measures:

$$F_1(D \rightarrow Q) = \frac{2 \times \mathrm{precision}(D \rightarrow Q) \times \mathrm{recall}(D \rightarrow Q)}{\mathrm{precision}(D \rightarrow Q) + \mathrm{recall}(D \rightarrow Q)}$$

As mentioned above, the olfactory qualities are more or less frequent in the data. To take that into account, the $F_\beta$ measure gives more importance to the precision measure for rare olfactory qualities, while favoring the recall measure for frequent qualities:

$$F_\beta(D \rightarrow Q) = \frac{(1 + \beta^2) \times \mathrm{precision}(D \rightarrow Q) \times \mathrm{recall}(D \rightarrow Q)}{\beta^2 \times \mathrm{precision}(D \rightarrow Q) + \mathrm{recall}(D \rightarrow Q)}$$

where $\beta$ follows a sigmoid function of the frequency of Q in the data. Here, the terms xBeta and lBeta determine the shape of this sigmoid model, and are values that can be set by the experimenter. Given that, our approach aims to discover rules $D \rightarrow Q$ whose support, support(D), is greater than a threshold minSupp and for which |Q| is lower than or equal to a value maxQual. Those parameters make it possible to identify rules that are supported by sufficient odorant molecules, and also that are specific to a small set of olfactory qualities. The maxQual parameter ensures that the right-hand side of the rule contains a limited number of olfactory qualities, so that it remains interpretable by the analyst. Similarly, a maxProp parameter limits the number of (physicochemical) conditions in the left-hand side of the rules.
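To make the role of the frequency-dependent weighting concrete, the following minimal Python sketch computes an F-beta score together with one possible sigmoid for beta. The F-beta formula is the standard one; the logistic form of `beta_for_quality`, and the way `x_beta` and `l_beta` enter it, are assumptions made purely for illustration (the exact sigmoid used by the method is defined in S1 Text). Under this assumed form, rare qualities receive a beta close to 0 (precision dominates), whereas frequent qualities receive a beta close to 1 (recall weighs as much as precision).

```python
import math


def beta_for_quality(n_molecules: int, x_beta: float = 110.0, l_beta: float = 20.0) -> float:
    """Assumed logistic weight (illustrative only): near 0 for rare qualities, near 1 for frequent ones.

    x_beta is taken as the frequency at which beta equals 0.5,
    and l_beta as the width of the transition.
    """
    return 1.0 / (1.0 + math.exp((x_beta - n_molecules) / l_beta))


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: beta < 1 weights precision more, beta > 1 weights recall more."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Frequencies taken from the dataset description: 2 molecules vs. 570 molecules.
    for frequency in (2, 570):
        beta = beta_for_quality(frequency)
        print(frequency, round(beta, 3), round(f_beta(0.5, 2 / 3, beta), 3))
```

With beta equal to 1, `f_beta` reduces to the F1 measure defined above.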
To illustrate the previous definitions, let us consider the toy olfactory dataset given in Table 1. This dataset contains 6 molecules identified by their IDs, M = {1, 2, 3, 4, 5, 6}. Each molecule is described by its molecular weight MW, its number of atoms nAt and its number of carbon atoms nC, that is, F = {MW, nAt, nC}. In addition, the molecules are associated with their olfactory qualities among O = {fruity, vanillin, woody}. Let us consider a description D whose coverage is coverage(D) = {2, 3, 5, 6}. If we consider the odorant quality Q = {vanillin}, as there are 2 molecules of coverage(D) with this quality, the precision of the rule is equal to 2/4 = 0.5. As there are 3 molecules in the whole dataset with that quality, the recall of the rule is 2/3 ≈ 0.67. Its $F_1$ measure is thus equal to 2 × (0.5 × 0.67) / (0.5 + 0.67) ≈ 0.57. Detailed information regarding the principle of the algorithm is provided in S1 Text.
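The toy computation can be reproduced with a few lines of Python. Because Table 1 is not reproduced here, the descriptor values, the quality assignments and the description `D` below are hypothetical; they were chosen only so that the coverage is {2, 3, 5, 6} and that three molecules carry the vanillin quality, two of them inside the coverage. The sketch also assumes that a molecule "has the qualities Q" when Q is a subset of its qualities.

```python
from typing import Dict, Set, Tuple

Description = Dict[str, Tuple[float, float]]  # descriptor name -> closed interval

# Hypothetical values standing in for Table 1 (not the published table).
MOLECULES: Dict[int, Dict[str, float]] = {
    1: {"MW": 110.0, "nAt": 18, "nC": 6},
    2: {"MW": 152.0, "nAt": 23, "nC": 9},
    3: {"MW": 160.0, "nAt": 25, "nC": 10},
    4: {"MW": 130.0, "nAt": 31, "nC": 8},
    5: {"MW": 172.0, "nAt": 24, "nC": 11},
    6: {"MW": 180.0, "nAt": 27, "nC": 12},
}
QUALITIES: Dict[int, Set[str]] = {
    1: {"fruity"},
    2: {"vanillin", "woody"},
    3: {"fruity", "woody"},
    4: {"vanillin"},
    5: {"vanillin"},
    6: {"woody"},
}


def coverage(description: Description) -> Set[int]:
    """Molecules whose descriptor values fall inside every interval of the description."""
    return {
        mol_id
        for mol_id, props in MOLECULES.items()
        if all(low <= props[name] <= high for name, (low, high) in description.items())
    }


def rule_scores(description: Description, target: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of the rule description -> target."""
    covered = coverage(description)
    with_target = {m for m, quals in QUALITIES.items() if target <= quals}
    hits = len(covered & with_target)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(covered)
    recall = hits / len(with_target)
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    D = {"MW": (150.0, 200.0), "nAt": (20, 30)}  # hypothetical description
    print(sorted(coverage(D)), rule_scores(D, {"vanillin"}))
```

The three scores match the hand computation in the text: a precision of 2/4, a recall of 2/3 and an F1 of 4/7.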

Olfactory dataset: 1689 odorant molecules described by both olfactory qualities and physicochemical properties
Our olfactory dataset includes 1689 molecules described by 74 olfactory qualities. The dataset is multi-labeled, each molecule being associated with one or several olfactory qualities. On average, each molecule is associated with 2.88 olfactory qualities among the 74 possible labels. Moreover, the frequency of olfactory qualities across odorants is unbalanced: on average, a quality is used for 65.79 molecules (standard deviation: 105.28); the maximum is reached for the "fruity" quality (used for 570 molecules) and the minimum for the "musty" quality (used for only 2 molecules).

Physicochemical properties: Selection and interpretation
With regard to the physicochemical properties, our original database contained more than 4000 physicochemical features. For the purpose of a rational approach where features can be interpreted on a chemical basis, we selected attributes that were relevant but, more importantly, easily interpretable. This approach is strongly inspired by the so-called 3D-olfactophore, where such easily interpretable features, computed on odorants sharing the same olfactory percept, are gathered in the 3 dimensions of space. Such features are typically hydrogen bond donor/acceptor, aromatic cycle, charged atom, etc. This methodology is typically useful for molecular scientists to learn about structure-property relationships and design new molecules which fulfill the properties of these olfactophores [28]. Here, the features we used were a series of physicochemical properties. Thus, we selected constitutional, topological and chemical descriptors that represent molecular features which can be easily interpreted and extrapolated for further predictive models. They include the following categories: constitutional indices (n = 29; e.g. "Molecular weight"), ring descriptors (n = 7; e.g. "Number of rings"), functional group counts (n = 40; e.g. "Number of esters"), and molecular properties (n = 6; e.g. "Topological polar surface area"). To select these descriptors, we screened the whole set of descriptors proposed by Dragon. We carefully selected descriptors able to provide information interpretable by any molecular scientist. The cost of selecting interpretable descriptors is a reduction in the description of the dataset. To evaluate the loss of information on the variance of a given molecular dataset, descriptors were computed on a set of 2620 odorants provided by Saito and colleagues [29]. Finally, 347 descriptors remained after filtering out those that were correlated (above 0.85), constant for the whole dataset (no variation), or not available for the whole dataset.
After the dimensionality reduction, our selected 82 descriptors accounted for 37.2% of the original variance. When 82 descriptors were chosen at random within this set of 347, the variance always fell below 25%, suggesting that our descriptors performed quite well at describing a molecular set with a certain degree of variability. Finally, when the entire set of molecules was projected onto the first two components of a PCA, the dataset remained well split and molecules were still distinguishable.
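The three filtering criteria described above (unavailable, constant, or correlated above 0.85) can be sketched as follows. This is an illustrative reading of the text, not the procedure actually used: the greedy order in which correlated descriptors are discarded, the use of the absolute Pearson correlation, the representation of missing values as `None`, and all names in the example table are assumptions.

```python
import math
from typing import Dict, List, Optional


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant value lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def filter_descriptors(
    table: Dict[str, List[Optional[float]]], max_corr: float = 0.85
) -> List[str]:
    """Keep descriptors that are fully available, non-constant and not
    correlated (|r| > max_corr) with a descriptor that was already kept."""
    kept: List[str] = []
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # not available for the whole dataset
        if max(values) == min(values):
            continue  # constant for the whole dataset
        if any(abs(pearson(values, table[other])) > max_corr for other in kept):
            continue  # redundant with a descriptor kept earlier
        kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical descriptor table: one list of values per descriptor.
    example = {
        "MW": [100.0, 150.0, 200.0, 250.0],
        "nAt": [10.0, 15.0, 20.0, 26.0],   # nearly proportional to MW
        "nRings": [1.0, 1.0, 1.0, 1.0],    # constant
        "TPSA": [20.0, 5.0, 30.0, 10.0],
        "logP": [1.0, None, 2.0, 3.0],     # missing value
    }
    print(filter_descriptors(example))
```

Because the filter is greedy, the outcome depends on the order in which descriptors are visited; a different ordering may keep `nAt` instead of `MW`.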

Physicochemical descriptive rules: Generation and selection
First, the physicochemical rules were generated for each of the 74 qualities based on the 82 descriptors. This was done using the following parameters: maxoutput (100), beamwidth (30), MaxQual (1), MaxProperties (8), maxSupp (700), XBeta (110), IBeta (20), and four different minSupp (5, 10, 20 and 30) (see Methods section and S1 Text for a detailed definition of these parameters). Second, an algorithm searched for the best rules or combination of rules (with a maximum of 12 rules) for each of the 74 qualities and the four different minSupp values (from 5 to 30). At this stage, the rules or combinations of rules were ranked as a function of their precision. Here, to evaluate the best rule or combination of rules that can describe each quality, we calculated for each rule (or combination of rules) the Euclidean distance from the "ideal" situation, defined as the data point with an error of 0 (error was calculated as one minus precision) and the best recall (value of 1 on the y-axis, meaning that all molecules that belong to the quality are described by these physicochemical rules). The point(s) with the smallest distance was (were) selected as the best rule or combination of rules for a given quality.
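The selection criterion can be written compactly. The sketch below assumes that each candidate (a rule or a combination of rules) has already been summarized by its precision and recall; the candidate labels and values are hypothetical. The error is taken as one minus precision, the ideal point is (error = 0, recall = 1), and all candidates tied at the smallest Euclidean distance are returned.

```python
import math
from typing import List, Tuple

Candidate = Tuple[str, float, float]  # (label, precision, recall)


def distance_to_ideal(precision: float, recall: float) -> float:
    """Euclidean distance from the point (error, recall) to the ideal point (0, 1)."""
    error = 1.0 - precision
    return math.hypot(error, 1.0 - recall)


def best_candidates(candidates: List[Candidate], tol: float = 1e-9) -> List[Candidate]:
    """Return every candidate whose distance to the ideal point is minimal (ties included)."""
    if not candidates:
        return []
    distances = [distance_to_ideal(p, r) for _, p, r in candidates]
    smallest = min(distances)
    return [c for c, d in zip(candidates, distances) if d - smallest <= tol]


if __name__ == "__main__":
    # Hypothetical candidates for one quality.
    candidates = [
        ("rule 1", 0.9, 0.3),
        ("rules 1+2", 0.8, 0.7),
        ("rules 1+2+3", 0.6, 0.9),
    ]
    print(best_candidates(candidates))
```

Returning a list rather than a single candidate reflects the fact that several rules or combinations can lie at the same distance from the ideal point.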
From this selection, we built a list of rules and/or combinations of rules for each quality (see S1 Table). We showed that around 90% of the olfactory qualities were described by 1 to 6 rules and 66% (49 qualities among 74) were described by 3, 4 or 5 rules (see Fig 3a). Moreover, for the same quality, different rules or combinations of rules were selected because their distance to the "ideal" situation (recall: 1; error: 0) was the same (see an example in Fig 3b). Fig 3c shows an example of the chemical structure of the molecules described by the same quality (jasmine here) and rules/combinations of rules.
To further examine whether the generated physicochemical rules were specific to a given perceptual quality, in other words whether they provided a good and relevant model, we used bootstrap confidence intervals to evaluate whether the F-measure of the generated rules/models was significant. Here, knowing that a given set of rules covers X molecules, we sampled 100,000 sets of X molecules (with replacement) and calculated the F-measure of each sample according to the studied quality. Next, the 99% confidence interval (CI) of these sets was computed. Afterwards, the F-measure of the set of discovered rules was compared to this CI. Results showed that for all 74 qualities, the F-measure was significant in that its value lay outside (and above) the 99% CI.
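The resampling check can be sketched with the Python standard library alone. The code below is an illustration under stated assumptions, not the implementation used for the reported results: molecules are drawn uniformly with replacement, repeated draws of the same molecule are collapsed so that each sample is treated as a coverage set, the F-measure is taken to be F1, and the interval is a simple percentile interval. The function names and the example data are hypothetical.

```python
import random
from typing import List, Set, Tuple


def f1_for_selection(selected: Set[int], has_quality: List[bool]) -> float:
    """F1 of a set of molecule indices treated as the coverage of a rule."""
    hits = sum(1 for i in selected if has_quality[i])
    positives = sum(has_quality)
    if hits == 0 or positives == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / positives
    return 2 * precision * recall / (precision + recall)


def null_f1_interval(
    has_quality: List[bool],
    n_covered: int,
    n_samples: int = 100_000,
    level: float = 0.99,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval of F1 for random selections of n_covered molecules."""
    rng = random.Random(seed)
    n = len(has_quality)
    scores = sorted(
        f1_for_selection({rng.randrange(n) for _ in range(n_covered)}, has_quality)
        for _ in range(n_samples)
    )
    tail = (1.0 - level) / 2.0
    lower = scores[int(tail * (n_samples - 1))]
    upper = scores[int(round((1.0 - tail) * (n_samples - 1)))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical data: 200 molecules, the first 20 carry the quality.
    labels = [i < 20 for i in range(200)]
    observed = f1_for_selection(set(range(15)) | set(range(100, 110)), labels)
    low, high = null_f1_interval(labels, n_covered=25, n_samples=5_000)
    print(round(observed, 3), (round(low, 3), round(high, 3)), observed > high)
```

A rule set is considered specific to the quality when its observed F-measure lies above the upper bound of this interval.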
Finally, to examine how the model built with 82 physicochemical descriptors performed compared to a model built with all 4000 descriptors, we calculated the F-measure for each quality (computed on the basis of all sets of rules) in both types of models. Results showed that, on average, the F-measure was significantly greater (p<0.0001) in the model with 82 physicochemical descriptors (mean = 0.592, SEM = 0.012) compared to the model with all 4000 descriptors (mean = 0.487, SEM = 0.011), reflecting that the use of a small but explicative and intelligible set of descriptors enhances performance.
To sum up, we provide here a computational framework that enables the automatic extraction, from a complex and heterogeneous dataset, of descriptive rules linking subgroups in a chemical space to subgroups in a perceptual space. As can be seen in Fig 3a, only 3 qualities could be best described by a single physicochemical rule, whereas around two thirds of the qualities needed between 3 and 5 rules to be described. With regard to the confidence of the rules, a gradient was observed whereby some rules were associated with a good rate of recall and a minimum rate of error, whereas other rules exhibited a lower confidence in describing olfactory qualities. Note that all the generated rules are available to the reader in S1 Table. The computational approach that we developed is available at the following address: https://projet.liris.cnrs.fr/olfamine/

Interpretation of the physicochemical rules
Here, we analyzed some of the best-known qualities in the field of olfactory evaluation, namely "fruity", "floral", "woody", "camphor", "earthy", "spicy" and "fatty". The analysis of the rules and combinations of rules (see S1 Table) shows that the number of rules is quite high for these qualities, ranging from six (floral) and seven (camphor, earthy) through eight (spicy, woody) and nine (fatty) to twelve (fruity). From a physicochemical point of view, translated into interpretable rules, the floral quality is characterized by either aromatic and strongly hydrophobic molecules or non-aromatic and moderately hydrophobic odorants. For camphor, molecules are rather small in size, moderately hydrophobic, and possibly cyclic. The earthy quality is characterized by moderately hydrophobic molecules with unsaturations. The spicy quality is characterized by rather rigid molecules, possibly aromatic. The woody quality includes hydrophobic molecules that are generally neither cyclic nor aromatic. For the fatty quality, the molecules have a larger carbon-chain skeleton which is highly hydrophobic, with aldehyde or acid functions. Finally, for the fruity quality, molecules are described as having moderate hydrophobicity and being medium to large in size.
To push the interpretation further, we examined qualities associated with generated physicochemical rules with the highest level of confidence. Here, we attempted (i) to understand the rules based on a priori knowledge and (ii) to examine whether the rules could raise new scientific assumptions.
We analyzed a total of eleven qualities corresponding to the first quartile of the distribution of all rules, based on the Euclidean distance to the "ideal" situation; 473 rules were generated by our analysis (see Fig 4). These qualities were: sulfuraceous, vanillin, phenolic, musk, sandalwood, almond, orange-blossom, jasmine, hay, tarry, smoky. Sandalwood odorants are quite diverse, and minor modifications within their structure can abolish the sandalwood note. The rules mined here correspond to models which are very simple and hardly capture the subtlety of this odorant family [28]. The description presented here corresponds to the prototypic beta-santalol structure, which has a campholenic skeleton.
The "almond" quality was described by four rules  0<nHAcc<2.0]. These descriptions suggest that odorants evoking an almond-like quality are compounds bearing at least one oxygen and/or other hydrogen bond-accepting atom but also bearing an aromatic cycle. This means that the structure bears several unsaturations. These chemicals are thus relatively small and can be compared to the prototypical structure of benzaldehyde.

Validation of the physicochemical rules in novel odorants
To evaluate the validity of the generated physicochemical rules, we applied them to novel sets of odorants. For a given quality, we checked whether novel odorants that fulfill the physicochemical criteria according to our descriptive model indeed evoked significantly more of the studied quality than novel odorants that do not fulfill these physicochemical rules.
To this end, we isolated, from 4 different databases, 4 sets of odorants not present in the Arctander database and therefore not used to build the descriptive rules. These databases were from the Dravnieks study [26] (n = 45; i.e. 45 odorants not present in our original dataset could be used), the Boelens & Haring study [30] (n = 56), one set from the Keller et al. study [15] (n = 118), and one set from the Licon et al. study [31] (n = 19). Within each of these four novel sets, olfactory quality was coded using a continuous variable (Dravnieks: from 0 to 100; Boelens & Haring: from 0 to 9; Keller et al.: from 0 to 100; Licon et al.: from 0 to 100). Note that, for the Keller et al. study, perceptual data were provided for 2 levels of odorant concentration ("High" and "Low").
Our descriptive model was tested on qualities that were common to the Arctander database and these four different databases. Moreover, for statistical purposes and for a given quality, comparisons between odorants that fulfilled the criteria for the rules and those that did not were performed only when the rules were fulfilled by at least five odorants. The qualities that satisfied these criteria were: 1/ for the Dravnieks study: Woody (n = 5), Camphor (n = 5), Earthy (n = 5); 2/ for the Boelens & Haring study: Woody (n = 10), Fruity (n = 9), Green (n = 8) and Balsamic (n = 5); 3/ for the Keller et al. study: Fruity (n = 15) and Sulfuraceous (n = 16; which was compared to a semantically proximal perceptual quality present in the Keller database, namely "Decayed"); and 4/ for the Licon et al. study: Camphor (n = 5).
Results are presented in Fig 5. Within each set, an analysis of variance (ANOVA) was performed to compare perceptual values for a given quality between odorants that fulfilled the physicochemical rules (Rule (1), black bars) and those that did not (Rule (0), grey bars). For the Dravnieks dataset, the statistical analysis revealed that odorants that fulfilled the rules for woody, earthy and camphor were respectively perceived as significantly more woody (F(1,43) = 14.19, p<0.001, η² = 0.248; Fig 5a.i), earthy (F(1,43) = 6.128, p = 0.017, η² = 0.125; Fig 5a.ii) and camphoreous (F(1,43) = 28.63, p<0.001, η² = 0.400; Fig 5a.iii). Along the same lines, a significant increase in camphor quality was observed for odorants that fulfilled the rules for this quality in the Licon et al. dataset (F(1,17) = 6.804, p = 0.018, η² = 0.286; Fig 5b). Validation was also observed within the Boelens & Haring dataset, but the results were more mixed. Whereas a significant increase was observed for woody (F(1,54) = 88.47, p<0.001, η² = 0.621; Fig 5c.i) and balsamic (F(1,54) = 15.86, p<0.001, η² = 0.227; Fig 5c.iv) in odorants that fulfilled the physicochemical rules for these respective qualities, this was not the case for the green quality (F(1,54) = 0.227, p = 0.636, η² = 0.004; Fig 5c.ii). On a descriptive level, Fig 5c.iii shows that odorants that fulfilled the physicochemical criteria for the quality fruity seem to be perceived as more fruity, but this was not significant (F(1,54) = 1.989, p = 0.164, η² = 0.036). However, when considering the Keller et al. dataset, validation was reached for fruity: odorants that fulfilled the criteria for the fruity quality were perceived as more fruity than odorants that did not fulfill the rules, at both low (F(1,116) = 9.219, p = 0.003, η² = 0.074; Fig 5d.i) and high levels of concentration (F(1,116) = 11.76, p<0.001, η² = 0.092; Fig 5d.ii).
The statistical analysis of this dataset shows also that odorants that fulfill the physicochemical criteria for the quality sulfuraceous were perceived as more decayed at both low (F(1,116)  To sum up, the present validation involved four sets of stimuli for a total of 238 odorants. It allowed us to test the descriptive model on seven perceptual qualities and for six of them (woody, earthy, camphor, balsamic, sulfuraceous, fruity), the rules generated by our model have been consistent with the ratings provided in these independent datasets.
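For readers who wish to reproduce this type of comparison, the following Python sketch computes the F statistic and the effect size η² of a one-way ANOVA between odorants that fulfill a rule and those that do not. It is a generic textbook implementation rather than the statistical software used for the reported analysis, the function names and example ratings are hypothetical, and it does not compute p-values. The helper `eta_squared_from_f` relies on the identity η² = F × df1 / (F × df1 + df2) for a one-way design.

```python
from typing import List, Tuple


def one_way_anova(groups: List[List[float]]) -> Tuple[float, int, int, float]:
    """Return (F, df_between, df_within, eta squared) for a one-way ANOVA."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Recover eta squared from a reported F statistic and its degrees of freedom."""
    return f_stat * df_between / (f_stat * df_between + df_within)


if __name__ == "__main__":
    # Hypothetical ratings: odorants fulfilling the rule vs. those that do not.
    rule_fulfilled = [4.0, 5.0, 6.0]
    rule_not_fulfilled = [1.0, 2.0, 3.0]
    print(one_way_anova([rule_fulfilled, rule_not_fulfilled]))
    # Reported value for woody in the Dravnieks set: F(1,43) = 14.19.
    print(round(eta_squared_from_f(14.19, 1, 43), 3))
```

Applying `eta_squared_from_f` to the reported F values returns the reported effect sizes to three decimals, for instance 0.248 for F(1,43) = 14.19 and 0.621 for F(1,54) = 88.47.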

Discussion
The interaction between the odorant molecule and the olfactory receptor(s) induces a percept called "odor". Chemists have previously attempted to characterize this phenomenon by working to obtain descriptive and/or predictive rules connecting physicochemical properties to odors [32]. Such is the case with olfactophores or the exploitation of more specific molecular features for predicting intensity or pleasantness [8,11,33,34]. Recently, a large database of compounds as well as a large number of human panelists were used in order to predict percepts, intensity and pleasantness [15]. In our study, we also considered that, to a certain extent, the odor quality of a molecule is encoded in its chemical structure. Our aim was to provide a descriptive model of the relationship between molecules and their perceived odors. To achieve this aim, we set up a new computational framework built on the assumption that, rather than relying on single physicochemical descriptions, the relationship between the chemical space of odorants and the perceptual space of odors should be examined through multiple descriptions. We developed a new method based on a subgroup discovery algorithm to mine descriptive (physicochemical) rules characterizing specific subsets of class labels (olfactory qualities). Thanks to this data-mining approach, we were able to provide new descriptive structure-odor rules with a gradient of confidence (taking into account both the recall and the precision) that varied from one quality to another. Validation of these descriptive models was achieved for a series of olfactory qualities associated with rules ranging from medium levels of confidence (woody, earthy, balsamic, fruity) to higher levels of confidence (sulfuraceous and, to a lesser degree, camphor).
Our findings contribute to a better understanding of the olfactory system by elucidating the relationships between the chemistry and the psychobiology of smells. Indeed, the function of the olfactory system is to detect and discriminate volatile environmental molecules in order to make sense of them. This implies the construction of dedicated percepts that can influence behavior. In order to understand this system, relating the worlds of chemistry and perception is a requirement. Our findings provide descriptive elements of responses and highlight the physicochemical rules that describe olfactory perceptual qualities. Beyond these aspects, our algorithm would benefit from a more systemic approach through the inclusion of neurobiological representational states, ranging from olfactory receptors and olfactory bulb to primary and secondary olfactory areas. This will allow us to better understand how the interaction between the chemical features of odorants and olfactory receptors is mediated and processed in the brain to build olfactory percepts.
One question raised by the current findings is how our descriptive approach differs from other machine learning methods and how it may help chemists and neuroscientists interested in olfaction to solve scientific issues. In contrast to classical predictive machine learning tasks, where the goal is to turn the data into an as-accurate-as-possible prediction machine, exploratory data analysis such as ours aims to automatically discover new insights about the domain in which the data were measured (e.g., olfaction). To this end, the notion of interpretability is fundamental, as it is the premise of descriptive rules. Indeed, these rules are composed of conjunctions of conditions on attributes that conclude on some olfactory qualities. In contrast to black-box models, these rules, assessed by intuitive and mathematically well-founded measures, are easy for a domain expert to assimilate. This, in turn, makes the development of new hypotheses possible. In sum, our data-mining method should be regarded as an approach that can extract knowledge from a dataset characterized by its complexity, size and heterogeneity. Our approach is therefore situated upstream of any hypothetico-deductive approach. The generation of descriptive rules allows researchers to start such a hypothetico-deductive approach: to formulate new scientific assumptions, to establish an experimental methodology and finally to develop and test the validity of predictive models. Our algorithm has made it possible to extract significant knowledge about a series of olfactory qualities. First, the rules generated for qualities with a chemical terminology (sulfuraceous, vanillin, phenolic) were highly reliable. These rules contained expected attributes such as the presence of sulfur atoms to describe "sulfuraceous odors", suggesting that our algorithm was efficient in extracting relevant and meaningful knowledge.
Our results went beyond the sole description of these expected physicochemical attributes. The generated rules also contained unexpected features, as for "phenolic odors", where the presence of moderate-size molecules with few unsaturations and low hydrophilicity was put forward. Structure-odor relationships for some qualities such as musky [35,36], sandalwood [37-39] and, to a lesser degree, almond and jasmine [40] have already been explored in the past. Our descriptive model could bring new information for most of these qualities, thus enabling the testing of innovative hypotheses in the field. Importantly, we revealed the existence of descriptive rules for qualities that have not, to the best of our knowledge, been investigated before. These qualities include orange-blossom, hay, tarry and smoky. The generated rules will help scientists to better understand the chemical composition of the stimuli that evoke these odors and bring new insights into the way these molecules can interact with the olfactory system at the receptor level. Last but not least, our approach also showed that it was difficult to generate reliable rules for some qualities, particularly those most represented in the database (e.g., fruity, floral and woody). Although the recall associated with these rules was not high, they were characterized by a low rate of error, and validation was achieved for some of them, including the well-known fruity and woody qualities. Finally, it is notable that a series of interesting qualities were described by rules with a good level of confidence but may not be precise enough to warrant detailed interpretation at this stage. These qualities are those that belong to the second quartile (Fig 4) and include, for instance, camphor, for which validation with novel odorants was performed using two different external datasets.
A methodological issue that may be raised from our study relates to the choice of Arctander's book in our methodology. Before answering this question, one must detail why such linguistic sources are used in olfactory research. In general terms, whereas emotional reactions are very prominent in olfaction [41], lexical and linguistic processes are relatively limited: spontaneous odor identification performances are around 50% (see [42]). This limitation has led scientists and industry practitioners to develop different sources (atlases, books, websites) listing the olfactory qualities of a series of odorant molecules (Arctander's book [24], the Dravnieks Atlas [26], the Boelens Atlas [27], and the Flavornet website (http://www.flavornet.org)). A comparison of these sources led us to choose Arctander's book, since it contained the highest number of odorant molecules and a reasonable number of qualities per odorant. Because the book was developed by a single scientist, it allowed us to integrate more homogeneous data, with less variable response profiles than those collected in other atlases. However, this same feature also opens up the possibility that certain odorants that evoke a given quality could be missed. One should therefore not discard the possibility that certain molecules that evoke, for instance, the quality "fruity" were not considered by our model in the validation phase because they were just below the perceptual threshold set by Arctander for that particular quality. Given the variability of olfactory perception between individuals, it is conceivable that the same "fruity" quality could have been above the perceptual threshold of another rater. As a consequence of these factors, we face a double challenge: on the one hand, there is a clear need to implement some flexibility in olfactory databases, whereby a given molecule can be described by one or several qualities with an associated level of confidence instead of a binary response.
On the other hand, in order to account for interindividual variability in olfactory perception, olfactory databases need to consist of data from a large number of individuals. Future work will need to overcome these factors, for example, by asking raters to provide a level of confidence alongside each response, or by using a fuzzy logic algorithm in order to provide the model with responses ranging in quality from not at all plausible to extremely plausible. In this way, our model will benefit from a better characterization of olfactory percepts, as the rules generated would be more suited to the complexity of human perception. On a more general front, one interesting perspective in this research field would be to implement a new Atlas that integrates response diversity accompanied by all the strengths present in each individual atlas (see Methods section; large number of molecules, large panel of evaluators in the qualitative description of each odor). Such an atlas could serve as a basis for a large number of: (i) fundamental research studies (to better understand the perceptual olfactory space and its relation to the chemical space and the neuronal space), (ii) applied research studies (to better understand the olfactory properties of new compounds developed by the perfume and flavor industry), (iii) education and teaching actions (to standardize olfactory learning procedures in perfume schools or culinary arts schools).
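As an illustration of how graded responses could replace binary labels, the sketch below generalizes precision and recall to membership degrees between 0 (not at all plausible) and 1 (extremely plausible). This is only one possible formulation, stated here as an assumption rather than a method used in the present study; the function name and the toy values are hypothetical.

```python
from typing import List, Tuple


def soft_precision_recall(covered: List[bool],
                          membership: List[float]) -> Tuple[float, float]:
    """Precision and recall when each molecule carries a graded label.

    covered[i]    -- whether the rule covers molecule i
    membership[i] -- degree in [0, 1] to which molecule i evokes the quality
    """
    if len(covered) != len(membership):
        raise ValueError("covered and membership must have the same length")
    if any(not 0.0 <= m <= 1.0 for m in membership):
        raise ValueError("membership degrees must lie in [0, 1]")
    # Graded counterpart of the number of true positives.
    overlap = sum(m for c, m in zip(covered, membership) if c)
    n_covered = sum(covered)
    total_membership = sum(membership)
    precision = overlap / n_covered if n_covered else 0.0
    recall = overlap / total_membership if total_membership else 0.0
    return precision, recall


# Toy example (hypothetical values): four molecules, one rule, one quality.
covered = [True, True, False, True]
membership = [1.0, 0.5, 0.75, 0.0]

soft_p, soft_r = soft_precision_recall(covered, membership)
print(f"soft precision={soft_p:.2f} soft recall={soft_r:.2f}")
```

With memberships restricted to 0 and 1, these quantities reduce to the usual precision and recall, so a binary atlas remains a special case of the graded one.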
To sum up, current psychological and biological models of olfaction consider that olfactory perception is not totally universal. Although the sense of smell includes invariant aspects, a wide range of olfactory responses are characterized by their diversity from one person to another. In other words, while some molecules can induce very similar behavioral responses and perceptions among individuals, other molecules induce diverse perceptions, not only between individuals but also within the same person according to physiological and cognitive factors. It is undoubtedly in the invariant part of olfaction that we can establish the best predictive models linking chemistry to perception. In this case, a model including bijective rules can even be considered. Nevertheless, the more one moves towards the area of the perceptual space of odors that is characterized by its heterogeneity between individuals, the higher the predictability threshold becomes (i.e., the poorer the predictions). This variability characterizes what could be called "the glass ceiling of olfactory diversity". New methods are thus needed to break or circumvent this glass ceiling. Such methods should integrate the notion of multiple rules for linking the chemical space to these diverse perceptions. Our approach provides some new elements to address this challenging issue.
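The value of multiple rules can be illustrated with a minimal Python sketch in which one quality is described either by a single broad rule or by a disjunction of two narrower rules. Everything in it is an assumption made for illustration: the descriptors (`MW` for molecular weight, `logP` for lipophilicity), the toy molecules, their labels and the rules themselves are invented and do not come from S1 Table.

```python
from typing import Dict, List, Tuple

# A rule is a conjunction of interval conditions: descriptor -> (low, high), bounds inclusive.
Rule = Dict[str, Tuple[float, float]]


def covers(rule: Rule, molecule: Dict[str, float]) -> bool:
    """True when the molecule satisfies every condition of the rule."""
    return all(low <= molecule[name] <= high for name, (low, high) in rule.items())


def covers_any(rules: List[Rule], molecule: Dict[str, float]) -> bool:
    """Disjunction of rules: the molecule is covered if at least one rule covers it."""
    return any(covers(rule, molecule) for rule in rules)


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Precision and recall of a list of predictions against binary labels."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    n_pred, n_act = sum(predicted), sum(actual)
    return (true_pos / n_pred if n_pred else 0.0,
            true_pos / n_act if n_act else 0.0)


# Toy molecules (hypothetical): two chemically distinct groups evoke the same quality.
molecules = [
    {"MW": 220.0, "logP": 4.0},
    {"MW": 230.0, "logP": 4.5},
    {"MW": 150.0, "logP": 2.0},
    {"MW": 140.0, "logP": 2.2},
    {"MW": 100.0, "logP": 0.5},
    {"MW": 225.0, "logP": 1.0},
]
is_woody = [True, True, True, True, False, False]

rule_a = {"MW": (200.0, 250.0), "logP": (3.5, 5.0)}  # heavier, lipophilic group
rule_b = {"MW": (130.0, 160.0), "logP": (1.5, 2.5)}  # lighter, less lipophilic group
broad_rule = {"MW": (130.0, 250.0)}                  # one rule stretched over both groups

single = precision_recall([covers(rule_a, m) for m in molecules], is_woody)
union = precision_recall([covers_any([rule_a, rule_b], m) for m in molecules], is_woody)
broad = precision_recall([covers(broad_rule, m) for m in molecules], is_woody)

print("rule_a alone       (precision, recall):", single)
print("rule_a OR rule_b   (precision, recall):", union)
print("single broad rule  (precision, recall):", broad)
```

In this toy setting, each narrow rule alone misses half of the molecules carrying the quality, and stretching one rule over both groups admits a molecule that does not carry it, whereas the disjunction of the two narrow rules covers both groups without that error.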
In conclusion, the present findings provide two important contributions to the fields of computation and neuroscience. First, although direct SOR seems illusory for some olfactory qualities if additional protagonists of the sense of smell are not taken into account, our approach suggests that descriptive rules exist for some qualities. Second, the present approach showed that several sub-rules should be taken into account when describing structure-odor relationships. From these findings, by correlating the multiple molecular properties of odorants with their perceptual qualities and evoked neural activities, experts in neuroscience and chemistry may generate new and innovative hypotheses in the field. In terms of application, this work can add to our knowledge of the complex phenomenon of smells and tastes. Indeed, by implementing such a descriptive structure-odor model within a dedicated data-analytics platform, we could improve our understanding of the effects of molecular structure on the perception of objects with highly valued odorant properties, such as foods, desserts, perfumes and flavors. This, in turn, would enable the optimization of product formulation with respect to the needs and expectations of consumers.
Supporting information
S1 Text. Information about the algorithms developed for the discovery of structure-odor rules. (DOCX)
S1 Table. List of rules and/or combination of rules for each olfactory quality. (XLSX)