An Active Search Method for Finding Objects with Near-Optimal Property Values within a Given Set

This paper proposes an active search method aimed at finding objects with optimal or nearoptimal y-property values, on the basis of x-variables obtained by indirect, less costly methods. The proposed method progresses in a sequential manner, starting from a small subset of objects with known y-values. At each iteration, the K-nearest neighbour regression technique is employed to obtain estimates ŷ for the objects with unknown y-values. The object with best ŷ value is then subjected to a direct analysis procedure for evaluation of the y-property. Examples are presented with simulated data, as well as actual quantitative structure-activity relationship (QSAR) and near-infrared (NIR) spectrometry datasets. The QSAR and NIR case studies involve the search for maximal antidepressant activity in a set of arylpiperazine compounds and maximal pulp yield in a set of eucalyptus wood samples, respectively. In all these cases, the active search yielded results closer to the maximal y-value compared to the classical Kennard-Stone algorithm for object selection.


Introduction
In many analytical applications, the problem consists of finding an object with optimal or near-optimal value for a y-property of interest, within a given pool of objects.For instance, one may be interested in choosing the most appropriate drug from a certain family of compounds to obtain a desired therapeutic effect.Another example may involve the selection of plant specimens with suitable phenotypical features for breeding programs.However, the direct determination of y in all the available objects may not be a viable option, in view of the required resources, workload and/or time.As an alternative, estimates of y may be obtained in an indirect manner by using x-variables resulting from theoretical calculations or instrumental measurements.For this purpose, chemometric models need to be constructed on the basis of a subset of calibration objects with known y-values.The accuracy of the model predictions tends to be improved as more calibration objects are employed.However, the use of too many calibration objects escapes the purpose of using indirect methods, which consists of reducing the experimental workload associated to the direct determination of y.Ideally, the search for the optimal or near-optimal objects should be carried out with as few determinations of the property of interest as possible, in order to keep the workload to a minimum.
The present paper proposes an active search method to find objects with optimal or near-optimal y-values on the basis of x-variables obtained by indirect, less costly methods.The search algorithm progresses in a sequential manner, starting from a small subset of objects with known y-values.At each iteration, a multivariate weightedaveraging technique (K-nearest neighbour regression or KNNR) 1,2 is employed to obtain estimated ŷ values for the objects with unknown y-values.The object with best ŷ value (smallest in minimization problems or largest in maximization problems) is then selected as a likely candidate for optimality.This candidate object is then subjected to the direct analysis procedure for evaluation of the y-property.
The term active is used herein because the search method interacts with the analyst by indicating the objects that should be subjected to the direct analysis procedure.This concept bears similarity with active learning techniques, which have been studied within the scope of classification 3 and multivariate calibration tasks. 4,5However, such active learning techniques are aimed at the construction of classification or regression models, whereas the present work is concerned with an optimization problem, which does not necessarily involves the construction of an explicit model.
7][8][9] However, these works typically assume that the function f (x) can be evaluated for any vector x with admissible component values.In contrast, the present work concerns problems of chemometrics interest in which the possible choices for x are restricted to a finite pool of objects available for selection.
The proposed active search method is illustrated in a simulated example and two case studies with real data.The first case study concerns a quantitative structureactivity relationships (QSAR) investigation involving a set of antidepressant compounds (arylpiperazine) with inhibition constants (K i ) measured elsewhere. 10In this type of problem, the x-values correspond to molecular descriptors of the compounds under consideration, which are derived through theoretical calculations by using a suitable software.In contrast, the y-property is obtained by experimental procedures such as in vitro or in vivo studies. 11he second case study involves a set of eucalyptus wood samples, with reflectance spectra acquired in the near-infrared (NIR) range. 12In this case, the property of interest is the pulp yield, which should be maximized for use in the pulp and paper industry.It is worth noting that the NIR spectra (x-variables) are acquired by using wood chips that can be easily extracted from the eucalyptus tree.On the other hand, the determination of the pulp yield (y-property) requires the use of a much larger wood sample.

Preliminaries and notation
Vectors are denoted by lowercase boldface letters and scalars by italic characters.The Euclidean norm of a vector x is denoted by || x ||.An empty set is represented by the symbol ∅.The union of two sets A and B is denoted by A ∪ B. The difference between two sets A and B is denoted by A \ B, i.e., the elements of A \ B are those that belong to A and do not belong to B. The notation is employed to indicate the argument i that maximizes g(i).A hat symbol (^) is used to indicate an estimated value.The x-vectors associated to the objects under consideration are denoted by x 1 , x 2 , ..., x N .It is assumed that different objects are associated to different x-vectors, i.e., x i ≠ x j for i ≠ j.
The proposed active search method is initialized by selecting n 0 objects on the basis of the x-vectors alone, i.e., without using any information concerning the corresponding y-values.This initialization is carried out by using the Kennard-Stone (KS) algorithm, which is a classical method for selecting objects in a near-uniform manner on the basis of the Euclidean distances between each pair of x-vectors. 13In the presentation given below, details of the KS algorithm are initially given, followed by a description of the proposed method.

Kennard-Stone algorithm
In the KS algorithm, the first two selected objects are those that are separated by the largest distance in the x-space.Each subsequent object is chosen according to a maximum-minimum procedure in order to avoid the selection of objects that are close together.A formal description of this algorithm is presented below, in the form of a sequence of steps.It is assumed that n 0 > 2 objects are to be selected from a pool of N objects.
Step 1: initialization of the index sets

Let
and denote the index sets of the selected objects and the objects still available for selection, respectively.Initially, no object has been selected and thus = ∅ and = {1, 2, …, N}.
Step 2: distance calculations Calculate the Euclidean distance d(i,j) = || x i -x j || between each pair of vectors x i , x j , for i, j = 1, 2, ..., N, with i ≠ j.
Step 3: selection of the two initial objects Let k 1 , k 2 be the indexes of the two objects that are separated by the largest distance, i.e.: (1) Update the index sets by moving k 1 , k 2 from to , i.e.: Let n = 2 (number of objects already selected).
Step 4: selection of the next object Let k n+1 be the index of the object with the largest distance with respect to the nearest object already selected, i.e.: Update the index sets by moving k n+1 from to , i.e.: Step 5 Let n = n + 1.If n < n 0 return to Step 4.
Step 6: end The index set of the n 0 selected objects is .

Proposed active search method
In what follows, it is assumed that the problem involves the search for the maximal value of y.Minimization problems can be handled in a similar manner by changing the sign of the y-values, i.e., by replacing y for −y.
The proposed method involves a sequential algorithm, which selects a new object at each iteration.It is assumed that the y-values have been experimentally determined for the objects already selected.The KNNR technique 1,2 is then employed to estimate y-values for the objects which have not been selected yet.The object with the largest estimated y-value is then selected.The algorithm can be described as follows, assuming that n 0 initial objects were selected by using the KS algorithm and a total of n max ≤ N objects are to be selected at the end of the active search.

Let
= {k 1 , k 2 , …, k n 0 } and = {1, 2, …, N} \ be the index sets resulting from the initial selection of n 0 objects according to the KS algorithm.Moreover, let {y k , k ∈ } be the y-values corresponding to the initial set of selected objects.Choose the number K of nearest neighbours to be employed in the KNNR technique, with K ≤ n 0 .Let n = n 0 .
Step 2: KNNR calculations Calculate an estimate of y for each object still available for selection, as follows: (7)   where ) is the index set of the K nearest neighbours of x i (in the sense of smallest Euclidean distance) among the objects already selected.
Step 3: selection of the next object Let k n+1 be the index of the object with the largest estimated value of y, i.e.: (8)   Update the index sets by moving k n+1 from to , i.e.: Step 4 Obtain the actual y-property for the selected object and store the resulting value in y k n+1 .
Step 5 Let n = n + 1.If n < n max return to Step 3.
Step 6: end The index set of the n max selected objects is .
It is worth noting that the equation 7 for the estimation of y-values is solely based on Euclidean distances in the x-space, which can be calculated even if the number of variables is larger than the number of objects already analyzed.This is an advantage over multiple linear regression modelling, which typically requires the use of variable selection techniques when the number of calibration objects is small. 14Moreover, there is no need to choose a parameter related to model complexity, such as the number of factors in partial-least-squares modelling. 15he only parameters that need to be chosen in the proposed algorithm are the number n 0 of initial objects that will be selected by the KS procedure, the total number n max of objects to be selected and the number K of nearest neighbours.In the tests carried out in the present work, with both simulated and actual data, a good compromise was obtained by setting n 0 = 10.Using fewer initial objects was not appropriate, since there was not enough information for the active search procedure.On the other hand, using more initial objects escaped the purpose of the proposed method, which is aimed at using as few objects as possible.In practice, the choice of n max will depend on the time and resources available for the analysis of the objects.In fact, by analyzing a larger number of objects, the chance of achieving the optimal value of the y-property increases.
Different values for K will be employed in the case studies, in order to investigate the effect of this parameter in the active search outcome.The results will be compared with a simpler version of the KNNR technique, in which all the objects already selected are employed in the estimation of y.In this case, equation 7 is employed with K = n at each iteration of the algorithm.Such an alternative has the advantage of dispensing with the need to choose a fixed number of nearest neighbours K.
Furthermore, a comparison will be carried out between the active search and the KS algorithm in order to show that better results tend to be obtained with the proposed method.It is worth noting that the KS algorithm is not specifically aimed at optimization, since it favours the exploration of the x-space in a global manner.However, the comparison presented herein is of value to illustrate and justify the search mechanism of the proposed method, which places emphasis on regions associated to larger y values.

Simulated example
The simulated example involved two variables x 1 , x 2 related to a y-property according to the following expression: (11)   which has global and local maximum points, as can be seen in Figure 1.A total of 150 objects were generated by taking random pairs (x 1 , x 2 ) with values in the range [0, 100].The problem consists of finding the object that is closest to the global maximum of y.

QSAR data set
Table 1 presents the structure and biological property values of the 81 arylpiperazine compounds employed in this investigation.The biological property corresponds to experimental values of 5-HT 1A receptor affinity stated in terms of the inhibition constant K i , as reported in. 10 In the present study, this property is expressed as pK i = -log K i , with values ranging from 5.3 to 8.3.
A geometry optimization of all compounds was, then, performed by using the semiempirical method AM1. 16rom these structures, 14 electronic descriptors were calculated at the AM1 level, as shown in Table 2.All these descriptors were assumed to represent electronic properties of the compounds.The descriptor values were autoscaled for use in the present study.The descriptors HL GAP , η and χ not were used in the subsequent calculations, since they bear redundant information with respect to energy of the highest occupied molecular orbital (E HOMO ) and energy of lowest unoccupied molecular orbital (E LUMO ).
A principal component analysis (PCA) using the default settings of the Unscrambler software (Camo Software AS., Oslo, Norway) was employed in order to identify compounds outside the applicability domain.As can be seen in Figure 2a, the compounds 20 and 41 (see Table 1) present leverage and/or residual variance values greater than the corresponding thresholds and were thus removed from the dataset.
In order to evaluate the proposed active search method, the 79 remaining compounds were employed to generate 150 different subsets by using a subsampling procedure.Each subset was created by randomly selecting 54 compounds.The proposed active search method was then applied to each of these subsets in order to find the compound with the largest pK i value in each subset.

Eucalyptus wood data set
This data set comprised 100 samples of wood chips from small eucalyptus tree logs, collected in different Vol.27, No. 7, 2016  plantations.The samples were initially ground in a knife chipper and finally in a UDY mill (UDY Corporation, Fort Collins, CO, USA) to a granulometry smaller than 0.50 mm.Prior to the spectral measurements, the samples remained for three hours in contact with the lab atmosphere with temperature and humidity controlled at 24 ± 1 ºC and 54 ± 2%, respectively.NIR diffuse reflectance spectra were obtained in the range of 1100 to 2500 nm, by using a Foss NIRSystems 5000 spectrophotometer (Silver Spring, MD, USA) fitted with a standard circular cell.Each spectrum was obtained as an average of 50 scans.The resulting spectra for the 100 samples are presented in Figure 3a.To circumvent the problem of systematic baseline variations, first-derivative spectra were employed by using a Savitzky-Golay filter with a second order polynomial and an 11-point window. 17igure 3b presents the derivative spectra, which were used in the active search case study.Pulp yield was determined by subjecting discs of wood from the eucalyptus trees (collected at a standard height) to kraft cooking.These samples were chopped and mixed with different alkali charges to achieve pulp kappa number 15.The results ranged from 48.7 to 55.8% (w/w).
As in the QSAR data set, the applicability domain was also evaluated by using PCA.As a result, the samples 10 and 51 were removed because their leverage values were found to be larger than the corresponding threshold (Figure 2b).A subsampling procedure was also employed to generate different subsets of objects from the remaining 98 wood samples.In this case, 150 subsets with 57 samples each were generated.The proposed active search method was then applied to each of these subsets in order to find the wood sample with the largest pulp yield in each subset.

Software
The electronic descriptors in the QSAR case study were calculated in the Gaussian 03 package software. 18All the other calculations were carried out using the Matlab R2012b software. 19

Simulated example
Figure 4 presents the results obtained in the simulated example with 150 objects described by (x 1 , x 2 ) pairs generated in a random manner.For better visualization, the corresponding y-values were normalized in the [0, 1] range, with 0 and 1 corresponding to the minimal and maximal values in the simulated data set, respectively.The results were obtained by using K = 1, 5 and 10 nearest neighbours in equation 7, as well as all the objects already selected, i.e., with K = n.As shown in Figure 4, the active search method arrived at the object with the maximal y-value after evaluating n = 98 objects (K = 1), n = 16 objects (K = 5) and n = 17 objects (with either K = 10 or all objects, since the same curve was obtained in both cases), including the first 10 objects selected by the KS algorithm.The poor result obtained with K = 1 can be ascribed to an inadequate estimation of the y-values: the estimate is simply set to y-value of the closest neighbour of the object under consideration.The results for K = 5, 10 and all objects were very similar.Therefore, using all objects can be considered a suitable option in this case, since it does not require the choice of a particular value for K.
Figure 5 compares the active search results (using all selected objects in the y-estimation procedure) with those obtained by using the KS algorithm alone.As can be seen in Figure 5a, the maximal y-value is only achieved by KS after the evaluation of n = 42 objects.An interpretation of these results can be provided on the basis of the remaining graphs in Figure 5.The 150 objects involved in this example are indicated in Figure 5b over a contour plot of the function associated to the y-values.The initial 10 objects selected by KS are plotted with markers connected by straight lines.As can be seen, this initial selection favoured an exploration of the x-space, with objects spread over the entire x 1 -x 2 plane.Figure 5c illustrates the first improvement provided by the active search after the KS initialization.As can be seen, after selecting 2 additional objects, an object with a larger y-value was found.This improvement corresponds to the first step in Figure 5a.In contrast, the KS algorithm continued to favour the selection of objects that are not close to each other, as shown in Figure 5d, which prevented the selection of objects close to the best one already found.The object with the largest y-value was found by the active search method after the evaluation of n = 17 objects, as shown in Figure 5e.For comparison, Figure 5f shows that KS still did not exploit the region associated to the larger y-values.respectively.In each case, the plots represent the average values for the 150 subsets generated by the subsampling procedure, as described in the Experimental section.As in the simulated case study, the y-values were normalized in the [0, 1] range and the results were obtained by using K = 1, 5, 10, as well as all the objects already selected, i.e., K = n.Again, the use of K = 1 yielded the slowest convergence to the maximal y-value, which indicates that larger values of K are required for an appropriate estimation of y within the proposed active search algorithm.The results for K = 5, 10 and all objects were similar, with a slight advantage of using all objects in the QSAR data set (Figure 6a).It is worth recalling that using all objects is a convenient option for the analyst, since the choice of particular value of K is not required.

Actual data sets
Finally, Figure 7 compares the active search results (all objects) with those obtained by using KS alone.As can be seen, the active search method provided a faster convergence to the maximal y-value in both data sets, as compared to the KS algorithm.

Conclusions
This paper proposed an active search method, which can be used to find objects with optimal or near-optimal y-property values with reduced experimental workload.The proposed method is based on the K-nearest neighbour regression technique, which does not involve assumptions on the linearity of the x-y relationship and is not adversely affected by multicollinearity problems among the x-variables.The number K of nearest neighbours employed in the estimation of the y-property is a design parameter that needs to be chosen by the analyst.However, the tests carried out in the present work suggest that this choice is not a critical factor for the application of the proposed method.Indeed, good results were also obtained by using a simpler version of the KNNR technique, in which all the objects already selected are employed in the estimation of y.
Examples were presented with simulated data, as well as actual QSAR and NIR spectrometry datasets.In all these cases, the active search yielded results closer to the maximal y-value compared to the classical KS algorithm for object selection.These results indicate that the proposed method is indeed of value to reduce the number of y-property determinations in the search for the optimal object.Future improvements of the proposed method could be concerned with the calculation of confidence intervals for the estimated values ŷ.Such confidence intervals may be of value to establish a stopping criterion for the active search.

Figure 1 .
Figure 1.Relationship between the x-variables and the y-property in the simulated example.

Figure 2 .
Figure 2. Leverage vs. residual variance plots for (a) quantitative structure-activity relationship (QSAR); and (b) eucalyptus wood data sets.The dashed lines correspond to thresholds obtained by using the default settings of the software package.

Figures
Figures 6a and 6b present the active search resultsobtained with the QSAR and eucalyptus wood data sets,

Figure 3 .
Figure 3. (a) Original; and (b) derivative near-infrared (NIR) diffuse reflectance spectra of the 100 eucalyptus wood samples employed in the study.

Figure 4 .
Figure 4. Active search results in the simulated example (the same curve was obtained for K = 10 and all objects).

Figure 5 .
Figure 5. Active search (all objects) and Kennard-Stone (KS) results in the simulated example.(a) Largest value of y (normalized) obtained after analyzing n objects; (b) contour plot of y-values with indication of the overall dataset (asterisk markers) and the 10 initial objects selected by the KS algorithm (markers connected by straight lines).The markers are connected by straight lines following the order in which they were selected.The blue square marker indicates the last selected object.(c), (d), (e), (f) Objects selected by the KS and active search algorithms with n = 12 and n = 17.

Figure 6 .
Figure 6.Active search results for the (a) quantitative structure-activity relationship (QSAR) and (b) near-infrared (NIR) data sets.

Table 1 .
Molecular structures and pK i values of the 81 arylpiperazine compounds employed in this investigation

Table 1 .
Molecular structures and pK i values of the 81 arylpiperazine compounds employed in this investigation (cont.)

Table 2 .
Electronic descriptors and their definitions