Some methods of obtaining quantitative structure-activity relationships for quantities of environmental interest.

Methods are described for obtaining quantitative structure-activity relationships (QSAR) for the estimation of quantities of environmental interest. Toxicities of alkylamines and of alkyl alkanoates are well correlated by the alkyl bioactivity branching equation (ABB). Narcotic activities of 1,1-disubstituted ethylenes are correlated by the intermolecular forces bioactivity (IMF) equation. When the data set has a limited number of substituents in equivalent positions the group number (GN) equation, derivable from the IMF equation, can be used for correlation. It has been successfully applied to aqueous solubilities, 1-octanol-water partition coefficients, and bioaccumulation factors and ecological magnifications for organochlorine compounds. A combination of the omega method for combining data sets for different organisms with the GN equation has been used to correlate toxicities of organochlorine insecticides in two species of fish. Toxicities of carbamates have been correlated by a combination of the zeta method and the IMFB equation. The ABB and the GN equations are particularly useful in that they generally do not require parameter tables, and that the parameters they use are error-free. The methods presented here, as shown by the examples given, should make it possible to establish a collection of QSAR for toxicities, bioaccumulation factors, aqueous solubilities, partition coefficients, and other properties of sets of compounds of environmental interest.


Introduction
Two problems of major interest to environmental scientists are the prediction of chemical toxicity and of properties such as bioaccumulation. The most effective method for making estimates of these and similar quantities is correlation analysis. It requires a minimal amount of time, is low cost, and the statistics can easily be obtained using microcomputers and readily available programs. The method is generally applicable to the modeling of the variations of chemical properties and reactivities, physical properties and biological activities with change in molecular structure.
The basis of correlation analysis is the assumption that chemical, physical or biological properties of members of a data set of interest are a linear function of some property of structurally similar members of a reference data set. The reference set is generally used to define a parameter. Values from the data set of interest are then correlated with these parameters by means of simple or multiple linear regression analysis.
*Chemistry Department, School of Liberal Arts and Sciences, Pratt Institute, Brooklyn, NY 11205.

Electrical Effects
Chemical reactivities and physical properties in solution can usually be completely described in terms of electrical and steric effects. It is convenient to operationally define two classes of electrical effects: localized (field and/or inductive) electrical effects and delocalized (resonance) electrical effects (1). They are represented by the $\sigma_I$ and $\sigma_D$ parameters, respectively. Four different types of $\sigma_D$ constants are required. We may write the structure of a data set in the form XGY, where X is a variable substituent; Y is the active site, an atom or group of atoms at which some quantifiable phenomenon occurs; and G is a skeletal group to which X and Y are bonded. Y and G are usually held constant throughout the data set. The type of $\sigma_D$ constant required to model the delocalized electrical effect in a given system depends on the nature of both G and Y (1,2). When both X and Y are bonded to sp3-hybridized C atoms in G, no delocalized electrical effect is observed, and the electrical effect depends only on $\sigma_I$. When X is bonded to C atoms hybridized $sp^n$ with $1 \le n \le 2$ and Y is bonded to a C atom hybridized sp3, the $\sigma_R$ constants give best results. When both X and Y are bonded to C atoms hybridized $sp^n$, there are three pos-
The subscript n equals $P_D$ and therefore describes the composition of the constants.

Steric Effects
Steric effects are conveniently described either by parameters based on van der Waals radii ($r_V$) or by the branching equations (3-6). The $\upsilon$ steric parameter is defined by the equation
$\upsilon_X = r_{VX} - r_{VH} = r_{VX} - 1.20$
Substituents whose steric effect is conformationally dependent cannot be characterized by a single steric parameter, as the steric effect they exert will depend on the steric requirements of the active site and of the process under study. Groups of the type $MZ^1_2Z^2$ and $MZ^1Z^2Z^3$, such as CHMe2 and CHClMe, have conformationally dependent steric effects. One way of representing the steric effects of groups of this type is to define several different sets of effective $\upsilon$ values for use with different processes. Sets of $\upsilon'$ and $\upsilon^*$ values have been defined for this purpose (7,8). Alternatively, the branching equations can be used. The simple branching (SB) equation is given by the expression
$Q_{Ak} = \sum_{i=1}^{m} a_i n_i + a_0$ (5)
where Q is the quantity to be correlated and Ak represents an alkyl group; $n_i$ is the number of branches at all atoms in the alkyl group labeled i and is equal to the number of atoms labeled i + 1 (Fig. 1); $a_i$ and $a_0$ are coefficients. The SB equation has been applied successfully not only to alkyl groups but also to perfluoroalkyl groups (9) and to amino acid side chains, assuming that N, O and S atoms exert the same steric effect as do C atoms (10). H atoms are considered to have a negligible effect. The SB equation has some major disadvantages. Of particular importance is the assumption that all of the branches at a given atom exert the same steric effect. This is only an approximation, and often not a very good one. The equation is also not capable of dealing with planar $\pi$-bonded groups. The branching equations are designed for use with atoms that are sp3 hybridized. Cycloalkyl groups can be handled by calculating effective $n_i$ values for them.
In so doing, one of the major advantages of the branching equations is lost: the fact that for alkyl groups in particular, and for tetrahedral atoms in acyclic groups in general, the $n_i$ are exact, error-free parameters. The expanded branching (XB) equation, Eq. (6), permits a more accurate representation of steric effects because it distinguishes between the first, second and third branches at an atom. The XB equation takes the form
$Q_{Ak} = \sum_{i=1}^{m} \sum_{j=1}^{3} a_{ij} n_{ij} + a_{00}$ (6)
where the subscript i designates the i-th atoms in the group and the subscript j designates the branch at the i-th atom. Again, $a_{ij}$ and $a_{00}$ are coefficients; $n_{ij}$ is equal to the total number of atoms labeled (i + 1)j, as is shown in Figure 2. Relative to the SB equation the XB equation has two disadvantages: it requires a much larger data set for good results because of the larger number of independent variables, and, as $n_{11} = 1$ for all alkyl groups except Me, a value of $a_{11}$ generally cannot be determined directly. Another steric parameter which can sometimes be usefully incorporated into the SB and XB equations is $n_b$, defined as
$n_b = n_M$ (7)
where $n_M$ is the number of atoms in the longest chain of the substituent. $n_b$ is a measure of group length.
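The counting rule behind the SB equation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list representation of an alkyl group (the argument is the list of carbons attached to C1, and each carbon is in turn the list of carbons attached to it, moving away from the active site) and the function name `branching_numbers` are hypothetical and do not come from the paper. The sketch simply applies the statement that $n_i$ equals the number of atoms labeled i + 1.

```python
from collections import Counter

def branching_numbers(children_of_c1, max_i=3):
    """Return [n_1, ..., n_max_i] for an acyclic alkyl group.

    `children_of_c1` lists the carbons bonded to C1 (the atom attached to
    the active site); each carbon is itself a list of the carbons bonded
    to it further out. H atoms are ignored, as in the SB equation.
    """
    counts = Counter()

    def walk(children, label):
        # Atoms in `children` carry the label (label + 1), so they are
        # the branches counted in n_label.
        counts[label] += len(children)
        for child in children:
            walk(child, label + 1)

    walk(children_of_c1, 1)
    return [counts[i] for i in range(1, max_i + 1)]

# Hypothetical encodings of some common alkyl groups.
groups = {
    "Me": [],              # CH3
    "Et": [[]],            # CH2CH3
    "i-Pr": [[], []],      # CH(CH3)2
    "t-Bu": [[], [], []],  # C(CH3)3
    "Pr": [[[]]],          # CH2CH2CH3
    "i-Bu": [[[], []]],    # CH2CH(CH3)2
    "s-Bu": [[], [[]]],    # CH(CH3)CH2CH3
}

for name, tree in groups.items():
    print(name, branching_numbers(tree))
```

Because the values are obtained by counting atoms, they carry no experimental error, which is the property the text emphasizes for acyclic groups.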

Transport Parameters
In a series of papers which represent the most important advance in the quantification of biological activities up to the present time, Hansch and his co-workers (11-14) found that an equation of the form
$BA_X = \rho\sigma_X + S\upsilon_X + T_1\tau_X + T_2\tau_X^2 + h$ (8)
is generally applicable to biological activities. $\rho$, S, $T_1$, $T_2$ and h are coefficients determined by correlating some data set of interest with Eq. (8) by means of multiple linear regression analysis. The $\sigma$ and $\upsilon$ constants have been described above. $\tau$ is a transport parameter. It is convenient to classify transport parameters as primary or secondary. The former include the logarithms of the partition coefficient P, the molar solubility (Sol), and the chromatographic flow rate $R_f$. The latter include $\pi$, defined as
$\pi_X = \log P_X - \log P_H$ (9)
where $P_X$ and $P_H$ are the partition coefficients for the X-substituted and the unsubstituted compounds, respectively; and $R_M$, defined in the usual way as
$R_M = \log (1/R_f - 1)$
Transport parameters are a function of the differences in intermolecular forces (imf) between the substrate and phase 1 and those between the substrate and phase 2.
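As an illustration of how a relationship of the form of Eq. (8) is fitted, the following Python sketch computes $\pi$ from Eq. (9) and estimates the five coefficients by multiple linear regression. Everything numerical here is an assumption made for the example: the data are synthetic and noise-free, the coefficient values are invented, and the helper names `pi_constant` and `hansch_design` are hypothetical. NumPy's least-squares routine stands in for whatever regression program is actually used.

```python
import numpy as np

def pi_constant(log_p_x, log_p_h):
    """Secondary transport parameter of Eq. (9): pi_X = log P_X - log P_H."""
    return log_p_x - log_p_h

def hansch_design(sigma, upsilon, tau):
    """Columns for Eq. (8): sigma, upsilon, tau, tau**2 and a constant for h."""
    return np.column_stack([sigma, upsilon, tau, tau**2, np.ones_like(tau)])

# Synthetic, noise-free illustration data (not measured values).
rng = np.random.default_rng(0)
n_compounds = 12
sigma = rng.uniform(-0.3, 0.7, n_compounds)
upsilon = rng.uniform(0.0, 1.2, n_compounds)
log_p_h = 2.0                                   # assumed parent log P
log_p_x = log_p_h + rng.uniform(-1.0, 2.0, n_compounds)
tau = pi_constant(log_p_x, log_p_h)

# Invented coefficients in the order rho, S, T1, T2, h.
true_coef = np.array([0.8, -0.5, 1.1, -0.4, 2.0])
X = hansch_design(sigma, upsilon, tau)
activity = X @ true_coef

# Multiple linear regression by ordinary least squares.
coef, *_ = np.linalg.lstsq(X, activity, rcond=None)
print(dict(zip(["rho", "S", "T1", "T2", "h"], np.round(coef, 3))))
```

With real data the fit would not be exact, and the significance of each coefficient would have to be judged from the regression statistics, as is done for the correlations reported in this paper.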
The intermolecular forces of interest, together with the parameters used to model them, are set forth in Table 2. Consider a data set of transport parameters that has a variable substituent X. It is possible for X to exert a steric effect on the solvation of functional groups that are in proximity to it. In modeling the transport parameter it is therefore necessary to include a term which represents the steric effect of X. Then, using the parameters given in Table 2 and the steric parameter $\upsilon$, we obtain the intermolecular force (IMF) equation,
$Q_X = L\sigma_{IX} + D\sigma_{DX} + A\alpha_X + H_1 n_{HX} + H_2 n_{nX} + I i_X + S\upsilon_X + B_0$ (12)
where Q is some transport parameter. In place of the term $S\upsilon_X$ either the SB or the XB equation may be used, giving the alternative relationships
$Q_X = L\sigma_{IX} + D\sigma_{DX} + A\alpha_X + H_1 n_{HX} + H_2 n_{nX} + I i_X + \sum_{i=1}^{m} a_i n_i + a_0$ (13)
and
$Q_X = L\sigma_{IX} + D\sigma_{DX} + A\alpha_X + H_1 n_{HX} + H_2 n_{nX} + I i_X + \sum_{i=1}^{m} \sum_{j=1}^{3} a_{ij} n_{ij} + a_{00}$ (14)
When the substituent X is bonded to an sp3-hybridized C atom, the term in $\sigma_D$ drops out. Transport parameters of amino acids have been successfully correlated with a relationship of this type [Eq. (15)] (15). Values of log P and $\pi$ for Ph(CH2)nX, PhX, XC6H4O2CNHMe, XC5H4NO2, and XC5H4N have been correlated with Eq. (12) or relationships derived from it (15,16).
If X is restricted to alkyl groups, in which case $\sigma_I$ and $\sigma_D$ are constant while $n_H$, $n_n$ and i are equal to zero, and $\alpha$ is a linear function of the number of carbon atoms in the group ($n_C$), Eqs. (13) and (14) lead to the alkyl bioactivity branching (ABB) equation (19). As $n_C$ and $n_C^2$ are normally highly collinear, it is useful to rescale $n_C$, thereby breaking the collinearity (20,21). We define $n^*_C$ as
$n^*_C = n_C - (n_{C,max} + n_{C,min})/2$ (20)
where $n_{C,max}$ and $n_{C,min}$ are the maximum and minimum values of $n_C$ in the data set. The ABB equation has been successfully applied to the toxicities of compounds of the type AkY where Y is a group such as -OH, -O-, -OPO(OEt)2, -SPO(Et)2, and -N(NO)- (22).
Oral LD50 values in the rat (mg/kg) for alkylamines were converted to units of mmole/kg and correlated with the ABB equation with m = 2. Observed and calculated values are reported in Table 3 (set T51). Best results were obtained on exclusion of the data point for dodecylamine. The observed value seems very high, possibly due to the very low water solubility of this compound. Two of the independent variables are significantly collinear (partial correlation coefficient 0.729, confidence level 95.0%). As $a_2$ was not significant, $n_2$ was excluded from the correlation, giving
$\log LD_{50,Ak} = -0.192(\pm 0.0654) n_1 - 0.101(\pm 0.0200) n^*_C - 0.0201(\pm 0.0111) n^{*2}_C + 0.909(\pm 0.0875)$ (21)
with F = 10.66, s = 0.0993, $100R^2$ = 88.88, n = 8. The dependence on $n^{*2}_C$ is borderline, probably due to the small size of the data set.
Oral LD50 values in the rat for alkyl alkanoates (23), again converted to mmole/kg, were correlated with a modification of the ABB equation,
$Q_{Ak} = a_C \sum n^*_C + a_{C2} \sum n^{*2}_C + \sum_{k=1}^{p} \sum_{i=1}^{m} a_{k,i} n_{k,i} + a_0$
The index k identifies the alkyl group. Thus $n_{2,1}$ refers to the number of branches at C1 in alkyl group 2, for the alkyl alkanoates (I).
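The effect of the rescaling in Eq. (20) on the collinearity of the carbon-number terms can be checked numerically. The sketch below is an illustration only: the chain lengths are an arbitrary series from 1 to 8 carbons rather than a data set from this paper, and the helper names `pearson_r` and `rescale_nc` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rescale_nc(n_c):
    """Eq. (20): subtract the midpoint of the range of n_C in the data set."""
    midpoint = (max(n_c) + min(n_c)) / 2
    return [n - midpoint for n in n_c]

n_c = list(range(1, 9))        # illustrative chain lengths, 1 to 8 carbons
n_star = rescale_nc(n_c)

r_raw = pearson_r(n_c, [n ** 2 for n in n_c])
r_star = pearson_r(n_star, [n ** 2 for n in n_star])

print(f"r(n_C, n_C^2)     = {r_raw:.3f}")
print(f"r(n*_C, n*_C^2)   = {r_star:.3f}")
```

For this evenly spaced series the correlation between the linear and squared terms falls from about 0.976 to zero. For a data set that is not symmetric about the midpoint of its range the correlation would be reduced rather than removed.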

The IMF Equation
The IMF equation [Eq. (12)] and relationships derived from it have been used to correlate log (per cent inhibition) of Gly-Leu hydrolase by amino acids (24) and binding of substrates to biopolymers (16). Narcotic activities of 1,1-disubstituted ethylenes, as measured by the concentration required for 50% of white mice to fall over on one side in a 2-hr exposure, were correlated with a relationship derived from the IMF equation,
$\log BA_{X^1X^2} = L\Sigma\sigma_{IX} + D\Sigma\sigma_{DX} + A\Sigma\alpha_X + H_2\Sigma n_{nX} + S_1\upsilon_{X^1} + S_2\upsilon_{X^2} + B_0$ (24)
As in this data set $L \approx D$, the composite electrical effect constant $\sigma_{50}$ was used in place of $\sigma_I$ and $\sigma_R$. $S_1$ and $S_2$ were not significant and therefore the terms in $\upsilon_{X^1}$ and $\upsilon_{X^2}$ were dropped. Observed (25) and calculated values are set forth in Table 5 (set T102).
In Eq. (26), m is the number of groups X. For a given substituent X all of the substituent constants are constant. Then, if m is varied,
$Q_{X_m} = (L\sigma_{IX} + D\sigma_{DX} + A\alpha_X + H_1 n_{HX} + H_2 n_{nX} + S\upsilon_X) m + B_0$ (27)
or
$Q_{X_m} = B_1 m + B_0$ (28)
Then if there are l such X groups in the data set we may write
$Q_X = \sum_{i=1}^{l} B_i m_{X_i} + B_0$ (29)
If the data set is limited to a few substituents and we assume that the skeletal group can be represented by a term in the number of carbon atoms, $n_C$, we have
$Q_X = \sum_{i=1}^{l} B_i m_{X_i} + a_C n_C + B_0$
which is the group number (GN) form of the IMF equation. The limitation in the number of substituents is determined by the size of the data set, as each substituent requires a separate independent variable and the data set must be large enough to provide a sufficient number of degrees of freedom.
A group of compounds of major environmental interest is that which contains C, H, often Cl, and occasionally O. Values of log P and of log $K_B$ (where $K_B$ is the bioaccumulation factor) for these compounds (26) were correlated with the GN equation, giving Eqs. (32) and (33); the results are given in Table 6 (sets P2 and B1, respectively). Equations (32) and (33) make possible reasonably good estimates of log P and log $K_B$ for a wide range of arenes and organochlorine compounds from the empirical formula of the compound of interest. A very great advantage of both the ABB and the GN equations is that the parameters are error free. A given alkyl group has an exact number of branches at the i-th C atoms. A given compound has an exact number of Cl atoms or of O atoms. Furthermore, parameter tables are not required for the use of these equations.
The species of fish used was Gambusia affinis (27). Results of correlation with Eq. (35) showed that $n_C$ and $m_{Cl}$ are strongly collinear (partial correlation coefficient = 0.803; 98.0% confidence level). The term in $n_C$ was therefore dropped from the correlation equation. In this correlation only C, Cl, S and O atoms which were part of X were used to determine m. Thus, for example, when $X^1 = X^2$ = OMe, $m_O = n_C = 2$ and $m_{Cl} = m_S = 0$. Observed and calculated values are given in Table 7 (set B2). Aqueous solubilities of some organochlorine insecticides (27) were correlated with the GN equation in the form (37)
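Because the GN variables are simple atom counts, they can be read directly from an empirical formula. The following Python sketch shows one way this might be done. It is an illustration under stated assumptions: the function names `atom_counts` and `gn_variables` are hypothetical, the parser handles only plain formulas without parentheses or charges, and atoms are counted over the whole molecule (in the correlation with Eq. (35) described above, only atoms belonging to X were counted). No coefficients are supplied, since the fitted equations are not reproduced here.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a plain empirical formula such as 'C14H9Cl5'."""
    counts = Counter()
    # An element symbol is one capital letter plus an optional lowercase
    # letter, followed by an optional integer count (default 1).
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def gn_variables(formula, groups=("Cl", "O")):
    """Return the GN independent variables: n_C plus one m per group."""
    counts = atom_counts(formula)
    row = {"n_C": counts["C"]}
    for group in groups:
        row[f"m_{group}"] = counts[group]
    return row

# Empirical formulas of two organochlorine insecticides.
for name, formula in [("DDT", "C14H9Cl5"), ("dieldrin", "C12H8Cl6O")]:
    print(name, gn_variables(formula))
```

Each row produced in this way would serve as one line of the design matrix for a multiple linear regression against log P, log $K_B$ or another quantity of interest.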

The Zeta Method
We now consider the use of a method which permits combining several data sets into one single large set.
To illustrate the method we may examine the reaction of a set of substrates XGY with some constant reagent Rg. The reaction conditions, including the temperature T, the pressure P, the solvent Sv, and the ionic strength Is, are also held constant. As was noted in the introduction, the skeletal group G and the active site Y are held constant throughout a data set. Thus, though the quantity Q is a function of all of these variables,
$Q = f(X, G, Y, Rg, T, P, Sv, Is)$ (39)
all but X are normally held constant. If we write the most general correlation equation we obtain
$Q = L\sigma_{IX} + D\sigma_{DX} + S\upsilon_X + g\zeta_G + y\zeta_Y + r\zeta_{Rg} + t\zeta_T + p\zeta_P + s_1\zeta_{Sv} + s_2\zeta_{Is} + h_0$ (40)
Observed and calculated values of LD50 are set forth in Table 9 (set T401). It seems likely that the toxicity mechanism for IIIa is different from that for IIIb and IIIc. Structurally, IIIb and IIIc are very similar to each other and very different from IIIa.

The Omega Method
Finally, it has been shown that when a set of compounds XGY has undergone biological testing in two or more organisms, giving two or more data subsets, these may be combined into a single data set on the condition that the same mechanism for the biological activity is extant in each subset (29). As the biological activity is generally due to the interaction of the substrate with some receptor site on a biopolymer, and mutation can easily alter the receptor site, the definition of reproducible parameters characteristic of the organism is difficult. It is best to resort to internal parameterization as described above. The method has been applied to LC50 toxicity data for organochlorine insecticides in two species of fish, rainbow trout (Salmo gairdnerii) and bluegill (Lepomis macrochirus) (27), using the GN equation in the form
$Q_X = a_C n_C + B_1 m_{Cl} + B_2 m_{Cl'} + B_3 m_O + \Omega\omega + B_0$ (45)
where $\omega$ is the organism parameter. The latter was defined in this data set as the Q value for heptachlor. Best results were obtained by excluding the value for endrin in bluegill from the correlation.
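The way an internally defined organism parameter lets two subsets be pooled can be sketched as follows. This is a hedged illustration, not the correlation reported above: the compounds, descriptor values, coefficients and organism offsets are all invented, the GN part is reduced to two variables ($n_C$ and $m_{Cl}$), and the function name `combine_with_omega` is hypothetical. The only element taken from the text is the idea that $\omega$ for each organism is the Q value of one reference compound measured in that organism.

```python
import numpy as np

def combine_with_omega(subsets, reference):
    """Pool per-organism subsets into one design matrix and response vector.

    `subsets` maps organism -> {compound: (descriptor tuple, Q)}.
    omega for an organism is the Q value of `reference` in that organism.
    Columns of the matrix: descriptors..., omega, constant.
    """
    rows, q_values = [], []
    for data in subsets.values():
        omega = data[reference][1]          # internal parameterization
        for descriptors, q in data.values():
            rows.append([*descriptors, omega, 1.0])
            q_values.append(q)
    return np.array(rows), np.array(q_values)

def q_synthetic(n_c, m_cl, offset):
    """Invented activity model used only to generate example data."""
    return 0.2 * n_c + 0.3 * m_cl + offset

# Hypothetical compounds described by (n_C, m_Cl); 'ref' is the reference.
descriptors = {"ref": (10, 6), "a": (12, 8), "b": (14, 5), "c": (12, 6)}
offsets = {"organism_1": 1.0, "organism_2": 1.5}   # organism sensitivity
subsets = {
    organism: {name: (d, q_synthetic(*d, offset))
               for name, d in descriptors.items()}
    for organism, offset in offsets.items()
}

X, q = combine_with_omega(subsets, "ref")
coef, *_ = np.linalg.lstsq(X, q, rcond=None)
print(dict(zip(["a_C", "B_1", "Omega", "B_0"], np.round(coef, 3))))
```

In this synthetic case the two organisms differ only by a constant offset, so the pooled fit reproduces the invented structural coefficients and gives a coefficient of 1 for $\omega$. With real data a good pooled fit would support the assumption that the same mechanism operates in each organism.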

Conclusion
The methods presented here provide a means of obtaining quantitative structure-property relationships of utility in environmental science and technology. They may be used to predict toxicity, bioaccumulation, aqueous solubility, partition coefficient, and other quantities of interest. The ABB and GN equations are particularly useful in that they do not require the use of parameter tables (except for cycloalkyl groups in the ABB equation) and are free of parameter error. Methods such as those presented here should eventually make possible reasonable estimates of environmental properties of interest for almost any chemical compound.