Neurobehavioral aspects of developmental toxicity testing.

Tests for detection of neurobehavioral changes in the offspring have been a regulatory requirement in developmental toxicity testing of drugs for almost 20 years. Keeping their purpose of hazard identification and risk assessment for humans in mind, investigators and agency reviewers have become deeply ingrained with some stereotyped behaviors with respect to such relevant issues as choice of animal species and data evaluation. Other problematic areas of study design and conduct, selection of litter representatives for testing, what methods to combine in a testing battery, and statistical treatment of results and their interpretation, will need more research and discussion in the future.


Introduction
The thalidomide tragedy in the early 1960s brought about a worldwide realization that drugs, pesticides, and other chemical substances have the potential to induce damage in the unborn child. This was followed by the introduction of new guidelines for preclinical testing requirements. Naturally, interest focused mainly on structural abnormalities, and testing strategies were devised that were expected to detect and characterize prenatal insults leading to gross morphological changes in the embryo and fetus. Although experience has shown that terata occur only rarely compared with other end points of developmental toxicity, such as effects on growth and viability, the attitude that malformations are all important still persists with many investigators and regulatory agency reviewers.
As early as 1963, a fourth area of concern, behavioral teratology, was introduced in a review by Werboff and Gottlieb (1) on the postnatal effects of prenatal X-irradiation and exposure to psychoactive drugs. However, regulatory action was not taken until 1975 when Great Britain and Japan incorporated requirements for developmental neurotoxicity testing into their respective guidelines for testing of medicinal products for reproductive toxicity. At this time, the relevance to humans of the potential of chemicals to induce damage to the central nervous system following prenatal exposure and exposure during childhood had become widely accepted based on the data that were available for organic mercury, lead, and alcohol. Although no validated methods were available, it was felt that early detection of a substance's potential for developmental neurotoxicity in animal experiments could prevent widespread exposure of pregnant women and thereby minimize or eliminate the risk for the growing and developing child. It was assumed that, for an unknown compound, where no clues about the possible localization of a potential lesion existed, functional tests might give greater sensitivity than histopathological and biochemical methods. The underlying biological mechanisms could then be elucidated by secondary studies from the functional changes observed in first-pass testing. Based on this rationale, testing of drugs for end points of developmental neurotoxicity commenced in the mid-1970s, but it was not until the early 1980s that behavioral testing batteries became established as routine tests in the pharmaceutical industry.
Since that time a large amount of data on tests and test combinations for medicinal products has accumulated in the archives of regulatory agencies and pharmaceutical companies, data that should be reexamined critically with the aim of identifying methods that may be recommended.

Developmental Neurotoxicity within the Framework of Regulatory Studies
It was clear from the beginning that, for therapeutic agents, neurobehavioral toxicity testing would have to be incorporated into existing study designs for the detection of any (adverse) effect on development. These would have to be adapted to allow the collection of information on functional changes, in addition to the data on viability, growth, and gross structural abnormalities of conceptus and offspring (2). Regulatory studies for developmental toxicity do not detect these different end points equally well. They can be considered fairly sensitive for effects on viability and general growth parameters (body weight); however, when the emphasis is placed on rare events like malformations, a study size of usually 10 to 20 pregnant animals per group will always be insufficient to pick up any but the strongest effects. Also, the different variables constituting an embryofetotoxic effect do not usually occur with an even distribution within and between litters of a dose group. Some litters may be free of any relevant findings, others may contain only one or a few affected fetuses or pups, or, alternatively, the whole litter may be abnormal. For nonfunctional end points these distributions can be determined without great difficulty. But what about effects on the functional integrity of the central nervous system? How can we be sure that the litter representatives chosen for testing (in most studies by random methods) do indeed carry an alteration? If individual distributions of functional changes are in any way similar to those existing for other end points of aberrant development, with only a selection of animals from each litter being tested, in routine studies we should expect to miss quite a few substances that affect CNS function.
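The concern about litter representatives can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that a fixed number of pups is drawn at random from a litter in which a known number of pups carry a functional alteration, and it computes the probability that none of the tested pups is affected. The function name and the litter size, number affected, and number tested are hypothetical values chosen for illustration, not data from the studies discussed here.

```python
from math import comb


def prob_no_affected_sampled(litter_size: int, affected: int, tested: int) -> float:
    """Probability that a random sample of `tested` pups contains no affected pup.

    Hypergeometric probability of drawing zero affected animals when sampling
    without replacement from a single litter.
    """
    if not 0 <= affected <= litter_size:
        raise ValueError("affected must lie between 0 and litter_size")
    if not 0 <= tested <= litter_size:
        raise ValueError("tested must lie between 0 and litter_size")
    # comb() returns 0 when more pups are tested than there are unaffected pups,
    # so the probability is then correctly 0.
    return comb(litter_size - affected, tested) / comb(litter_size, tested)


# Hypothetical litter: 12 pups, 2 of them affected, 2 selected at random for testing.
p_miss = prob_no_affected_sampled(litter_size=12, affected=2, tested=2)
print(f"Probability that no affected pup is tested: {p_miss:.3f}")  # prints 0.682
```

Under these assumed numbers, roughly two out of three such litters would contribute no affected animal to the behavioral tests, which is the situation the paragraph above warns about.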
Behavior may well be the most variable parameter of all the responses an organism can make to a developmental toxicant. This is logical because uncompromised animals will exhibit a wide range of complex, adaptive behaviors without having to compensate for substance-induced deficits, and relatively small changes in the environment, e.g., handling of animals, maternal influences, food restriction, and prior test experience, have been shown to produce alterations in normal behavior.

Developmental Neurotoxicity
Studies to detect developmental toxicity differ in several important aspects from tests in which toxicity is elicited in adults. The main difference is the impossibility of deriving untreated (control) as well as treated values from the same animals. Comparisons to determine whether there is a treatment effect always will be made between groups of animals that have different experiences related to exposure and, even in inbred animals, different genetic compositions. The developing nervous system may be more sensitive to toxic effects than that of the adult. Neurotoxic effects of a chemical may occur at lower doses than in adults, different functions may be affected, and substances that show no neurotoxic potential in adults may induce effects during development. Therefore, we may suspect a potential for developmental neurotoxicity if a compound induces neurotoxic changes in adults, but we have to be aware of the fact that compounds that do not affect the adult nervous system may very well do so in the developing organism. As testing is performed to detect the unexpected, it will be necessary to study end points for developmental neurotoxicity with all substances for which exposure of the embryo, the fetus, the newborn, the child, and the juvenile cannot be excluded definitely. Unlike malformations that can only be induced during a narrow time window in organogenesis, functional changes can be expected to occur during this differentiation phase and, additionally, for as long a period as the organ system needs to attain full functional competence. For detection of such effects, animals will have to be exposed during embryo-fetal development and postnatally through puberty. Observation, however, will have to continue for a longer time period, ideally to old age, to make sure that delayed manifestations are not overlooked. 
None of the study designs currently recommended by guidelines includes effects that may become apparent only in aging animals, e.g., premature onset of senescence or, with respect to CNS function, senility.

Animal Species
Rabbits, rats, and mice are the animal species primarily used in routine developmental toxicity testing; however, the potential of inducing neurobehavioral toxicity in the offspring is evaluated almost exclusively in rats (Table 1). This is due to the fact that regulatory agencies have accepted data from rats in cases in which this species proved to be an unsuitable animal model for the substance under study when they should have encouraged the use of another animal species. This practice reveals astonishing insights into how great an importance is attached to possible effects with postnatal manifestations in humans (including neurobehavioral findings) during the process of hazard identification and risk assessment. At present, we are making the world safer for rats. But how secure can we feel about the detection of hazards for the developing nervous system when this animal model is not even reasonably close to humans? Even if we do not yet know how changes in animal behavior may translate to the situation in humans, the least we can do is to use the most appropriate animal model available to us, i.e., that closest to humans with respect to metabolism, pharmacokinetics, pharmacodynamics, and physiology. If such a species cannot be found or used, we should consider conducting studies in more than one animal species. This has been standard procedure in the testing for structural abnormalities and still is, despite intellectual acknowledgement that one relevant species is better than two or more less suitable ones. If we want to increase the predictability of animal results for humans, it will be necessary to develop methods for species other than rats and to apply them in those cases where the rat is not a relevant model.

Methods
Certainly, none of the commonly used laboratory animal models can match the complexity of human behavior. For detection studies, the animal model, the testing situation, and the available methods provide the limiting factors and restrict investigators to analyzing basic neurological functions and simple behaviors. Even given these restrictions, there are more specific functions than we could hope to incorporate and test in a single, comprehensive study design. It may be considered advantageous that the first guidelines required testing of specific functions of the central nervous system but, for lack of experience, did not specify which tests were to be used. This has resulted in a diversification of methods and in a great variety of testing batteries that are in regular use today. It should be possible to identify sensitive and reliable tests with predictive value for the human situation. It is unlikely, however, that a single ideal combination of testing procedures could be defined, one that would cover all aspects of developmental neurotoxicity and that could be conducted at reasonable cost. Criteria for the selection of tests for a testing battery have been described (3). For detection of any (adverse) effect, preference is given to apical tests that require the integrated function of several subsystems. These may offer the best chance to discover whether the substance poses a hazard to development and function of the CNS based on the assumption that a change in any of the subsystems can lead to an alteration in behavioral output. On the other hand, with an increasing number of subsystems involved, the animal will have greater possibilities of compensating for deficits in one subsystem. The choice of methods should be aimed at having available a set of apical tests that are neither too complex nor too specific to incorporate into routine developmental toxicity studies and to supplement this battery with close observation of the animals.
If these give indications for changes in behavioral end points, other more sophisticated tests can be used to clarify and characterize the results obtained by the base set. Testing batteries normally combine measurements of growth and physical development (2-6) with tests for the development of sensory functions, reflexes, and body control, and protocols for detecting changes in locomotor activity, learning/memory, and social/reproductive behavior. As it will not be possible here to describe methods in detail, the reader is referred to several comprehensive reviews on testing procedures and their respective merits (3-8).
Environmental Health Perspectives, Vol 104, Supplement 2, April 1996

Reliability/Reproducibility
Regarding reliability of testing procedures, agency experience shows that most of the tests incorporated into testing batteries and retained by the investigators over the years can be considered standardized and validated with respect to intralaboratory reproducibility. Investigators do not tend to continue using tests that will give vastly different results from study to study, and it can be seen from the submitted reports that comparable values are found for control groups over time. Interlaboratory reproducibility has not been evaluated for all the tests used, but the results from the comparison of some of these methods in the study of the National Center for Toxicological Research (NCTR) have been encouraging (9,10).

Sensitivity
Detection sensitivity of behavioral measurements has been evaluated for the methods used in the NCTR Collaborative Behavioral Teratology Study, and it can be stated that variability of the measured parameters will allow detection of effects if they are large enough (approximately 10-20% change from control). However, it would appear that neither this testing battery, nor any other in current use, could be relied on to detect a developmental neurotoxicant among a series of unknown compounds. One of the presumed positive control substances for the Collaborative Behavioral Teratology Study, d-amphetamine, later gave negative findings consistently within and across the participating laboratories (9). So either the assumption of d-amphetamine being a strong developmental neurotoxicant was wrong or the test battery was not suited to detect the deficits the substance did induce (11).
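How the variability of a measured parameter limits the size of a detectable effect can be illustrated with a rough power calculation. The sketch below is not the method used in the NCTR study; it assumes a two-sided comparison of two group means with equal group sizes and equal variance, uses a normal approximation, and takes a hypothetical coefficient of variation of 15% with the litter as the unit of analysis. The function name and all numerical inputs are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_change(cv: float, n_per_group: int,
                          alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest relative difference between two group means detectable with the given power.

    Normal approximation for a two-sided, two-sample comparison with equal
    group sizes; `cv` is the coefficient of variation (SD divided by control mean).
    """
    if cv <= 0:
        raise ValueError("cv must be positive")
    if n_per_group < 2:
        raise ValueError("n_per_group must be at least 2")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return (z_alpha + z_power) * cv * sqrt(2 / n_per_group)


# Hypothetical variability (CV = 15%) at the group sizes typical of regulatory studies.
for n in (10, 20):
    change = min_detectable_change(cv=0.15, n_per_group=n)
    print(f"n = {n} litters per group: about {100 * change:.0f}% change detectable")
```

With these assumed inputs the detectable change comes out at roughly 19% for 10 litters and 13% for 20 litters per group. That is in the same general range as the figure quoted above, but the agreement depends entirely on the assumed variability.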
For a broader comparison of methods not only for detection of effects but also for characterization, a European collaborative study group was initiated. During this study each participating group applied the methods used in their laboratory to the task of detecting neurobehavioral changes induced by a known positive. The outcome of this investigation shows that it is not necessary to work with a standardized set of methods to detect adverse effects on the behavior of offspring (12,13).

Comprehensiveness
Selection of a comprehensive testing battery is a crucial point, especially as it will not become apparent until much later whether the aim has been achieved. Although no consensus for recommending specific tests is in sight, there seems to be some general agreement on the functions that should be tested, namely, sensory systems, reflexes, neuromotor development, locomotion/activity, reactivity/habituation, learning/memory, and social/reproductive behavior. To integrate behavioral data into the context of other manifestations of developmental toxicity, data on physical development of the offspring have to be available. These commonly include data on body weights and postnatal weight gain, viability, physical landmark development and maturation. It has also been recommended to maintain records of organ weights, especially brain, functional observation battery results, neuropathologic examinations (14), and brain biochemistry (15,16).

Predictability
Little can be said about whether the tests in current use predict that similar (or different) effects on CNS development and function would be elicited in humans. They are able to identify known human developmental neurotoxicants, but it can be argued that this is due to selection bias and to the fact that we already know what to look for with these substances. Predictability could be evaluated by using data on new therapeutic agents, but lack of human data effectively prevents this.

Computerized Procedures versus Human Observers
For a novel, unknown compound, detection of an effect will depend to a large extent on the observational skill and the knowledge of the investigators, who can do what a standardized test is unable to accomplish; that is, they can pick up unexpected effects by observation and verify them by specifically designed procedures. Most tests yield not only variables that can be measured exactly but also give rise to findings for which measurement is difficult or impossible and that will have to be observed and described.
A simple water maze, for example, which is part of many routine testing batteries, will be used to collect data on learning ability and memory. The parameters recorded routinely are whether the animal is successful within the time limit, the number of errors made, and the time needed to escape from the maze. Experience shows that most (all?) animals will learn the route that takes them to the exit easily once they have managed to discover (or have been shown) where it is situated. Probably this is not a very sensitive test for the detection of subtle differences in learning/memory functions, as the performance of rats is quite variable even in control groups, and the demands on the central nervous system of this simple task do not seem to be high enough to bring out clear effects on learning ability when brain damage is slight. In addition, the way the test is applied and evaluated, often only as a measure of learning, does not make use of its full potential. The first trial, in which the naive animal has no clue about the location of the exit, more often than not is treated as a training run, and, therefore, is not considered for further analysis of (learning) behavior. In a study report, the reviewer will be told how many animals failed to reach the exit in time, but the reasons why they failed to do so are never described. If this were done, we could gain insight into problem-solving abilities and strategies that might be more sensitive to chemical insults than simple learning tasks; this also may be more relevant for extrapolation to humans and for risk assessment.
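One way to keep the information that is usually lost is to store a descriptive note alongside the routinely measured parameters and to retain the first trial in the data set. The record layout below is a minimal sketch under that assumption; the class, field names, animal identifiers, and example values are hypothetical and do not correspond to any established reporting format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MazeTrial:
    animal_id: str
    trial_number: int               # 1 = first exposure of the naive animal
    escaped_in_time: bool           # success within the time limit
    errors: int                     # number of errors made
    escape_time_s: Optional[float]  # None if the animal did not escape
    observer_note: str = ""         # free-text description, e.g. why the animal failed


def failures_with_reasons(trials: List[MazeTrial]) -> List[Tuple[str, int, str]]:
    """List every failed trial, including trial 1, with the observer's description."""
    return [
        (t.animal_id, t.trial_number, t.observer_note or "not described")
        for t in trials
        if not t.escaped_in_time
    ]


# Hypothetical records for two animals.
trials = [
    MazeTrial("R01", 1, False, 7, None, "circled in start arm, did not explore"),
    MazeTrial("R01", 2, True, 2, 41.5),
    MazeTrial("R02", 1, False, 5, None),
]
for row in failures_with_reasons(trials):
    print(row)
```

A failed trial without a note shows up as "not described", which makes gaps in the descriptive record visible to the reviewer.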
Here human observers have definite advantages over automated systems. They are able to recognize behavioral changes in the subjects that have not been anticipated and are therefore not covered by the recording procedure of the program. On the other hand, humans are at a severe disadvantage when they are asked to carry out robotic functions, such as observing large numbers of animals in a specific test for hours and recording behavioral parameters. Human operators become bored or tired and their attention wanders unless it is triggered by something unusual. To design tests that can be employed safely in the detection of neurobehavioral toxicity, it is necessary to understand these limitations and to use both human observers and automated tests for the purposes they can serve best: humans to spot any uncommon and unpredicted response and computers for counting and recording tasks that can be anticipated and programmed.

What Have We Learned from Over 10 Years of Testing Therapeutic Agents?
In the overview that follows, we have followed the interpretation of the investigators who conducted the studies in categorizing findings as positive or negative. It must be [...] for species selection in current guidelines [...] 5) and not only with those compounds that are known to be centrally acting. In 24 of all the substances tested, behavioral changes were found either to be the only adverse effects that could be detected at any dose, or they occurred at the LOAEL together with other signs of developmental toxicity. Seven of these 24 compounds were antibiotic drugs. Since the effects were not expected, this shows the necessity of conducting developmental neurotoxicity tests for all substances to which the developing human will be exposed.
Almost all behavioral testing batteries contain one or more tests to measure activity. From experiences with the testing of new drugs, these seem to be very sensitive in picking up effects at low doses, maybe overly so, but for a detection study this would not be considered a disadvantage. Other tests and parameters that showed significant changes at low doses are active and passive avoidance learning and center latency in the open field test (Table 6).
Often effects are detected only in one sex (Table 7). Whether this is due to a true sex-specific action of the compound cannot be decided, as studies for secondary characterization are usually performed only if malformations are encountered in the routine studies, not for a suspected

Table 2. Fertility and general reproduction studies.
Pregnancy duration shortened 3 0
Data represent the number of positive studies, with the total number of studies shown in parentheses.

Table 6. Behavioral tests giving positive results in detection studies for therapeutic agents.