Statistical computing in the United States.

Recent history and developments related to the increase in statistical computing activities in the United States and by U.S. participants in international efforts are reviewed, with emphasis on important events, organizations, references, and products which contribute to informed selection and use of statistical programs. Three features matrices for major statistical packages are included as potential aids to Japanese statisticians in assessing the utility of these packages in biostatistical applications.


Introduction
Computers and statistical computing play an increasingly important role in statistical analysis of data in many areas of application, and biostatistics is no exception. The computer is a primary tool for most modern statisticians, for reasons of convenience, necessity, or both. It is convenient as it shortens the time and effort required for computations, enabling almost concurrent progress of numeric and analytic processes in the statistical evaluation of data, thus enhancing the interaction between data and ideas. It is necessary, as it permits computations of a scope or complexity that was often considered but not routinely feasible only 10 to 20 years ago. The availability of computer programs or packages for statistical analyses establishes to some extent de facto standards for the choice and application of statistical methodology to the analysis of data.
In recent years the statistical profession has become increasingly aware of the importance of these developments and of the need to monitor, guide, and evaluate them. Appropriate and adequate mechanisms to meet this need are evolving even as statistical computing itself continues to develop, and there are signs that the gap between ongoing development and critical evaluation is at least growing more slowly, if not actually narrowing. Major efforts are now underway, both in individual countries and internationally, toward reducing this gap. This presentation is (1) a summary of recent history and developments in statistical computing primarily in the U.S., with special emphasis on availability and evaluation of statistical program packages, and (2) a guide to current, important references and resources which may assist a prospective user of statistical programs in becoming aware of activities and reports relevant to a more informed selection and use. Although references and resources to be mentioned are of quite general interest in terms of areas of statistical application, those included are usually relevant to the broad interests of biostatisticians if not also to their sometimes specialized interests. Some references are excluded from this presentation either somewhat arbitrarily or as perhaps too distant from biostatistical applications.

*Office of Biometry and Epidemiology, National Eye Institute, National Institutes of Health, Bethesda, Maryland 20205, U.S.

A View of Recent History
This brief review of the history and development of statistical computing is selective and meant to be informative rather than exhaustive.
By the early 1960's, the development and availability of computer hardware and software had encouraged individuals and groups to produce general statistical programs and packages for routine use in data analysis. Some of the pioneers in this effort are still prominent contributors to statistical computing, through continuing dedication, perseverance and skill. Professional recognition and motivation for these activities, necessary for their further development, have often been erratic and inadequate, with improvement generally slow to occur but increasing in the present decade.
Three major events in the late 1960's marked the beginnings of today's organized professional interest in statistical computing by statisticians and their societies. The first Computer Science and Statistics: Symposium on the Interface was held in 1967, beginning a series of annual symposia which have gained wide participation by statisticians and sponsorship by statistical societies. The purpose of the symposia has been described as: to facilitate the design and improvement of computing systems and software by use of statistics; to facilitate the implementation of statistical methods by making the most efficient use of computation machinery; to enhance the combined use of statistics and computation as adjuncts to the basic and applied sciences. Mann (1) provides a detailed history of the symposia.

October 1979
Later in 1967, at the meeting of the International Statistical Institute (ISI) in Sydney, presentations and discussions took place which led to the organization of the Conference on Statistical Computation, held at the University of Wisconsin in 1969, to ". . . present and evaluate the current status of some basic aspects of the organization of statistical data processing and computing, and suggest directions for future research and development" (2). The problems posed at that meeting (duplication of effort, communication between statistical programs, definition and specification of data structures, and statistical processing languages) continue among those being studied today.
The third event was the recognition by the American Statistical Association of the role of computers in statistics, by organizing sessions on the topic beginning with the 1967 annual meeting, followed by creation of an ad hoc Committee on Computers in Statistics.
These rather modest events were important because they provided the basis for more substantive developments in the 1970's. The Symposia on the Interface and the ASA sessions grew to provide an effective forum for presentation and recognition of work by professional statisticians in statistical computing. The ad hoc ASA committee was succeeded in 1972 by the Section on Statistical Computing, which has become increasingly active in pursuing the areas outlined in its charter (3), as follows: The principal areas of interest of the Section shall be (1) to encourage the application of computer hardware, software and systems to statistical problems, (2) to encourage the application of statistical techniques to the design, maintenance and evaluation of computer hardware, software and systems, (3) to encourage the joint application of statistical techniques and computer technology to problems in other fields, and (4) to serve as the focal point for computer-oriented activities within the Association and for cooperation with computer-oriented organizations.

Functions
The Section will perform such functions as will support the areas of interest specified above. These will include: (1) sponsorship, or joint sponsorship with other organizations, of meetings, seminars or courses which involve statistics and computers; (2) planning, in cooperation with the General Program Committee, sessions on statistical computing and statistical computer science at annual or regional meetings of the Association; (3) sponsorship, or joint sponsorship with other organizations, of documentation of computer programs and algorithms of special interest to statisticians, of development of manuals and booklets and of publication of bibliographies, etc.
The Proceedings of the Section have been published since 1975. The Section established an ad hoc Committee on Evaluation of Statistical Program Packages, to initiate, carry out, and promote evaluation activities. In 1976 this committee was succeeded by the standing Committee on Statistical Software, with broader scope in promotion and guidance of related activities and less involvement in particular evaluations.
The interest in international cooperation, as shown at the 1967 meeting of the ISI and fostered by the 1969 Wisconsin conference, has grown through the persistent efforts of many individuals, with especially vital participation by some who have been involved from the start in 1967. These efforts led to the establishment in 1977 of the International Association for Statistical Computing (IASC) as a new section of the ISI.
Highlighting of these activities is not intended to diminish the roles and importance of other efforts to promote and develop statistical computing, both in the U.S. and in other countries. Many individuals have been prominent and effective in more than one of these and other activities, contributing to increased communication and cooperation. This interaction extends to professional societies and meetings, especially through cooperative efforts in organizing sessions of mutual or complementary interest.

Influence on Development of Statistical Software
Statistical computing may be viewed as consisting of the two parallel but overlapping areas of theory and application. Theory includes development of algorithms, evaluation and improvement of accuracy and efficiency, critical comparison of techniques, and statistical and mathematical computations in development of statistical theory. Products of theory are sometimes single-purpose, stand-alone computer programs not intended for general use or distribution but which eventually receive limited use beyond the developer's environment. Other products include reports in the literature and presentations at scientific meetings. Application includes the assembling, packaging, documentation, distribution, and maintenance of a statistical program, group of programs, or system. The major product of application is the available (distributed) software. These areas are clearly related, and each profits from the work of the other. They also have nonoverlapping, distinct interests.

Environmental Health Perspectives
The recent history of statistical computing, in which it has received increased recognition and appreciation as a professional activity, has encouraged growth and production in theory and application, both separately and as they interact. Results from theory are more readily known and available to application, and needs of application receive acknowledged research from theory. The end result is the opportunity for continued, meaningful growth in the quality and availability of statistical software.

Resources in Evaluating Statistical Software
Guidance for a prospective user in assessing the availability and quality of statistical software may be found in an increasing number of resources. Especially within the last five years, the improved climate for professional activities in statistical computing has led to progress in providing adequate answers to such questions as: What statistical programs (or systems) are available? Does the program do what I need, with verified accuracy? Can the program be used on my computer? Is the program adequately documented and easy to use? Is there help available when difficulties arise in the use of the program, especially if (when) errors are found?
A framework within which these and other questions can be effectively organized and addressed in the evaluation and improvement of statistical software was recently suggested by Francis (4), as he examined the interrelationships among the elements of the framework and summarized important results and developments in each area.
A separate volume of the publication of the International Statistical Institute, 1979, contains information on capabilities and use of 46 major statistical computing packages, developed from a comprehensive questionnaire and poster session coordinated by Ivor Francis for the IASC program of the 41st Session of the ISI, New Delhi (6). The summary matrix of program capabilities and features from that volume, together with an abbreviated description, is included here (Appendix I) with the permission of Professor Francis. It is planned that this information, together with other information to be collected by the IASC (7), will be provided in the future through an international information bank and information exchange on statistical computing software under Project TIESS (Technical Information Exchange on Statistical Software) of the IASC.
An earlier survey (8) of 56 publicly available statistical program packages provided a 119-page index containing a cross-reference listing of packages and general capabilities, a listing of developers' answers to selected questions concerning the design aspects of their packages, and abstracts of the packages written by the developers (all available on microfiche). Matrices of general capabilities and selected questionnaire replies are included in Appendix II, with permission of the authors.
A third, recent comparison of nine major statistical packages by a features matrix approach is based on user evaluation of developers' documentation and manuals, rather than direct response from developers to a questionnaire (9). This features matrix is included in Appendix III, with permission of the authors.
These summaries of program features are a useful resource to prospective users as a first step in evaluation of statistical program packages. Further developments in this area are anticipated and desirable. Evaluation criteria for statistical program packages are being developed and proposed (10,11).
Critical review and comparison of programs, especially in terms of appropriate implementation of statistical methodology, accuracy, efficiency, and documentation, is perhaps more important and useful to a prospective (or current) user. The present, enlightened attitude of professional statisticians and societies has encouraged activity of this kind. Many recent papers of interest in this area are found in the Proceedings of the Section on Statistical Computing of the ASA and the Proceedings of Computer Science and Statistics: Annual Symposia on the Interface, and most are included in Francis' bibliographies (4,5). Berk and Francis (12) and Muller (13) provide comprehensive, critical reviews of user manuals for two widely used statistical systems, BMDP (14) and SPSS (15).
Comparisons and critical reviews are needed, but they are produced in a dynamic environment and must be carefully examined for relevance to the currently available versions of programs, packages and documentation. Improvements are often concurrent with evaluation. For example, the extensive reviews of the SPSS and BMDP user guides appeared just as the new BMDP-77 manual became available (16).

Statistical Computing, Cancer Research, and Japan
Applications of biostatistics in cancer research will find useful features in many different statistical computing packages or systems. Few packages have been developed with special attention to the needs of statistical analysis of biological or medical data, a notable exception being the BMD-BMDP series of programs (14,16,17). The BMD programs were pioneers in the packaging, documenting, distribution, and maintenance of statistical systems, and their special emphasis on and development of biomedical applications receive continued support from the National Institutes of Health. Inclusion in BMDP-77 of a program for log-linear analysis of contingency tables, and plans for early release of a program for stepwise multiple logistic analysis, reflect an ongoing effort to continue the special emphasis. Nevertheless, statistical computing in cancer research will undoubtedly find useful tools of different kinds in many software packages. Also, some of the needed tools will only be found in stand-alone, special-purpose programs, or will not be found at all and will have to be created ad hoc. For this latter purpose, the International Mathematical and Statistical Libraries (IMSL) subprogram library offers considerable resources (18).
Some of the major U.S. statistical computing systems are not available in Japan, and other system developers are interested in expanding their distribution to Japan. One system, Omnitab-78, claims to be multilingual to the extent that it could be provided in a Japanese version using the English alphabet (19). Portability of software may be an especially important question in transporting a U.S. system to a Japanese computer.
As a further guide to statistical computing in the U.S. with possible special relevance to Japan, the developers of the major U.S. systems included by Kohm, Ryan, and Velleman in their index of statistical software (8) were asked to briefly describe their present Japanese distribution (if any) and their interest in future distribution in Japan. With their reply they also sent current documentation for their systems, e.g., user guides or manuals, and primary journal articles which describe their systems. These materials, together with the Proceedings of the Section on Statistical Computing of the ASA for 1976 and 1977, the Proceedings of Computer Science and Statistics: Annual Symposia on the Interface for the 8th, 9th, 10th and 11th Symposia, 1975-1978, and Francis' bibliography (5) are provided for repository with a Japanese institution as an up-to-date information shelf on statistical computing in the U.S.

Conclusion
Recent activities in statistical computing, both in the United States and through U.S. participants in international efforts, point to increasing progress, through statistical societies, meetings, and publications, toward research and products which benefit the users of computers in statistics. Users of both established statistical packages and of less widely publicized stand-alone programs can gain from the increased focusing of interest in more serious and visible work in standards, evaluations, critical comparisons, new methods, portability, distribution, and documentation.
Current directions suggested by recent meetings of societies and committees, and by their leaders and members, are ongoing work in evaluations and standards, with growing emphasis on interactive computing and on management of statistical data bases.

Appendix I. Features Matrix by Francis.
This matrix displays responses by package developers to questions in four areas: capabilities, portability, ease of learning and using, and reliability. The complete forms of the abbreviated items shown in the matrix are found in the reference. Numeric responses are according to a 0-3 rating scheme: (0) no facilities in this area, or not intended as a goal; (1) a few functions in this area are present, or a minor goal or byproduct; (2) moderate capabilities, or a significant goal; (3) complete coverage of all aspects of this area, or one of the principal goals.
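As an illustration of how a reader might work with a 0-3 features matrix of this kind, the following minimal Python sketch shortlists packages that meet a minimum rating in each area of interest. The package names, the dictionary keys, and every rating value are hypothetical placeholders invented for the example; none of them are taken from the published matrix.

```python
from typing import Dict, List

# Meaning of each numeric response, paraphrased from the 0-3 rating scheme.
RATING_LABELS: Dict[int, str] = {
    0: "no facilities, or not intended as a goal",
    1: "a few functions, or a minor goal or byproduct",
    2: "moderate capabilities, or a significant goal",
    3: "complete coverage, or one of the principal goals",
}

# Hypothetical packages with made-up ratings, for illustration only.
# These are NOT values from the published features matrix.
FEATURES: Dict[str, Dict[str, int]] = {
    "PACKAGE_A": {"capabilities": 3, "portability": 1,
                  "ease_of_learning_and_using": 2, "reliability": 2},
    "PACKAGE_B": {"capabilities": 2, "portability": 3,
                  "ease_of_learning_and_using": 1, "reliability": 3},
    "PACKAGE_C": {"capabilities": 1, "portability": 2,
                  "ease_of_learning_and_using": 3, "reliability": 0},
}


def shortlist(matrix: Dict[str, Dict[str, int]],
              requirements: Dict[str, int]) -> List[str]:
    """Return, in sorted order, the packages meeting every minimum rating.

    An area missing from a package's row is treated as rating 0.
    """
    selected = []
    for name, ratings in matrix.items():
        if all(ratings.get(area, 0) >= minimum
               for area, minimum in requirements.items()):
            selected.append(name)
    return sorted(selected)


if __name__ == "__main__":
    needs = {"capabilities": 2, "reliability": 2}
    for name in shortlist(FEATURES, needs):
        summary = {area: RATING_LABELS[FEATURES[name][area]] for area in needs}
        print(name, summary)
```

Because the ratings are self-reported by developers, a shortlist of this kind is best treated as a first screening step rather than as an evaluation in itself.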
The codes in the "General Capabilities Listing" were selected by the program developers in evaluation of the capabilities of the programs listed:

C Capability
The program or package has sufficient capabilities in this area to be considered as a feature.

L Limited
The program or package has some capabilities in this area, but they should be considered as limited.

D Documented
The feature can be easily accomplished using the documentation supplied with the program, but is not a standard ("built-in") option.
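The three letter codes can be handled in the same spirit. The sketch below, again in Python, looks up which packages carry an acceptable code for a given capability area. The package names, area names, and code assignments are hypothetical, and treating a missing entry as "no code recorded" is an assumption of this sketch rather than something the listing states.

```python
from typing import Dict, List, Tuple

# The three codes defined in the listing.
CODE_MEANINGS: Dict[str, str] = {
    "C": "Capability",
    "L": "Limited",
    "D": "Documented",
}

# Hypothetical listing: package -> {capability area -> code}.
# Names and codes are illustrative only, not taken from the published index.
LISTING: Dict[str, Dict[str, str]] = {
    "PACKAGE_A": {"regression": "C", "time_series": "L"},
    "PACKAGE_B": {"regression": "D"},
    "PACKAGE_C": {"regression": "L", "time_series": "C"},
}


def describe(code: str) -> str:
    """Translate a code letter; unknown or missing codes are reported as such."""
    return CODE_MEANINGS.get(code, "no code recorded")


def packages_with(listing: Dict[str, Dict[str, str]], area: str,
                  accepted: Tuple[str, ...] = ("C",)) -> List[str]:
    """Return sorted package names whose code for `area` is in `accepted`."""
    return sorted(name for name, codes in listing.items()
                  if codes.get(area) in accepted)


if __name__ == "__main__":
    for name in packages_with(LISTING, "time_series", accepted=("C", "L")):
        print(name, describe(LISTING[name]["time_series"]))
```

Accepting only "C" gives the strictest reading of the listing; adding "L" or "D" widens the search to packages that cover the area partially or only through documented workarounds.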