UniLogistic: A SAS Macro for Descriptive and Univariable Logistic Regression Analyses

Descriptive and univariable logistic regression analyses are essential before constructing multivariable models, but are very time-consuming, particularly if a large number of explanatory variables are to be evaluated. A macro UniLogistic is described in this paper that conducts descriptive and univariable logistic regression analyses (binomial, ordinal or nominal, as required) in SAS and presents results in formatted tables in Excel and graphics in PDF files. Implementation of the macro is illustrated in this paper using example datasets from statistics and epidemiology textbooks.


Introduction
Descriptive and univariable analyses are recommended before building multivariable logistic regression models because they provide information about distributions of variables and crude associations of explanatory variables with the outcome (Vittinghoff, Glidden, Shiboski, and McCulloch 2005). These analyses also enable identification of data entry errors, missing values and outliers, and thus aid in cleaning the dataset before building multivariable models. The importance of these analyses cannot be overemphasized, but they are often ignored or rushed through because of the considerable time required to conduct them (Dohoo, Martin, and Stryhn 2004).
A SAS macro UniLogistic is described in this paper that conducts almost all the descriptive and univariable logistic regression analyses (binomial, ordinal or multinomial) required to make informed decisions before building multivariable models. The macro uses the SAS statistical program (SAS Institute Inc. 2003a) for conducting analyses but creates preformatted tables of results in Excel (Microsoft Corporation 2003) that can be easily adapted for journal articles or research reports. Graphical summaries are saved in portable document files (PDF, Adobe Systems Incorporated 2010) in the directory specified by the user. All these results are produced very quickly, but more importantly, the code for implementing this macro is very straightforward and consists of just four arguments in its simplest form.
The macro works in the Microsoft Windows environment and requires SAS 9.1 or above (at least the SAS/BASE and SAS/STAT components, SAS Institute Inc. 2003c), Acrobat Reader (Adobe Systems Incorporated 2010) and Excel (Microsoft Corporation 2003). Note that the PDF files are created automatically by the macro, so the user does not need to have Acrobat or any other PDF creator installed on the computer.
In this paper, I will first describe the macro (its arguments, implementation and the output it produces) and then present examples demonstrating use of the macro for conducting binomial, ordinal and multinomial logistic regression analyses.

The macro arguments
The UniLogistic macro is included with this article in the file UniLogistic.sas. The arguments taken by the macro are summarized in Tables 1 and 2. Only the first four arguments are essential for conducting binomial logistic regression analyses (Table 1), i.e., the dataset name (DSN), the name of the outcome variable (OUTCOME), and the names of the categorical (CATVAR) and continuous (CONTVAR) explanatory variables. For conducting ordinal and multinomial logistic regression analyses, the user will need to specify an additional MODELTYPE argument that takes values ordinal and multinomial, respectively, for fitting cumulative and generalized logit models. All the remaining arguments are optional or have default values.
The XLFILE and DIR arguments take the names of the Excel file and the directory, respectively, in which to store the results. If they are not specified, the macro will by default save the file as UniLogistic Results in the C:\Temp directory. Although optional, it is preferable to specify the XLFILE and DIR arguments to avoid accidentally overwriting a file previously saved with the same name in the same directory.
See Table 2 for details of other arguments required to customize the analyses or the output.

Implementation and the output
The user calls the UniLogistic macro by inputting the arguments discussed above (Tables 1 and 2). Before conducting the actual analyses, the macro performs some error checks to discover potential problems. If an error is detected (say, if the user has not specified the dataset name), the macro will cancel execution and print an error message in the SAS log. If all specifications are in order, the macro connects to Excel (Microsoft Corporation 2003) using the method of Roper (2000), creates and saves an Excel file with the desired number of sheets to store the results, and turns off Excel error messages to ensure trouble-free access from SAS (SAS Institute Inc. 2003a).
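A check of this kind could be sketched as follows. This is a minimal illustration and not the macro's actual code: the helper name CheckArgs, the message texts and the use of %RETURN to stop execution are assumptions made for the example.

```sas
/* Hypothetical helper illustrating pre-analysis argument checks. */
/* Names and messages are assumptions, not the macro's own code.  */
%macro CheckArgs(dsn=, outcome=);
    /* Stop if the dataset name was left blank */
    %if %length(&dsn) = 0 %then %do;
        %put ERROR: The DSN argument (dataset name) was not specified.;
        %return;
    %end;
    /* Stop if the named dataset cannot be found */
    %if %sysfunc(exist(&dsn)) = 0 %then %do;
        %put ERROR: Dataset &dsn was not found.;
        %return;
    %end;
    /* Stop if the outcome variable was left blank */
    %if %length(&outcome) = 0 %then %do;
        %put ERROR: The OUTCOME argument was not specified.;
        %return;
    %end;
    %put NOTE: Argument checks passed.;
%mend CheckArgs;

/* Example: the blank DSN triggers the first error message */
%CheckArgs(dsn=, outcome=casecont);
```

Checking the arguments before any procedure runs means a mistyped call fails quickly with a readable message in the log instead of a cascade of procedure errors.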
After these preliminary actions, the macro conducts descriptive and univariable logistic regression analyses, saves the desired results in SAS datasets and then exports the saved results to Excel and PDF files. Before quitting, the macro deletes all the datasets created during the implementation to avoid cluttering, and to prevent errors in case the macro is implemented again.
First, we will discuss the procedure followed by the macro to conduct analyses and to create tables, and then the approach used to create and save graphics.

Create datasets
A nested macro UniBasic is defined within the UniLogistic macro to conduct logistic regression analyses using the SAS Logistic procedure (SAS Institute Inc. 2003c). The UniBasic macro is implemented for each categorical and quantitative variable individually by scanning the list of variables specified by the user. The results are saved in six SAS datasets for each variable using the Output Delivery System (ODS) and include the likelihood-ratio χ² results, parameter estimates, odds ratios, goodness-of-fit results, contingency tables for categorical variables, summary statistics for quantitative variables, and the information for convergence. This nested macro also creates a dataset to store the number of missing observations for each variable using the SAS structured query language (SQL) procedure.
Datasets are also created for each pair of variables to evaluate associations between all possible combinations of explanatory variables using the FREQ procedure in SAS (SAS Institute Inc. 2003b). These datasets are used to create a table of associations between all explanatory variables to evaluate collinearity (Spearman rank correlation coefficient, χ², p value or other parameters as requested). Additionally, Pearson correlation coefficients are calculated for each combination of quantitative variables using the CORR procedure in SAS (SAS Institute Inc. 2003b).
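The scanning approach can be illustrated with a short sketch. It is not the UniBasic code itself: the macro name LoopCatVars, the output dataset names, the library name mylib and the choice of ODS tables are assumptions, and only categorical variables are handled here.

```sas
/* Hypothetical sketch: loop over a space-separated variable list. */
%macro LoopCatVars(dsn=, outcome=, catvar=);
    %local i var;
    %let i = 1;
    %let var = %scan(&catvar, &i, %str( ));
    %do %while (%length(&var) > 0);
        /* Save selected PROC LOGISTIC tables as datasets through ODS */
        ods output GlobalTests        = lr_&i
                   ParameterEstimates = pe_&i
                   CLoddsPL           = or_&i
                   ConvergenceStatus  = cs_&i;
        proc logistic data=&dsn descending;
            class &var / param=ref ref=first;
            model &outcome = &var / clodds=pl;
        run;

        /* Count missing values for the current variable */
        proc sql;
            create table miss_&i as
            select "&var" as variable length=32,
                   sum(missing(&var)) as n_missing
            from &dsn;
        quit;

        /* Move on to the next name in the list */
        %let i = %eval(&i + 1);
        %let var = %scan(&catvar, &i, %str( ));
    %end;
%mend LoopCatVars;

/* Assumes a library mylib that contains the LOWBWT data */
%LoopCatVars(dsn=mylib.lowbwt, outcome=wtcat, catvar=race smoke ht);
```

Suffixing each dataset with the loop counter keeps the per-variable results separate, so they can be stacked later.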
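For a single pair of variables, the two procedures might be called as in the sketch below. The library name mylib, the output dataset names and the second quantitative variable lwt are assumptions made for illustration; the macro repeats such calls for every pair.

```sas
/* Association between one pair of categorical variables:        */
/* CHISQ gives the chi-square tests, MEASURES gives the Spearman */
/* correlation among other measures of association.              */
ods output ChiSq = chisq_smoke_ht Measures = meas_smoke_ht;
proc freq data=mylib.lowbwt;
    tables smoke*ht / chisq measures;
run;

/* Pearson correlations among quantitative variables             */
/* (lwt is an assumed variable name, used only for illustration) */
ods output PearsonCorr = pearson_all;
proc corr data=mylib.lowbwt pearson;
    var age lwt;
run;
```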

Combine datasets
The datasets containing the likelihood-ratio χ² results, the goodness-of-fit results and the convergence information for all explanatory variables are concatenated and then merged (using a set of SQL procedures) to create a combined dataset containing all these results. Similar combined datasets are created for (1) parameter estimates and odds ratio results, (2) contingency tables for categorical variables, (3) summary statistics for quantitative variables, (4) association results for all pairs of variables, and (5) correlation results for all pairs of quantitative variables. See the macro code available on the journal website for the details.

Export datasets to Excel
Selected results from the combined datasets are then exported to the Excel file using the Dynamic Data Exchange (DDE) facility in SAS (Vyverman 2001, 2002). This facility enables customized export of data from SAS to a specified cell (or a range of cells) of an Excel file. The macro uses this facility not only to export data but also to export titles, headers and notes/messages for the user. To ensure that the results are exported to the desired location, the macro counts the number of observations in each combined dataset using the SQL procedure, prior to exporting results.
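The DDE export step can be illustrated with a minimal sketch. It assumes that Excel is already running on a Windows machine with a workbook open whose first sheet is named Sheet1; the stand-in dataset sashelp.class, the fileref xlout and the target cell range are choices made for this illustration and are not taken from the macro.

```sas
/* Count the observations so that the target range can be sized */
proc sql noprint;
    select count(*) into :nobs from sashelp.class;
quit;
%let nobs = &nobs;                /* strip leading blanks */
%let lastrow = %eval(&nobs + 2);  /* data start in row 3  */

/* Point a DDE fileref at rows 3 to &lastrow, columns 1 to 3 */
filename xlout dde "excel|Sheet1!r3c1:r&lastrow.c3" notab;

/* Write one row per observation, tab-delimited ('09'x) */
data _null_;
    set sashelp.class;
    file xlout;
    put name '09'x sex '09'x age;
run;

filename xlout clear;
```

Sizing the range from the row count is what lets titles and notes be placed directly above or below a table whose length is not known in advance.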

Format results in Excel
The results exported to Excel are then formatted to make them reader-friendly and to ensure that only minimal effort is required to reformat the tables for publication. This involves changing the font for headers to 12-point bold Arial, applying borders at the top and the bottom of the table, rounding most numbers to two and the p values to three decimal places, and setting the column widths to fit their contents. Although rounding of numbers can be accomplished directly in SAS, performing this operation in Excel allows the user to change the formatting at a later stage.
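Formatting of this kind is typically driven by sending Excel 4.0 macro commands through the DDE system channel. The sketch below is an assumption about how that might look and is not the macro's own code: it presumes Excel is running with the results sheet active, and the cell ranges are arbitrary.

```sas
/* Open the DDE system channel to send commands to Excel */
filename xlcmd dde 'excel|system';

data _null_;
    file xlcmd;
    /* Header row: Arial, 12 point, bold */
    put '[select("r1c1:r1c5")]';
    put '[format.font("Arial",12,true)]';
    /* Body of the table: display two decimal places */
    put '[select("r2c2:r20c4")]';
    put '[format.number("0.00")]';
    /* p value column: display three decimal places */
    put '[select("r2c5:r20c5")]';
    put '[format.number("0.000")]';
    /* Fit columns 1 to 5 to their contents (type 3 = best fit) */
    put '[column.width(0,"c1:c5",false,3)]';
run;

filename xlcmd clear;
```

Because format.number changes only the display format, the unrounded values remain in the cells, which is what allows the formatting to be changed later.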

Customize analyses
For calculation of parameter estimates, the macro by default uses the first ordered category of a categorical variable as the reference category. However, the last ordered category can be used as the reference category by specifying ref = last as an additional argument (Table 2).
The UniLogistic macro models the probabilities for the highest outcome category by default. This is the same as explicitly specifying the descending option in the SAS PROC LOGISTIC statement. For example, for a binary outcome variable coded 1 (event) and 0 (non-event), the macro models the probability of 1 (event). This also means that for an outcome variable with levels 1 and 2, the macro will model the probability of 2. For an ordinal outcome variable, the macro models the probability of a higher score, and for a multinomial outcome variable, the macro uses the first category as the reference. However, the user can change these orders by specifying the order = ascending option in the macro implementation code (see Table 2).
Similarly, by default, only the Spearman rank correlation coefficient estimate, χ² and p value of the association between each pair are exported, but these default settings can be changed using the STATISTIC and CHISTATISTIC options described in Table 2.
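In plain PROC LOGISTIC terms, these defaults and their alternatives correspond roughly to the two calls sketched below. The library name mylib and the categorical variable dbarn are assumptions used only for illustration; casecont is the binary outcome from the NOCARDIA example.

```sas
/* Default behaviour described above: model the highest outcome   */
/* level (DESCENDING) and use the first ordered category of the   */
/* explanatory variable as the reference.                         */
proc logistic data=mylib.nocardia descending;
    class dbarn (ref=first) / param=ref;
    model casecont = dbarn;
run;

/* Counterpart of ref = last and order = ascending: model the     */
/* lowest outcome level and use the last category as reference.   */
proc logistic data=mylib.nocardia;
    class dbarn (ref=last) / param=ref;
    model casecont = dbarn;
run;
```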

Graphical summaries
The macro creates four types of graphical summaries to aid in visual assessment of the associations and the distributions of the variables. A histogram for each quantitative variable and a box-and-whisker plot for each quantitative variable grouped by the outcome are created using the UNIVARIATE and BOXPLOT procedures in SAS, respectively. Formatting is done using the GOPTIONS statement in SAS and the graphics are saved in PDF files in the user-specified directory using the SAS FILENAME statement.
In addition, two bar charts are created for each categorical variable using the SAS GCHART procedure, a simple bar chart describing the frequency distribution of the variable and a grouped bar chart by the categories of the outcome variable. Similar to above, these charts are saved in PDF files in the specified directory. To enable accessibility from just one location, hyperlinks for the PDF files are created in the Excel file in which the results are stored.
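The combination of the FILENAME and GOPTIONS statements might look like the sketch below for one histogram and one grouped bar chart. The file names, the library name mylib and the ODS GRAPHICS OFF statement (added so that later SAS releases also route the histogram through GOPTIONS) are assumptions, and SAS/GRAPH is presumed to be available.

```sas
/* Use traditional graphics so that GOPTIONS controls the output */
ods graphics off;

/* Histogram of a quantitative variable, written to a PDF file */
filename histpdf "C:\myfolder\age_histogram.pdf";
goptions reset=all device=pdf gsfname=histpdf gsfmode=replace;
proc univariate data=mylib.lowbwt noprint;
    histogram age;
run;
filename histpdf clear;

/* Bar chart of a categorical variable grouped by the outcome */
filename barpdf "C:\myfolder\race_by_wtcat.pdf";
goptions reset=all device=pdf gsfname=barpdf gsfmode=replace;
proc gchart data=mylib.lowbwt;
    vbar race / discrete group=wtcat;
run;
quit;
filename barpdf clear;
```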

Binomial logistic regression
The dataset
Implementation of the UniLogistic macro is demonstrated using an example dataset NOCARDIA, described in detail in Dohoo et al. (2004). This dataset is from a case-control study conducted to investigate an outbreak of Nocardia mastitis (Ferns, Dohoo, and Donald 1991), in particular, to evaluate the association of various dry cow therapies with casecont, the case or control status of a herd (1/0).
We will assume that the data are in SAS format and are stored in the C:\myfolder directory. The dataset, UniLogistic macro and its implementation code are available as supplementary materials from the journal website.

Implementation
Specify the %INCLUDE statement to indicate the location of the macro file, input values for the various arguments (see Tables 1 and 2) and invoke the macro.

The output
An example output file for the analyses conducted using the NOCARDIA dataset (Dohoo et al. 2004) is available as supplementary material on the journal website. Sheet 1 of the Excel file displays two tables: (1) contingency tables of categorical variables with the outcome; and (2) summary statistics for quantitative variables by the outcome (Figures 1 and 2). Note that the titles and headers of the tables were created by the macro. The macro also applied the formatting, including the borders, rounding of the summary statistics to one decimal place, and fitting the column widths to their contents.
Also note that the links are provided in this sheet to view graphical summaries. These links will only work if the user implements the macro on their own computer. However, the example graphics can be downloaded from the journal website for inspection.
Sheet 2 contains a table of likelihood-ratio χ² statistics and p values for the unconditional associations of explanatory variables with the outcome (Figure 3). In addition, it contains two lists of variables (Figure 4), one sorted by p values (to help select variables for multivariable analysis), and the other by the number of missing values (to verify if any variable has missing observations). Note that the table of likelihood-ratio χ² results contains a column displaying information about model convergence. Also note that the variables for which the model did not converge were automatically omitted from the list of variables sorted by p values.
Sheet 3 contains a table of parameter estimates, standard errors, odds ratios and their 95% profile likelihood confidence intervals based on the unconditional associations of all explanatory variables with the outcome (Figure 5).
Sheet 4 contains the results of associations between all explanatory variables (Figure 6) and Pearson correlation coefficients between quantitative variables to help evaluate collinearity. See the example output file Mastitis Results.xls for details.
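For orientation, a call that would produce output of this kind might look like the sketch below. It is not the implementation code supplied with the article. The library name mylib, the location of the macro file, the keyword-style argument syntax with space-separated variable lists, and the explanatory variable names (dneo, dclox, dcpct, numcow) are all assumptions that should be checked against the dataset and Tables 1 and 2.

```sas
/* Assumed location of the data and the macro file */
libname mylib "C:\myfolder";
%include "C:\myfolder\UniLogistic.sas";

/* Binomial analysis: only the first four arguments are essential. */
/* The explanatory variable names are assumed for illustration.    */
%UniLogistic(
    dsn     = mylib.nocardia,
    outcome = casecont,
    catvar  = dneo dclox,
    contvar = dcpct numcow,
    xlfile  = Mastitis Results,
    dir     = C:\myfolder
);
```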

Ordinal and multinomial logistic regression
The procedures for conducting ordinal and multinomial logistic regression analyses for polytomous outcomes are similar to those for binomial logistic regression, and are therefore discussed only briefly. There is only one additional argument, MODELTYPE, which takes the values ordinal and multinomial, respectively, for fitting cumulative logit and generalized logit models. The generalized logit model is simply an extension of the binomial logistic model for a polytomous outcome variable and compares the log odds for each category of the outcome to the reference level. In contrast, the outcome is assumed to have a natural ordering in the cumulative logit model and the odds ratios are assumed to be the same across the categories of the outcome variable (the proportional odds assumption).
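In symbols, and under notation introduced here only for illustration (an outcome Y with categories 1 to J, a single explanatory variable x, and a reference category r), the two models can be written as follows.

```latex
% Generalized logit model: one intercept and one slope
% for each non-reference category of the outcome
\[
\log\frac{P(Y = j \mid x)}{P(Y = r \mid x)} = \alpha_j + \beta_j x,
\qquad j \neq r
\]

% Cumulative logit model (probability of a higher score):
% one intercept per cut-point but a single common slope
\[
\log\frac{P(Y \ge j \mid x)}{P(Y < j \mid x)} = \alpha_j + \beta x,
\qquad j = 2, \dots, J
\]
```

The single slope in the second model is what makes the odds ratio exp(β) identical at every cut-point, so an outcome with J categories yields J − 1 intercepts but only one slope for each explanatory term.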

The dataset
We will use the low birth weight dataset (LOWBWT) for conducting ordinal and multinomial logistic regression analyses, assuming the outcome to be ordinal and nominal, respectively. This dataset is from a study conducted to investigate factors associated with low birth weight and is described in detail in Hosmer and Lemeshow (2000).

Implementation and the output
Implementation of the macro is similar to conducting binomial logistic regression analyses except for the use of an additional MODELTYPE argument. The following code will evaluate the unconditional association of six categorical variables (RACE, SMOKE, PTL, HT, UI, FTV) and one continuous variable (AGE) with the ordinal outcome (WTCAT) and save the results in an Excel file lbwt_results_ordinal in the directory C:\myfolder. See Tables 1 and 2 for further details.
The output produced will be similar to that discussed above for binomial logistic regression, except for the table of parameter estimates and odds ratios, which will show three intercepts for each variable instead of the single intercept in binomial logistic regression (Figure 7). See the output file lbwt_results_ordinal, available as supplementary material on the journal website, for further details.
The code for building generalized logit models is the same as for ordinal logistic models except that the user needs to specify multinomial for the MODELTYPE argument. The output produced by the macro will also be similar to that produced for ordinal models (the estimates will, of course, differ between model types), except for the table of parameter estimates and odds ratios, which will contain an additional column for the levels of the outcome variable (Figure 8).
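The original listing is not reproduced in this text, so the call is sketched here from the description above. The argument names and values come from that description; the library name mylib, the location of the macro file and the keyword-style syntax with space-separated variable lists are assumptions.

```sas
/* Assumed location of the data and the macro file */
libname mylib "C:\myfolder";
%include "C:\myfolder\UniLogistic.sas";

/* Ordinal (cumulative logit) analysis of the LOWBWT data */
%UniLogistic(
    dsn       = mylib.lowbwt,
    outcome   = wtcat,
    catvar    = race smoke ptl ht ui ftv,
    contvar   = age,
    modeltype = ordinal,
    xlfile    = lbwt_results_ordinal,
    dir       = C:\myfolder
);
```

For the generalized logit analysis, the same call would apply with modeltype = multinomial and, to avoid overwriting the ordinal results, a different XLFILE name.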
The dataset, the implementation codes and the detailed outputs produced by the macro for both ordinal and multinomial logistic regression analyses are available from the journal website.

Discussion
Significant time is consumed in properly conducting descriptive and univariable logistic regression analyses. This results in a tendency by some analysts to jump straight to multivariable analyses, which can lead to erroneous conclusions (Dohoo et al. 2004). The macro described in this paper reduces the time needed for descriptive and univariable logistic regression analyses, and therefore encourages analysts to follow a systematic approach to data analysis. This will enable better conclusions from studies, particularly those with a large number of explanatory variables, a common scenario in epidemiologic research (Bagley, White, and Golomb 2001).
Time is required not only for conducting analyses, but also for manually creating tables of results for preparing journal articles or research reports based on the output from a statistical program. UniLogistic macro automatically prepares tables of results that can be easily adapted for various purposes, further saving researchers' precious time.
From the very outset, the aim was to create a program that is easy for users to implement because many of the potential users were likely to be graduate students. Therefore, in its simplest form, the user needs to specify only four arguments to implement the macro, and all the analyses are conducted, results compiled, and tables and graphs created in less than a minute on most personal computers. The user-friendly nature of the macro encouraged not only graduate but also many undergraduate students to use it for analyzing their research project data at our institute.
Of course, like any program, this macro has limitations. First, it does not have as many options as are available in a standard statistical package. Second, although peer reviewers have tested the macro and every care has been taken to remove any potential errors, it has probably not been tested under all possible conditions. To safeguard the user against erroneous results, however, the macro does not delete any major SAS output, so that the user can inspect the output in addition to inspecting the tabulated results in the Excel file. Third, the user must have basic knowledge and understanding of SAS (SAS Institute Inc. 2003a) to implement the macro. Finally, the macro does not perform multivariable analyses but gives the user almost all the information needed to start building multivariable models.
In addition to these limitations, some precautions are necessary during implementation. First, as the macro itself creates and saves the Excel file for exporting results, the user must ensure that Excel is closed before invoking the macro, and that no other Excel file with the name specified in the macro code is present in the nominated directory. Secondly, to function properly, the macro turns off Excel error messages. Therefore, the user must save any changes made in the Excel file after its creation because Excel will not prompt the user to save these changes. However, if Excel is closed and reopened, the original settings are restored, i.e., Excel will again prompt the user to save any changes made in the file. Thirdly, the hyperlinks created in the Excel file to access the PDF files refer to the nominated directory and file name. Therefore, the links will not work if the user tries to open the files via the hyperlinks on a computer other than the one on which the analyses were conducted. In that case the user will need to open the files manually from the directory where they are stored.
Other than these limitations and implementation issues, the macro functions smoothly and can save a lot of time for users besides encouraging them to follow a systematic approach to conduct analyses.