Teaching Statistical Inference Through a Conceptual Lens: A Spin on Existing Methods with Examples

Abstract Using software to teach statistical inference in introductory courses opens the door for methods and practices that are more conceptually appealing to students. With an increasing number of fields requiring competency in statistics, including data science, the natural and social sciences, and public health, it is crucial that we as instructors deliver the basic concepts of statistics effectively. In line with the guidelines presented in the GAISE College Report, this article demonstrates intuitive approaches to teaching proportion and mean inference that take advantage of statistical software and emphasize conceptual understanding. The article recommends putting aside asymptotic-based methods for proportion inference and using the exact binomial method instead. Regarding mean inference, we propose a more contextualized and simplified process that uses the distribution of the sample mean directly and avoids standardized statistics such as z or t. In both the proportion and mean inference contexts, we discuss the benefits of the proposed approaches and provide detailed examples that demonstrate the methods using the Rguroo statistical software.

While this article's methods for teaching proportion and mean inference do not include simulation, we give credence to the ideas in Cobb (2015) and GAISE (2016), which recommend using simulation-based methods to teach inference. However, the adoption of simulation-based methods in introductory courses has been slow, perhaps for the following reasons: (a) Some instructors of introductory courses may not have formal training in statistics or may be uncomfortable or unfamiliar with simulation-based methods. According to the CBMS 2015 report, over 60% of faculty who teach introductory statistics courses in mathematics and statistics departments are not tenured or tenure-eligible. (b) Teaching simulation concepts to introductory students requires a good amount of class time. This time may not be available to instructors who must cover a set of topics within a semester or a quarter. (c) Simulation machinery may be difficult for introductory students to learn, especially if it involves writing computer code. A note on this point is that applets such as ArtofStat, Rossman/Chance, and StatKey provide great simulation tools for teaching sampling distributions. However, statistical software in which students can save and reproduce results should be used for exercises and assessment.
Although simulations are effective in teaching sampling variability, in our experience they are not as effective in introducing inference concepts. We recommend presenting the elements of inference using one-population proportion inference and propose the use of the exact binomial method. As we illustrate in Section 2, this approach allows us to put aside the simulation machinery and, as a result, makes it simpler for students to focus on the underlying concepts of inference. Additionally, using the binomial distribution avoids unnecessary complications, such as the sample size restrictions that arise when approximating the binomial by the normal through the central limit theorem. This recommendation is in line with Chance and Rossman (2001), who state, "Important but peripheral concerns associated with inference for means should wait until students have an understanding of basic inferential principles. Studying proportions first also allows for exact calculations of p-values and power from the binomial distribution." As with proportion inference, understanding mean inference has its own challenges for students. Grasping the sampling distribution of the sample mean, which forms the foundation of mean inference, is arguably more complicated than that of the sample proportion. Again, using simulation to understand the sampling distribution of the sample mean is helpful; however, continuing with simulation-based methods to teach mean inference adds complications similar to those described previously in the proportion inference context. All textbooks that we researched that emphasize simulation-based methods, including Diez, Cetinkaya-Rundel, and Barr (2019), Lock et al. (2021), and Tintle et al. (2021), include the normal and t-based methods for mean inference. A question, then, is whether there is room to improve the way we teach the normal- and t-based methods.
To our knowledge, every introductory statistics text transforms the sample mean to obtain the unitless standard z- or t-statistic. As we show in Section 3, we can eliminate the standardization step in both normal-theory and t-based inference. Two benefits of avoiding standardization are that the number of computation steps is reduced and that working in the original data units lets us introduce the inferential methods in a natural context.
A key to implementing the methods that we discuss here is the use of appropriate statistical software. While many textbooks use technology in their presentations, there is room for further modernization. Of the 20 introductory statistics textbooks that we looked at, 18 include z and t tables, and 14 provide TI calculator instructions. Considering the limitations of calculators and the availability of affordable statistical software, it is surprising that many textbooks have not steered away from probability tables and calculators. One of GAISE's (2016) goals for students in introductory statistics courses is that "students should be able to interpret and draw conclusions from standard output from statistical software packages." Moreover, there are indications that using software improves overall course success rates (Robinson 2020). The GAISE College Report (2016) lists ten considerations for teachers when selecting technology tools. In this article, we use the Rguroo statistical software (https://rguroo.com/) to present our examples. This software adheres to the ten GAISE technology guidelines.

Teaching Proportion Inference Using Exact Binomial
In teaching inference about a population proportion p, most introductory statistics texts use the sampling distribution of the sample proportion p̂ = X/n, where X is the number of successes in n trials of an experiment. Of these textbooks, most use the asymptotic distribution, and a few texts (e.g., Lock et al. 2021; Tintle et al. 2021) obtain the sampling distribution of p̂ by simulation.
If a student has understood the central limit theorem, then it is reasonable to assume that they would have a conceptual understanding of the asymptotic result in (1). From our experience, however, most students in an introductory course do not get a good grasp of the central limit theorem and tend to follow prescribed steps to solve proportion inference problems. Additional complexities in this approach include introducing the formula for the standard error of p̂, shown in (1), and the limitations imposed on n and p by requiring large sample sizes. Some books introduce relatively simple methods to deal with small n, for example, the "plus four" method of Agresti and Coull (1998) for obtaining 95% confidence intervals; yet this adds another layer of complexity.
Demonstrating the distribution of p̂ using simulation offers students a conceptual understanding of sampling variability and distributions. Since counts are generally easier for students to fathom than proportions, we prefer teaching sampling variability by simulating from the distribution of X, the number of successes in n trials, where X ∼ Binomial(n, p), as shown in (2). Only after the idea of sampling variability is understood do we state the equivalence of the distributions of p̂ and X.
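The count-based simulation described above takes only a few lines of code to demonstrate. The sketch below, written in Python purely for illustration (the article itself uses Rguroo), draws many samples of X ∼ Binomial(n = 10, p = 0.2) by summing Bernoulli trials and tabulates the relative frequencies; the parameter values and variable names are our own choices.

```python
import random
from collections import Counter

random.seed(1)
n, p, reps = 10, 0.2, 10_000  # trials per sample, success probability, number of simulated samples

# Each replicate draws n Bernoulli(p) trials and records X, the number of successes
draws = [sum(random.random() < p for _ in range(n)) for _ in range(reps)]

# The relative frequency of each count approximates the Binomial(n, p) pmf
freq = Counter(draws)
for x in sorted(freq):
    print(f"X = {x:2d}: {freq[x] / reps:.4f}")
```

Plotting these frequencies as a bar chart gives the simulated sampling distribution of X; dividing each count by n gives the corresponding distribution of p̂.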
As noted in the Introduction, we recommend teaching proportion inference by computing binomial distribution probabilities directly through the exact binomial method. This only requires knowledge of the binomial random variable, a topic that is covered early on in most introductory courses. Computation of binomial probabilities, which is generally difficult to do by hand, is left to software.
The idea of using the exact binomial method for inference about a population proportion dates back about 90 years, to Clopper and Pearson (1934), yet it is not used as a primary method in introductory textbooks. Older textbooks did not cover the exact binomial method, and reasonably so, because computational software was not widely available. Fortunately, binomial probabilities for arbitrary values of n and p can now be calculated readily with statistical software.
Three advantages of using the exact binomial method are as follows: (a) Count data has an elementary construct that helps students learn elements of inference in a context-friendly setting. (b) Knowledge of the central limit theorem, simulation methods, or formulas is not required. (c) Student learning is not impeded by adding assumptions on the sample size n and the population proportion p.
In the following two sections, we give details and examples of the exact binomial method for hypothesis testing and obtaining confidence intervals for a one-population proportion.

Hypothesis Testing
In teaching hypothesis testing about a population proportion, we start by introducing the following three types of null and alternative hypotheses: H0: p = p0 versus Ha: p > p0, Ha: p < p0, or Ha: p ≠ p0, where p0 is a given fixed value. We explain to our students that prior to looking at the data, we should decide on a significance level α for our test, which determines the amount (probability) of Type I error that we are willing to tolerate. In using the exact binomial test, the decision on whether or not to reject the null hypothesis depends on our test statistic X, the number of successes that we observe in a sample of size n. Having learned the binomial distribution, students can identify the distribution of X as binomial with number of trials n and probability of success p0. We refer to this distribution as the null distribution, since p0 is the value of p in the null hypothesis H0: p = p0. In the following two sections, we explain how to make a decision on rejecting or not rejecting the null hypothesis based on a critical region and a p-value.

Decision Based on a Critical Region
The critical region approach to testing a hypothesis about a population proportion at a significance level α involves determining a set of X-values that form the critical (rejection) region. These X-values favor the alternative hypothesis and must satisfy the following two conditions: Condition 1: The probability that X belongs to the critical region is at most α under the null distribution X ∼ Binomial(n, p0). Condition 2: The X-values are determined so that the Type II error is minimized.
Because the binomial distribution is discrete, a limitation of the exact binomial test is that we are not guaranteed a size α test, where we would achieve the exact significance level, but we do get a level α test, where we guarantee that the Type I error remains below the significance level α. The distinction between a size α test and a level α test is not very important in introductory courses or, generally, in practice.
When explaining critical regions to students, we suggest starting with a one-sided test. Consider an alternative of the form Ha: p > p0. With a bit of guidance, students can conclude that large values of X favor the alternative hypothesis Ha, and our critical region would consist of values in the right tail of the distribution of X. Now let's go through an example.
Example 1 (Pineapple Example). In this example, we conduct a survey of the students in our classroom to answer the question: Do more than 20% of students in our class like to have pineapple on their pizza? The hypotheses here are H0: p = 0.20 versus Ha: p > 0.20, where p is the proportion of students in our class who like pineapple on their pizza. We take a random sample of size n = 10 from our class and ask each of the selected students if they like pineapple on their pizza. Before revealing the result of our sample, we ask students to guess the critical region. We then use our probability calculator to test whether their guesses satisfy Condition 1. After this exercise, we use a probability calculator to find the critical region.
Figure 1(a) shows Rguroo's probability calculator dialog box. In the dialog box, we select the Probability ⇒ Values option to indicate that we are computing an inverse probability. We then select the Binomial distribution, specify the parameters No of Trials, n = 10, and Prob of Success, p = 0.20, select the option Upper Tail, and type in our significance level 0.05. Figure 1(b) shows the resulting output. It shows that P(X ≥ 4) = 0.1209 > 0.05 and P(X ≥ 5) = 0.03279 < 0.05. Therefore, we conclude that our critical region consists of X-values greater than or equal to 5. In other words, if five or more students in our sample stated that they like pineapple on their pizza, we reject the null hypothesis.
Usually, multiple student guesses satisfy Condition 1 but not Condition 2. We then pose the question of why X ≥ 5 is the correct critical region as opposed to other regions that satisfy Condition 1. To explain this to students, we take the example of X ≥ 6. We compute P(X ≥ 6) = 0.0064, which is less than 0.05 and therefore satisfies Condition 1. However, we note that Condition 2 requires a region consisting of X-values that minimizes the Type II error. The Type II error for the critical region X ≥ 5 is smaller than that of X ≥ 6. We can clarify this point to our students without computing the Type II error directly, by noting that the region X ≥ 5 consists of more X-values than the region X ≥ 6. Thus, we have a lower chance of failing to reject H0 with X ≥ 5 than with X ≥ 6, and therefore X ≥ 5 is the region with the smaller Type II error. Students should understand this explanation since they would know that P(Type II error) = P(failing to reject H0 | Ha is true).
At this point, we state that for the case Ha: p > p0, the critical region consists of the values X ≥ k, where k is the smallest value such that P(X ≥ k) ≤ α, with X ∼ Binomial(n, p0). Analogously, for an alternative hypothesis of the form Ha: p < p0, the critical region consists of the values X ≤ k, where k is the largest value such that P(X ≤ k) ≤ α.
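For instructors who want to show this search mechanically, the rule above can be sketched in a few lines of Python (offered only as an illustration of the computation; the function names are ours):

```python
from math import comb

def upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, x) * p0**x * (1 - p0)**(n - x) for x in range(k, n + 1))

def critical_k_upper(n, p0, alpha):
    """Smallest k with P(X >= k) <= alpha; the critical region is then {k, ..., n}."""
    for k in range(n + 1):
        if upper_tail(k, n, p0) <= alpha:
            return k
    return None  # no value of k satisfies the level; we can never reject

# Pineapple example: n = 10, p0 = 0.20, alpha = 0.05
print(critical_k_upper(10, 0.20, 0.05))   # 5
print(round(upper_tail(4, 10, 0.20), 4))  # 0.1209, above 0.05
print(round(upper_tail(5, 10, 0.20), 4))  # 0.0328, at or below 0.05
```

The same loop, run downward from n with the lower-tail probability, gives the critical region for Ha: p < p0.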
The critical region for two-sided tests involves both the lower and the upper tail of the null distribution. In this case, we need to determine values k1 < k2 such that P(X ≤ k1) + P(X ≥ k2) ≤ α while at the same time satisfying Condition 2. We leave the task of obtaining k1 and k2 to software because it is somewhat more complex than the one-sided case. The following is a two-sided example.
Example 2 (UCLA Example). UCLA's website states that 20% of its students are Hispanic. Since some students do not report their ethnicity, we do not know whether this reported value is an underestimate or an overestimate of the true proportion. To investigate, we test the hypotheses H0: p = 0.20 versus Ha: p ≠ 0.20, where p denotes the true proportion of Hispanic students at UCLA. We took a random sample of size 25 from UCLA students and observed two Hispanic students in our sample. Perform a test of hypothesis at the α = 0.05 significance level.
We use Rguroo's One Population Proportion Inference function to obtain the critical region. Figure 3 includes the Rguroo output that shows the critical region graph for this example. As shown by the red bars in the graph, we would reject the null hypothesis if our x_obs were one of 0, 1, 10, 11, ..., 25. Our observed value is 2, as indicated by the green triangle, which does not fall in the critical region. Therefore, we do not reject the null hypothesis. As noted earlier, due to the discreteness of the binomial distribution, we cannot always obtain a critical region with the exact significance level of α = 0.05. The graph legend shows the exact significance level of α = 0.044722.
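Although we leave the search for k1 and k2 to software, the reported region is easy to verify: summing the null probabilities over the region reproduces the exact significance level shown in the legend. A Python sketch of this check (for illustration only):

```python
from math import comb

def pmf(x, n, p0):
    """P(X = x) for X ~ Binomial(n, p0)."""
    return comb(n, x) * p0**x * (1 - p0)**(n - x)

n, p0 = 25, 0.20
region = [0, 1] + list(range(10, 26))  # critical region reported by the software

# Exact significance level: total null probability of the critical region
exact_level = sum(pmf(x, n, p0) for x in region)
print(round(exact_level, 6))  # about 0.044722, below the nominal 0.05
```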

Decision Based on a p-Value
We begin by introducing the p-value as a measure of how strongly the evidence in our observed sample favors the alternative hypothesis, assuming that the null hypothesis is true. In teaching this concept, we go through the following steps: Step 1: We explain to students that our evidence is x_obs, the observed number of successes in our sample of size n.
Step 2: The strength of the evidence is measured using probabilities. In particular, here we use probabilities from a binomial distribution with parameters n and probability of success p0. We emphasize that we use p0 since, in our testing procedure, we assume that the null hypothesis is true.
Step 3: Using a few examples, we have students think about what we call the Ha support, the set of values of X that are as favorable or more favorable than x_obs to the alternative hypothesis Ha under the null distribution. We prefer the wording "as favorable or more favorable" to the commonly used "as extreme or more extreme," since favorability conveys a direction.
Step 4: We define the p-value as p-value = P(X belongs to the Ha support | H0 is true).
Step 5: We reject H0 if the p-value is smaller than our predetermined significance level α. Otherwise, we fail to reject H0.
A key step here is determining the Ha support. In the case of one-sided tests, for example Ha: p > p0, it is not difficult to explain to students that a larger number of successes X is more favorable to the alternative Ha. Thus, the Ha support for this case consists of x_obs and the values of X that are larger than x_obs. Similarly, it can be explained that the Ha support for Ha: p < p0 consists of x_obs and the values of X that are less than x_obs.
Determining the Ha support for two-sided tests is somewhat more involved. We explain to our students that whether one value is more favorable than another to Ha is measured by comparing probabilities; specifically, the values of X that are more favorable to Ha: p ≠ p0 than x_obs are those with probabilities less than that of X = x_obs under the null distribution. Fortunately, this concept can be illustrated using the probability bar graph of the null distribution X ∼ Binomial(n, p0). On the graph, we locate our observed value x_obs. Then, the values of X with bars as tall as that of x_obs or shorter form our Ha support. We leave the computation of the p-value to software. Let's consider an example.
Example 3 (UCLA Example, Continued). Recall the UCLA Example setting. To begin teaching the p-value approach, we ask students to determine the Ha support by looking at the null distribution graph. Figure 4(a) shows a bar graph of the probability mass function of Binomial(n = 25, p = 0.2), the null distribution, obtained using Rguroo's probability calculator. This graph should be familiar to students from when they learned about the binomial distribution. In our example, x_obs = 2, with P(X = 2 | p = 0.2) = 0.07084. Students can see from the graph that the values X = 0, 1, 8, 9, ..., 25 have lower probabilities (shorter bars) than x_obs = 2 and are more favorable to Ha. Thus, the Ha support is the set of values {0, 1, 2, 8, 9, ..., 24, 25}, and the p-value is the sum of the probabilities of the values of X in the Ha support, which is 0.2073, as shown in Figure 4(b).
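For instructors who want to make the "shorter bars" comparison concrete, the construction of the Ha support can be mirrored in a short script. The Python sketch below is our own illustration of the computation:

```python
from math import comb

def pmf(x, n, p0):
    """P(X = x) for X ~ Binomial(n, p0)."""
    return comb(n, x) * p0**x * (1 - p0)**(n - x)

n, p0, x_obs = 25, 0.20, 2
p_obs = pmf(x_obs, n, p0)  # height of the bar at the observed value

# Ha support: all values whose null probability is no larger than that of x_obs
# (a tiny tolerance guards against floating-point ties)
support = [x for x in range(n + 1) if pmf(x, n, p0) <= p_obs + 1e-12]

# p-value: total null probability of the Ha support
p_value = sum(pmf(x, n, p0) for x in support)
print(support)            # {0, 1, 2} together with {8, ..., 25}
print(round(p_value, 4))  # 0.2073
```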
Once students understand the concept and the steps involved in computing the p-value, we suggest that they use standard proportion inference functions in statistical software to perform the computations. Figure 5 shows the output from the Rguroo One Population Proportion function resulting from the input shown in Figure 2. The output includes a table containing the p-value of 0.20735 and a graph of the null distribution marking x_obs with a green triangle and the Ha support with red bars. The legend includes a formula for how the p-value is computed.
Figure 6 shows how to use the binom.test() function in R to compute the p-value for the exact binomial test. The figure also includes the R output. The p-value in the R output agrees with that reported in Rguroo's output in Figure 5.

Confidence Intervals
Most introductory statistics textbooks construct a 100(1 − α)% confidence interval for one population proportion using the formula p̂ ± z*√(p̂(1 − p̂)/n), shown in (5), where p̂ is the sample proportion and z* is the 1 − α/2 quantile of the standard normal distribution. Here, the standard error of p̂ is approximated by replacing p with p̂ in the formula √(p(1 − p)/n). A few textbooks use the bootstrap to estimate the standard error of p̂ or use the bootstrap distribution quantiles to obtain a confidence interval. We recommend introducing confidence intervals for one population proportion by inverting an exact binomial test (see, e.g., Casella and Berger 2002). The method of inverting a test of hypothesis to obtain a 100(1 − α)% confidence interval was initially proposed by Clopper and Pearson (1934). To obtain the lower and upper confidence limits, they proposed finding all p0's for which we would not reject H0: p = p0 versus the two one-sided alternatives Ha: p > p0 and Ha: p < p0 at the α/2 significance level. Blyth and Still (1983) proposed an alternative approach of inverting the two-sided test, where one finds all p0's such that H0: p = p0 is not rejected versus Ha: p ≠ p0 at a significance level of α. They go on to show that their proposed interval is narrower and less conservative than the Clopper-Pearson interval. Triola (2022) mentions the use of Clopper-Pearson confidence intervals for small samples and notes that they are too conservative. He goes on to say that their computation is beyond the scope of his textbook and does not offer any further details or heuristics about the method. Tintle et al. (2021) introduce the Blyth-Still method. To get around its computational complexity, they use a grid of plausible p0 values and perform two-sided tests, using simulation, at each of the grid points. The boundaries of the confidence interval are determined by the smallest and largest grid-point values for which H0 is not rejected.
We recommend using the Blyth-Still confidence interval because of its direct connection with the two-sided exact binomial hypothesis test presented in Section 3, as well as its better properties. We explain to students that the interval comprises the plausible values p0 of the population proportion p for which we do not reject the null hypothesis. To present confidence intervals based on the formula given in (5), some textbooks simply give the formula, and others derive the interval using probability statements and algebraic inequalities. Regardless of whether or not a derivation of the formula is included, this method does not provide the same conceptual understanding and connection to hypothesis testing as the inversion of a hypothesis test.
To introduce the method, we like the idea of using a grid of values, as in Tintle et al. (2021). Once students understand the idea behind inverting a test, we leave the computation of the interval to software. Most software packages report the Clopper-Pearson interval as the "exact binomial" confidence interval, perhaps because Blyth-Still intervals are more complex to implement. In the following example, we show how to use Rguroo to obtain both the Clopper-Pearson and the Blyth-Still confidence intervals.
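To show students what inverting a test means computationally, the grid idea can be sketched directly. The Python sketch below is our own illustration, not the Blyth-Still algorithm: it inverts the two one-sided exact tests at level α/2, which is the Clopper-Pearson construction, keeping every grid value p0 that neither one-sided test rejects and reporting the smallest and largest survivors.

```python
from math import comb

def cdf(x, n, p0):
    """P(X <= x) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x + 1))

def clopper_pearson_grid(x_obs, n, alpha=0.05, grid=10_000):
    """Approximate Clopper-Pearson interval: scan a grid of p0 values and keep
    those that neither one-sided exact test rejects at level alpha/2."""
    kept = [i / grid for i in range(1, grid)
            if 1 - cdf(x_obs - 1, n, i / grid) > alpha / 2  # p-value for Ha: p > p0
            and cdf(x_obs, n, i / grid) > alpha / 2]        # p-value for Ha: p < p0
    return min(kept), max(kept)

# UCLA example: x_obs = 2 successes in n = 25 trials
lo, hi = clopper_pearson_grid(2, 25)
print(round(lo, 3), round(hi, 3))  # roughly (0.010, 0.260)
```

The Blyth-Still interval instead keeps the p0's not rejected by the single two-sided test at level α, which requires the two-sided Ha support at each grid point and is why we leave it to software.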
Example 4 (UCLA Example, Continued). As in Example 2, suppose that in a random sample of 25 students from UCLA, we observe two Hispanic students. Figure 7(a) shows Rguroo's dialog for obtaining a 95% confidence interval using both the Blyth-Still and Clopper-Pearson methods. The Blyth-Still method is the default and is selected by checking the Binomial(Exact) option in the Basics dialog. To obtain the Clopper-Pearson interval, the option Binomial(Exact-CP) is selected in Rguroo's Details dialog.

Teaching Mean Inference without Standardized Statistics
In many introductory texts, inference about a population mean μ is introduced using z- and t-based methods. To make inference about μ, the sample mean X̄ from a sample of size n is used as the statistic of choice. By the central limit theorem, for a sufficiently large sample, the sampling distribution of X̄ is X̄ ∼ N(μ, σ_X̄), shown in (6), where σ_X̄ = σ/√n denotes the standard error of X̄, with σ denoting the population standard deviation. Almost every textbook standardizes the statistic X̄ using the transformation Z = (X̄ − μ)/σ_X̄, shown in (7). When σ is not known, the estimated standard error σ̂_X̄ = s/√n is used in place of the standard error in (7), where s is the sample standard deviation.
When the population distribution of the variable under study is normal and σ is unknown, the t-statistic T = (X̄ − μ)/(s/√n), shown in (8), is used; T has the standard Student t distribution with n − 1 degrees of freedom, denoted t_{n−1}.
In teaching mean inference, we have two recommendations: 1. Begin by assuming that the population distribution of the variable under study is normal. 2. Avoid using standardized statistics such as Z and T, shown in (7) and (8), and use the distribution of X̄ directly.
Recommendation 1 is motivated by the idea that students would not need knowledge of the central limit theorem, nor would they need to be concerned with small or large sample sizes, while they begin to learn the basics of mean inference. We teach the central limit theorem only after introducing the elements of hypothesis testing and confidence intervals for proportions and means assuming normality. In addition to reducing complexity, this delay lets us teach the central limit theorem once students have reached more maturity with statistics and sampling distributions.
Regarding the second recommendation, standardization is an age-old practice inherited from the times when probability calculators were not readily available and we were forced to use tables for computing probabilities. As previously mentioned, surprisingly many textbooks continue to include probability tables. As we will show, putting aside this tradition and using a probability calculator in software reduces the steps needed to perform a hypothesis test or construct a confidence interval. Furthermore, avoiding standardization allows for a conceptual understanding of the inferential methods, since we work in the units of the data rather than with the standardized Z and T statistics, which are unit-less and cannot be directly interpreted in the context of a problem.
In the following two sections, we will outline the steps that we propose for hypothesis testing and developing confidence intervals for a population mean and include some examples.

Hypothesis Testing
We present the ideas in this section using a two-sided hypothesis test, which has the form H0: μ = μ0 versus Ha: μ ≠ μ0, shown in (9), where μ denotes the population mean and μ0 is a fixed hypothesized value. The methods that we present can be adapted to one-sided tests, where Ha: μ < μ0 or Ha: μ > μ0. Although we find it easier to use one-sided tests when initially explaining the concepts of rejection regions and p-values, we work with two-sided tests here since they are more commonplace in the real world.

Decision Based on a Critical Region: The Normal Distribution Case (σ Known)
Consider testing the hypotheses in (9) at a significance level α, where we assume that the population is normal and therefore, regardless of the sample size, the sample mean has the normal distribution shown in (6). Let x̄_obs denote the observed sample mean. Table 1 compares the steps required to obtain the critical region when using the standardized Z statistic (left panel) versus using the distribution of X̄ directly (right panel).
As we see in Table 1, using the Z statistic involves five steps, whereas using the distribution of X̄ directly requires three steps. In both methods, we introduce the sampling distribution X̄ ∼ N(μ0, σ_X̄). Also, to obtain the critical values, both methods require the α/2 and 1 − α/2 quantiles of a normal distribution. In the standardized case, we obtain the upper critical value z* using P(Z > z*) = α/2, where Z ∼ N(0, 1), and the lower critical value is −z* (Step 3). In the direct method, we obtain the lower critical value x̄*_L using P(X̄ < x̄*_L) = α/2 and the upper critical value x̄*_U using P(X̄ > x̄*_U) = α/2, where X̄ ∼ N(μ0, σ_X̄) (Step 2). The two additional steps in using the Z statistic are introducing the standardized statistic Z (Step 2) and computing the standardized observed value z_obs (Step 4).
Table 1. Comparing the steps to obtain the critical region using the standardized Z statistic versus using the distribution of X̄ directly, for a two-sided test.
When using the standardized Z statistic method, the required probabilities can be looked up in a probability table. While this was a major advantage in the not-too-distant past, when probability calculators were not easily accessible, today spending time teaching students how to use a probability table is no longer a good use of class time. We can compute probabilities and inverse probabilities using the probability calculators that are widely available in software. More importantly, the critical values x̄*_L and x̄*_U and the observed value x̄_obs are in the units of the observed data. Contrast this with explaining the standardized values z_obs and z*, which are unit-less and do not directly relate to the context of the problem. Let's consider an example to illustrate this point.
Example 5 (SAT Example). According to the College Board's SAT Suite of Assessments annual report, 1,509,133 high school students took the SAT in 2021. For these students, the mean on the math portion of the exam was 528 out of 800, with a standard deviation of 120. In the Fall 2022 semester, we asked our students at California State University, Fullerton (CSUF) to take a random sample of 50 first-year students and use the sample data to investigate whether the mean math SAT score for students on the CSUF campus that year was significantly different from the national average. The mean math SAT score for our sample was x̄_obs = 565. Treating the standard deviation of 120 as the population standard deviation, we perform the test H0: μ = 528 versus Ha: μ ≠ 528 at the α = 0.05 level, where μ is the mean math SAT score of CSUF freshmen who started in Fall 2022. One option is to use a probability calculator to obtain the critical values. However, to get a more detailed output, we use Rguroo's One Population Mean Inference function. The left image in Figure 8 shows Rguroo's Basics dialog, where we specify the parameters required to perform the test. The right image in Figure 8 is the Details dialog, where we have checked the P-Value Graph and Critical Region Graph options to obtain graphs that can be used to visually explain the p-value and the critical region.
Figure 9 shows a portion of the resulting Rguroo output report. Above the table, the alternative hypothesis is stated in words, "Mean of CSUF SAT Math Scores is not equal to 528," and the 2.5% lower and upper critical values of x̄*_L = 494.74 and x̄*_U = 561.26 are shown. The table consists of the lower and upper critical z-scores of −1.96 and 1.96, respectively, and the standardized observed value z_obs = 2.18.
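The data-scale critical values in this output can be reproduced with any probability calculator. A minimal Python sketch using the standard library's NormalDist (our illustration, not part of the article's Rguroo workflow):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 528, 120, 50  # hypothesized mean, population sd, sample size
null = NormalDist(mu=mu0, sigma=sigma / sqrt(n))  # null distribution of the sample mean

alpha = 0.05
x_lower = null.inv_cdf(alpha / 2)      # lower critical value, in SAT points
x_upper = null.inv_cdf(1 - alpha / 2)  # upper critical value, in SAT points
print(round(x_lower, 2), round(x_upper, 2))  # 494.74 561.26

x_obs = 565
print(x_obs < x_lower or x_obs > x_upper)  # True: the observed mean falls in the critical region
```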

Now let's consider two possible explanations of these results to students:
Using standardized values: If z_obs, the z-value corresponding to our observed sample mean, either exceeds the critical value 1.96 or is less than −1.96, we reject the null hypothesis.
In this example, our observed z-value is 2.18, which falls in the critical region. Therefore, we reject H0 and conclude that there is sufficient evidence at the 5% level that the mean math SAT score of CSUF students is significantly different from the national average of 528. Using data-scaled values: If our observed sample mean x̄_obs of math SAT scores is less than the lower critical value of 495 or exceeds the upper critical value of 561, we reject the null hypothesis. Since our sample mean of 565 is larger than 561 and falls in the critical region, we conclude that there is sufficient evidence at the 5% level that the mean math SAT score of CSUF students is significantly different from the national average of 528.
The main idea here is to teach our students what it means for a result to be "significantly different" from the hypothesized value. In this example, it makes conceptual sense when we explain to our students that "if our observed sample mean for the math SAT score is outside of the plausible range of 495 to 561, we consider it significantly different from the hypothesized value of 528." Compare this to stating the rule using the standardized value of 2.18 and the range −1.96 to 1.96, which have no direct interpretation in the context of our problem.
Figure 9 shows two critical region graphs. The graph on the left shows the null distribution in the scale of the observed data, and that on the right shows the null distribution on the standardized z-scale. The red-shaded regions in both graphs show the critical region. The green triangle on the left graph points to the location of the observed sample mean, 565, while that on the right graph shows the location of the standardized observed sample mean, 2.18. Both green triangles fall in the critical (red) region, indicating that we should reject the null hypothesis. The graph on the left is helpful in explaining the concept of the critical region. Specifically, it shows the distribution of the sample mean X̄ in the scale of the data, centered at the hypothesized null value of μ = 528. By looking at this distribution, students can see plausible values for the sample mean and make comparisons to the observed sample mean directly. On the other hand, the standardized graph on the right is centered at zero, which has no direct connection to the stated null hypothesis and fails to contextualize the variability of the SAT math scores.
Table 2. Comparing steps to obtain the p-value using the standardized Z statistic versus using the distribution of X̄ directly for a two-sided test.

Decision Based on a p-Value: The Normal Distribution Case (σ Known)
Table 2 shows the steps required in teaching and computing p-values when using the commonly used standardized statistic Z versus the unstandardized X̄. As in the critical region case, the standardized approach involves more steps than directly using X̄. As in the exact binomial case, to introduce the p-value for testing hypotheses about μ, we define the Ha support as the X̄-values that are as favorable or more favorable to Ha than x̄_obs, provided that the null hypothesis H0 is true; here, favorability is determined based on density values. Specifically, when looking at the probability density function of X̄ ∼ N(μ0, σ_X̄), the X̄-values for which the density is less than or equal to the density at x̄_obs are as favorable or more favorable than x̄_obs to the alternative hypothesis. Then, as before, we define the p-value as P(X̄ belongs to the Ha support). As we show in the next example, it is very helpful to explain these concepts using a p-value graph.
Example 6 (SAT Example-Continued). Figure 10 shows the p-value graphs for Example 5, the SAT Example. The left panel shows the p-value graph in the scale of the data, and the right panel shows the p-value graph in the z-scale. Again, because the graph in the left panel is in the scale of the data, it can be used to explain the p-value conceptually, whereas the graph on the right-hand side, which is in the standard scale, is not as easily interpretable in the context of the data. So, we continue with the graph on the left. The green triangle points to the location of the observed sample mean of x̄_obs = 565. The orange horizontal dashed line is drawn at the height of the density at x̄_obs. We explain to students that the Ha support (indicated by the red-shaded region) consists of all X̄-values (sample-mean values) for which the density curve falls below the orange dashed line; these values are as favorable or more favorable to the alternative hypothesis than x̄_obs. Thus, computing P(X̄ belongs to the Ha support) amounts to finding the area of the red-shaded region. Note that with this explanation, we justify the multiplication by 2 in the p-value formula for the two-sided test.
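For a symmetric null density, the Ha support described above is simply the set of sample-mean values at least as far from μ0 as x̄_obs is. The following stdlib Python sketch (our illustration, not the R function shown in the article's figures) computes the p-value as the probability of that set:

```python
from statistics import NormalDist

mu0, sigma, n, x_obs = 528, 120, 50, 565
se = sigma / n ** 0.5
null = NormalDist(mu=mu0, sigma=se)

# Ha support: sample-mean values whose null density is at or below the
# density at x_obs.  For a normal curve, these are the values at least
# |x_obs - mu0| away from mu0, on either side.
d = abs(x_obs - mu0)
p_value = null.cdf(mu0 - d) + (1 - null.cdf(mu0 + d))

print(round(p_value, 4))
```

Because the normal density is symmetric, the sum of the two tail areas equals 2 × P(X̄ ≥ x̄_obs), which is exactly the familiar multiplication by 2 for the two-sided test.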
Figure 11 shows an R function for computing the p-value for a two-sided test under the normality assumption. We use this function to calculate the p-value for this example. This p-value agrees with that computed in Rguroo and shown in Figure 10. The z.test() function in the BSDA package in R can be used to perform a z-test if raw data are available.
Table 3. Comparing steps to obtain the critical region using the standardized T statistic versus using the distribution of X̄ directly for a two-sided test.

The t-Distribution Case
In our introductory courses, in addition to the general normal distribution with arbitrary mean and standard deviation, we introduce the standard t-distribution through the unit-less T statistic given in (8). Like the general normal distribution, there is also a general t-distribution with an arbitrary location (mean) and scale (standard error). Let σ_X̄ = s/√n denote the standard error of X̄, where s is the sample standard deviation based on a sample of size n. If T is a random variable that has a standard t-distribution with n − 1 degrees of freedom, then X̄ = σ_X̄ T + μ has a t-distribution with location μ and scale σ_X̄. The probability density of X̄ is given by f(x) = (1/σ_X̄) f_T((x − μ)/σ_X̄), where f_T(x) denotes the probability density function of the standard t-distribution with n − 1 degrees of freedom. Using this location-scale family result, if our data come from a normal distribution with unknown population standard deviation, we propose making inference about a population mean μ using the distribution X̄ ∼ t_{n−1}(μ, σ_X̄), (11) where t_{n−1}(μ, σ_X̄) denotes the t-distribution with location μ, scale σ_X̄, and n − 1 degrees of freedom. The process of teaching the general t-distribution in (11) after students become familiar with the standard t-distribution is similar to that of transitioning from the standard normal distribution to the general normal distribution with an arbitrary mean and standard deviation. Based on our experience, introductory students have no problem understanding the location-scale family concept presented here. Much like using the general normal distribution instead of the standardized Z, using the general t-distribution instead of its standardized counterpart affords our students a better conceptual understanding by working with a statistic that is in the scale of the data. As we will show, probabilities for the general t-distribution can be easily computed using software.
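The location-scale relationship can be made concrete in a few lines of code. The sketch below is our own stdlib-only illustration of the standard t density and its location-scale version f(x) = f_T((x − μ)/σ_X̄)/σ_X̄:

```python
from math import gamma, pi, sqrt

def t_density(t, df):
    """Density of the standard t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_ls_density(x, df, loc, scale):
    """Density of the location-scale t:  X = scale * T + loc."""
    return t_density((x - loc) / scale, df) / scale

# SAT example: location 528, scale 120/sqrt(50), df = 49
loc, scale, df = 528, 120 / sqrt(50), 49
print(t_ls_density(loc, df, loc, scale))   # peak of the curve, at the location
```

Plotting this density next to the corresponding normal curve lets students see that the two nearly coincide for moderate n, with the t-curve slightly heavier in the tails.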
Again, consider conducting the two-sided test in (9) at a significance level α, assuming that the population is normal with an unknown standard deviation. Table 3 compares the steps required for making a decision based on the distribution of the standardized T statistic (left panel) and using the distribution of X̄ directly (right panel). As in the normal distribution case, there are fewer steps involved when avoiding standardization. Moreover, the decision in the last step is more conceptual when stated using x̄*_L and x̄*_U, which are in the scale of the data, rather than t*, which is on the standardized scale. The following is an example.
Example 7 (SAT Example-the t-test). Consider the hypothesis test in Example 5, and assume that the population is normal and that the sample standard deviation is s = 120, based on a sample of size 50. In this case, our null distribution is X̄ ∼ t_49(location = 528, scale = 120/√50). (12) Figure 12 shows Rguroo's probability calculator for computing the critical values x̄*_L and x̄*_U. In the probability dialog shown in Figure 12(a), we select the Probability option and specify the tail probabilities to obtain the critical values. To get a more detailed output, we can perform this test in Rguroo's One Population Mean Inference function. The input for this case is exactly the same as that shown in Figure 8, with two exceptions: instead of filling in the population standard deviation, we fill in the Sample S.d. with the value 120, and we select the t-statistic option.
Figure 13 shows the resulting output. This output consists of the critical region, a critical region graph, the p-value, and a p-value graph. All of these quantities are reported both in standard form, using the T statistic, and in the scale of the data, using the distribution of X̄ directly.
Figure 14 shows an R function for computing the p-value for a two-sided t-test. The p-value for this example is computed using this function, and it agrees with that computed in the Rguroo report shown in Figure 13. The t.test() function in R can be used to perform a t-test if raw data are available.
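For readers who want a self-contained check without statistical software, the two-sided t p-value can be approximated by numerically integrating the t density. The sketch below is our own stdlib-only illustration (Simpson's rule), not the article's R function:

```python
from math import gamma, pi, sqrt

def t_density(t, df):
    """Density of the standard t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t0, df, upper=60.0, steps=20000):
    """P(T >= t0) via Simpson's rule on [t0, upper]; the tail beyond
    `upper` is negligible for moderate df."""
    h = (upper - t0) / steps
    total = t_density(t0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(t0 + i * h, df)
    return total * h / 3

# SAT example with s = 120 and n = 50, working in the scale of the data:
mu0, s, n, x_obs = 528, 120, 50, 565
se = s / sqrt(n)
t_obs = (x_obs - mu0) / se   # distance from mu0 in standard-error units
p_value = 2 * t_upper_tail(abs(t_obs), n - 1)
print(round(p_value, 4))
```

The result is slightly larger than the normal-theory p-value of about 0.029, reflecting the heavier tails of the t-distribution.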
Note that the process of obtaining p-values using the distribution in (11) is similar to that shown in Table 2. Changing Z to T and replacing N(μ0, σ_X̄) with t_{n−1}(μ0, σ_X̄) in Table 2 yields the steps for obtaining the p-value in the t-distribution case.

Confidence Intervals
As in proportion inference, we recommend teaching confidence intervals for a population mean after teaching hypothesis testing. Assuming a normal population, the most commonly used method for obtaining a confidence interval is X̄ ± margin of error, where X̄ is the sample mean and the margin of error is z* × σ_X̄ if σ is known, or t* × σ_X̄ if σ is unknown, with z* and t* denoting the 1 − α/2 quantile of the standard normal and the Student t-distributions, respectively. For the most part, this formula is simple for students to use, and confidence intervals for a population mean can be computed easily by hand or using software. To interpret a confidence interval, we typically teach students to use the template: "We are 100(1 − α)% confident that the true population parameter lies between the lower bound and the upper bound." Beyond these formulas and routine interpretations, it is important for students to gain a more thorough understanding of a confidence interval. For instance, students should be taught the classical interpretation that a constructed confidence interval would contain the true mean 100(1 − α)% of the time in repeated experiments. To demonstrate this, we often use applets and simulations. Applets are very useful in teaching the classical interpretation of confidence intervals, but beyond mechanically rerunning the applets, it is difficult to devise elementary-level exercises for introductory students that drive home the idea. Again, we recommend using the concept of inverting a test to obtain confidence intervals for a population mean. As we explain, introducing this inversion method provides students with a hands-on opportunity to construct confidence intervals and develop a conceptual understanding of them.
For a given value of x̄_obs, we describe a 100(1 − α)% confidence interval for the population mean μ as all values of μ0 for which the two-sided test in (9) is not rejected. As a hands-on activity, we ask students to construct a confidence interval using a given value of x̄_obs and σ_X̄ by creating a grid of μ0-values around x̄_obs and performing repeated tests at each value of the grid. A confidence interval for μ in this case consists of the grid boundaries for which we do not reject the null hypothesis. The following is a specific example that we give to students to work on in groups.
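The inversion idea can also be automated: search for the two boundary values of μ0 at which the two-sided test flips from rejecting to not rejecting. Below is our own stdlib-only sketch, assuming a known σ so that the null distribution is normal; the function names are ours, for illustration:

```python
from statistics import NormalDist

def p_value(mu0, x_obs, se):
    """Two-sided p-value for H0: mu = mu0 when the sample mean ~ N(mu0, se)."""
    z = abs(x_obs - mu0) / se
    return 2 * (1 - NormalDist().cdf(z))

def ci_bound(x_obs, se, alpha, lower=True, tol=1e-8):
    """Bisect for the mu0 where the test flips between reject and not-reject."""
    lo, hi = (x_obs - 10 * se, x_obs) if lower else (x_obs, x_obs + 10 * se)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        inside = p_value(mid, x_obs, se) > alpha   # mid not rejected => inside CI
        if lower:
            lo, hi = (lo, mid) if inside else (mid, hi)
        else:
            lo, hi = (mid, hi) if inside else (lo, mid)
    return (lo + hi) / 2

# Numbers from the group activity below: x_obs = 10, se = 2, alpha = 0.05
print(round(ci_bound(10, 2, 0.05, lower=True), 2),
      round(ci_bound(10, 2, 0.05, lower=False), 2))
```

The bisection returns approximately 6.08 and 13.92, the same interval as x̄_obs ± 1.96 σ_X̄, which makes the duality between the test and the interval explicit.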
Example 8 (Inverting a Test-Group Activity). Consider a situation where the sample size is n = 100, x̄_obs = 10, and the population standard deviation is σ = 20, which leads to σ_X̄ = 20/√100 = 2. We ask our students to perform the test of hypothesis in (9) at a significance level of α = 0.05 for the following values of μ0: 5, 6, 7, 8, 10, 12, 13, 14, and 15, and to form a table of p-values for each μ0. Then, students determine for which values of μ0 the hypothesis H0: μ = μ0 is rejected. Based on this exercise, we ask our students to guess the smallest value less than x̄_obs and the largest value greater than x̄_obs for which they would not reject the null hypothesis. We hint that the smallest and largest such values may not be among the μ0-values that they have tested.
Students should obtain the following p-values:
μ0:      5      6      7      8      10     12     13     14     15
p-value: 0.0124 0.0455 0.1336 0.3173 1.0000 0.3173 0.1336 0.0455 0.0124
As seen from the table, H0 is not rejected at the α = 0.05 level for the values 7, 8, 10, 12, and 13, and it is rejected for the values 5, 6, 14, and 15. In obtaining a guess for the lower and upper bounds of the confidence interval, most student groups in our classes are able to deduce that the lower-bound value should be between 6 and 7 and the upper-bound value should be between 13 and 14. At this stage, we ask the groups to present their guesses. Then, we compute the confidence interval using the formula in (13). For the most part, the guesses should be fairly close to the actual confidence interval bounds. This gives us an opportunity to discuss the difference between the actual 1.96 multiplier and the values of μ0, 6 and 14, that were exactly 2 standard errors away from x̄_obs. In our experience, this has been an engaging activity that helps students understand the relationship between confidence intervals and tests of hypotheses.
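Once students have worked a few of these by hand, the full table of p-values is easy to generate with software. A stdlib Python sketch of the computation (ours; a normal probability calculator works equally well in class):

```python
from statistics import NormalDist

x_obs, se, alpha = 10, 2, 0.05
Z = NormalDist()   # standard normal, used only to evaluate tail areas

pvals = {}
for mu0 in [5, 6, 7, 8, 10, 12, 13, 14, 15]:
    pvals[mu0] = 2 * (1 - Z.cdf(abs(x_obs - mu0) / se))
    verdict = "reject H0" if pvals[mu0] < alpha else "do not reject H0"
    print(f"mu0 = {mu0:2d}: p-value = {pvals[mu0]:.4f}  ->  {verdict}")
```

H0 is rejected for μ0 = 5, 6, 14, and 15 and not rejected for 7, 8, 10, 12, and 13, reproducing the pattern students should find in the activity.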
Depending on the level of the students in your class, you may further exploit the duality between confidence intervals and hypothesis tests. A confidence interval fixes the observed value x̄_obs and asks for which values of μ0 the two-sided test in (9) is not rejected. On the other hand, the hypothesis test fixes the parameter μ0 and asks for which values of x̄_obs we do not reject H0 (i.e., what is the acceptance region?). This concept is illustrated in Figure 15, where the shaded blue region is the region where μ0 − z* σ_X̄ ≤ x̄ ≤ μ0 + z* σ_X̄. Therefore, for a given μ0, as shown on the horizontal axis, we get the acceptance region, shown on the vertical axis, depicting all x̄-values for which the null hypothesis H0: μ = μ0 is not rejected.
Analogously, the shaded blue region can be thought of as the region where x̄ − z* σ_X̄ ≤ μ0 ≤ x̄ + z* σ_X̄. Therefore, for a given x̄_obs, as shown on the vertical axis, we get the confidence interval, shown on the horizontal axis, depicting all μ0-values for which the null hypothesis H0: μ = μ0 is not rejected.

Summary and Discussion
Introductory statistics courses have adapted to technological advancements over the years by including instruction on how to perform statistical analyses using software. To use the full capabilities of statistical software, we should not simply use software as a means of avoiding by-hand computations but take advantage of its potential to demonstrate and teach statistical concepts. The following roadmap for teaching proportion and mean inference aims to achieve the latter, in line with the GAISE guidelines' emphasis on conceptual understanding:
1. Teach discrete random variables, and in particular the binomial distribution.
2. Teach concepts of variability utilizing in-class activities and simulating from a binomial random variable.
3. Introduce the elements of a test of hypothesis (null and alternative hypotheses, Type I and Type II errors) in the context of the one-population proportion problem.
4. Use the exact binomial test to perform tests about a one-population proportion.
5. Use the idea of inverting a test to obtain confidence intervals for a population proportion, and use software to obtain the Blyth-Still confidence intervals.
6. Teach continuous random variables, and in particular the general normal and t distributions.
7. Teach mean inference, assuming a normal population distribution, and use the distribution of the sample mean directly; avoid standardized statistics, the central limit theorem, or simulation-based methods.
8. Teach the central limit theorem using simulation and applets.
9. Adapt the normal-theory inference methods to the non-normal case by utilizing the central limit theorem result.
It is important that students become familiar with the concept of sampling variability before learning statistical inference. As noted by Chance and Rossman (2001), the context of proportion inference is helpful in teaching sampling variability, since simulation of binary data, whether through class activities or via computer software, is relatively easy to implement and understand.
Using the exact binomial method when beginning to teach hypothesis testing provides a smooth segue into learning inference concepts, since the binomial distribution is often introduced early in introductory courses. Furthermore, using this method gives us the opportunity to present inference in the context-friendly setting of count data and allows us to put aside any assumptions on the sample size n and proportion p that one would need when using the asymptotic distribution of p̂.
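As a concrete sketch of the density-based exact binomial p-value (the same Ha-support idea used throughout the article), the following stdlib Python mirrors the convention of R's binom.test(), applied to the numbers of the UCLA example (n = 25, x_obs = 2, p0 = 0.2). This is our illustration, not Rguroo's implementation:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def exact_binom_p_value(x_obs, n, p0):
    """Two-sided exact p-value: total probability of all counts whose
    likelihood under H0 is at or below that of the observed count."""
    d_obs = binom_pmf(x_obs, n, p0)
    return sum(binom_pmf(k, n, p0) for k in range(n + 1)
               if binom_pmf(k, n, p0) <= d_obs * (1 + 1e-7))

p = exact_binom_p_value(2, 25, 0.2)
print(round(p, 4))
```

The small (1 + 1e-7) tolerance guards against floating-point ties, a convention borrowed from R's binom.test().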
Teaching the Blyth-Still confidence interval naturally follows from using the exact binomial test in proportion inference. Although we need software to compute this interval, it allows us to talk about inverting a hypothesis test, which provides students with a conceptual view of a confidence interval. As a side note, since the distribution of p̂ is symmetric or nearly symmetric for large samples, half of the width of a Blyth-Still confidence interval gives a good measure of the margin of error. For small sample sizes, the symmetry of the distribution is not guaranteed, and one cannot obtain an interpretable margin of error.
Regarding mean inference, standardized statistics such as z and t obscure the context of inference problems. By teaching mean inference using the distribution of the sample mean X̄ directly, we allow students to work with a more tangible statistic that is in the units of the data. Furthermore, this reduces the number of steps required to perform a hypothesis test for a population mean. To teach mean inference, we begin by assuming that the population is normal. This keeps students from being distracted by concepts that are not yet needed, such as the central limit theorem, or by having to understand simulation machinery and methods. Only when students have gained more maturity with sampling variability through basic methods should we resort to simulation to introduce the central limit theorem.
It is fair to ask how these methods would extend to the two-population case, where we compare proportions or means. We can resort to Fisher's exact test to teach an exact method for the two-population proportion problem. However, we like the permutation test advocated by GAISE (2016) because it can be introduced nicely through class activities and is intuitive. Permutation tests are included in the Common Core State Standards and have been part of a number of states' curricula for many years. As for mean inference, the one-population methods described in Section 3 can be generalized to the two-population case by using the distribution of the difference of the two sample means directly instead of standardizing it. This has advantages similar to those we discussed for the one-population case.
The methods discussed in this article emanate from the first author's many years of experience in teaching introductory courses and the second author's perspective as a student. The first author's department offers a general education introductory statistics course that is taken by students majoring mainly in the natural sciences, in particular biology, chemistry, and geology. In teaching this course, he has implemented the approach outlined in this article over the past few years. While no formal study has been conducted to compare the proposed methods' effectiveness to more commonly used or simulation-based methods, student feedback has been overwhelmingly positive. Also, as an indication that the proposed approaches are effective, students have been performing significantly better in answering conceptual questions on their exams compared to when the first author used more traditional approaches to teach inference.

Figure 1 .
Figure 1. Using Rguroo's probability calculator to determine the critical region for the test of hypothesis (4).
Figure 2 shows the dialog boxes for this function. The left panel shows the Basics dialog, where we have labeled the Factor as "Student Ethnicity" and the Success as observing a "Hispanic." The values of Sample Size, n = 25, # of Successes, x_obs = 2, Alternative Hypothesis, Ha: p ≠ 0.2, and Significance Level, α = 0.05, are specified. To conduct the exact binomial test, we select the Binomial option. The right panel shows Rguroo's Details dialog, where the Critical Region and P-Value checkboxes are selected to obtain the corresponding graphs.

Figure 2 .
Figure 2. Using Rguroo's Proportion Inference function to obtain the critical region and p-value graphs for the UCLA example.

Figure 3 .
Figure 3. Critical region graph for the UCLA Example.

Figure 4 .
Figure 4. Determining the Ha support and computing the p-value for the UCLA Example.

Figure 5 .
Figure 5. Rguroo's One-Population Proportion output for the UCLA example.

Figure 6 .
Figure 6. R's binom.test() output for computing the p-value for the UCLA example.
Figure 7(b) shows Rguroo's output. The Blyth-Still confidence interval is (0.0144, 0.2559), which is narrower than the Clopper-Pearson interval (0.0098, 0.2603). It is worth noting that the binom.test() function in R computes the Clopper-Pearson interval and does not have an option for computing the Blyth-Still confidence interval. The Clopper-Pearson interval is shown in the R output in Figure 6 and agrees with that computed by Rguroo.

Figure 7 .
Figure 7.The binomial confidence intervals for the UCLA Ethnicity example.

Figure 8 .
Figure 8. Rguroo dialog for specifying parameters for the SAT Example.

Figure 9 .
Figure 9. Rguroo output showing critical region graph for the SAT Example.

Figure 10 .
Figure 10. Rguroo output showing the p-value graph for the SAT Example.

Figure 11 .
Figure 11. An R function for obtaining the p-value for the two-sided normal-theory-based test, with the output for the SAT Example.

Figure 12 .
Figure 12. Calculating the critical region for Example 7.

Figure 14 .
Figure 14. An R function for obtaining the p-value for the two-sided t-test, with the output for the SAT Example.

Figure 15 .
Figure 15. Relationship between the confidence interval and the acceptance region for one-population mean inference.