When I first started doing research, about 65 years ago, every research institution had its own statistics department. It was the duty of the statisticians to make sure that every experiment was correctly designed and analysed, and to supervise the technical staff who tediously calculated sums of squares on the mechanical calculators then in use. While we may have teased the statisticians by calling their department the “Star Factory”, we nevertheless appreciated their wisdom and guidance. Thank heavens people no longer need to pound calculators; such tedium has been taken over by computers. In many cases, however, this has meant that statisticians are seen as unnecessary, and perhaps the teaching of statistics has been neglected.

Similar considerations prompted Webster (2007) to write: “there has grown up a new generation of soil scientists who are as mystified by the analysis (of variance) as those of 30 years ago. Yet, whereas then the scientists did not do the analysis if they did not understand, now, with so much software around, they can do it at the press of a few buttons. They thereby obtain results, but their presentation of them and their discussions that follow suggest to me that in many instances they do not understand what they are doing.”

I illustrate some of the problems that arise using Table 1. I have imagined an experiment in which responses to phosphorus are observed at two levels of lime for a very acidic soil. The soil is so acid (say pH(CaCl2) < 4.5) that growth is limited by aluminium toxicity; the level of lime used raises pH(CaCl2) to near 5.5, sufficient to alleviate aluminium toxicity but not sufficient to enter the region in which high pH decreases the rate of uptake (Barrow et al. 2020). This table does not come from any real manuscript, but is an amalgam of tables I have seen in many submitted manuscripts, and in quite a few published ones.

Table 1 Illustration of the way results are often presented. Values are for imagined yields for an experiment in which response to five levels of phosphate was studied at two levels of lime. You may imagine whatever dimensions you choose

There are several problems with this kind of presentation. One is aesthetic; I find it ugly. It is also inefficient: in practical terms, one can hardly see the results for all the extra material. As Webster (2007) wrote: “The most important outcomes of almost all experiments are the means for the treatments… They should have pride of place in the results section of any paper or report”.

Now let us consider the alphabet soup that decorates such tables. The star factory of earlier times has become an alphabet factory. Methods for making multiple comparisons constitute a large and complicated subject, and one about which many statisticians disagree. Several methods, including those of Duncan (1955) and Tukey (1953) as quoted by Benjamini and Braun (2002), are based on the argument that as the number of treatments increases, the number of possible comparisons also increases, and with it the chance of detecting a false positive. Therefore, the criterion for detecting differences should be more stringent. That would be sensible for an experiment intended to “pick a winner”. An example might be comparing different varieties of a crop for which there is no initial hypothesis and it might be reasonable to compare everything with everything else. However, for most experiments this is not the case; there is an initial hypothesis. Inclusion of an extra treatment, such as an extra level of phosphorus (P) in Table 1, might be expected to improve the ability to test the hypothesis, rather than to decrease the sensitivity. Again, Webster (2007) has an appropriate comment: “One must ask why the inclusion of more treatments in an experiment should diminish the power of a test to detect true differences…: it does not make sense practically.”
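The growing stringency can be seen numerically. The following sketch (in Python, purely for illustration) prints the critical value of the studentized range, on which Tukey's test is based, as the number of treatment means increases; the error degrees of freedom are an arbitrary choice made for the example.

```python
# Illustration: the critical value for Tukey's test grows with the number
# of treatment means compared, so the test becomes more stringent as
# treatments are added. Uses SciPy's studentized-range distribution;
# df = 20 error degrees of freedom is a hypothetical choice.
from scipy.stats import studentized_range

df = 20  # error degrees of freedom (hypothetical)
for k in (2, 5, 10):  # number of treatment means compared
    q_crit = studentized_range.ppf(0.95, k, df)
    print(f"{k:2d} means: critical q = {q_crit:.2f}")
```

Adding treatments thus raises the hurdle every comparison must clear, which is precisely the behaviour Webster questions for experiments with an initial hypothesis.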

There is another common problem which is also illustrated in Table 1: the presence of “±” signs after each entry followed by a number. If the data are represented in figures there will similarly be an error bar of a different length against each entry. There is a logical inconsistency in this custom. One of the most basic assumptions of an analysis of variance is equality or homogeneity of variances – the variance of the data should be the same in all groups. If this is not so, various transformations are applied to the data. If errors are indeed random, for some treatments the replicates will give very similar values, but for others the values will not be so close; that is exactly what one should expect. It does not follow that treatments for which the replicates give similar values are more accurately measured than those for which they do not. Yet again, Webster (2007) has an appropriate comment: “in an experiment in which all the treatments are replicated equally the SE calculated as above is a single value applying to all means. You should quote it rather than SEs calculated separately from the data for the individual treatments or classes.”
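Webster's point may be made concrete with a small sketch (Python, with invented yields for three hypothetical treatments): with equal replication, a single pooled standard error, derived from the residual mean square of the analysis of variance, applies to every treatment mean.

```python
# Sketch: with equal replication, one pooled standard error applies to
# every treatment mean. It is computed from the ANOVA residual mean
# square, not from each treatment's own replicates. Data are invented.
import numpy as np

data = {  # treatment -> replicate yields (hypothetical)
    "P0":  [4.1, 4.5, 3.9],
    "P10": [6.2, 5.8, 6.5],
    "P20": [7.9, 8.3, 7.6],
}
groups = [np.asarray(v, dtype=float) for v in data.values()]
n = len(groups[0])                                  # replicates per treatment (equal)
df_error = sum(len(g) - 1 for g in groups)          # residual degrees of freedom
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
residual_ms = ss_error / df_error                   # pooled error variance
pooled_se = np.sqrt(residual_ms / n)                # one SE for every treatment mean
print(f"pooled SE of a treatment mean = {pooled_se:.3f}")
```

It is this single value that should be quoted, rather than a separate “±” computed from each treatment's own few replicates.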

Quite often, instead of being presented in tables, response data are represented by column graphs (Fig. 1). I do not understand why these are so popular. It seems to me that when the intervals between the treatment levels are not constant, it is difficult to get a true picture of the results when they are presented this way. The effects of the treatments can be much more readily visualised when the data are presented as in Fig. 2. Further, these days it is not difficult to fit non-linear curves to describe such data. This method of data presentation brings out effects that are difficult to see when the other methods of presentation are used. At low pH, application of lime stimulates growth, but decreases the amount of P derived from the soil, as indicated by the dashed extrapolations to the horizontal axis.

Fig. 1

Illustrating the way the data of Table 1 are often presented – as column plots with the levels of P represented as symbols and the columns evenly spaced

Fig. 2

Dot plots of the data presented in Table 1 and Fig. 1. The lines indicate the fit of a version of the Mitscherlich equation expressed as: y = m[1 − exp(−c(x + d))], where y is the yield, x is the P applied, and m, c, and d are parameters. The parameter m indicates the maximum yield to which the data trend; the parameter c indicates the slope of the response curve; and the parameter d indicates the P coming from the soil and seed; its magnitude is indicated by the extrapolations of the lines to the horizontal axis
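For readers who wish to fit such a curve, a minimal sketch follows (Python, using SciPy's curve_fit; the yields and starting values are invented for illustration and are not the data of Table 1).

```python
# Sketch: fitting the Mitscherlich equation y = m*(1 - exp(-c*(x + d)))
# to a P-response curve. Yields and starting values are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def mitscherlich(x, m, c, d):
    """m: maximum yield; c: slope of response; d: P from soil and seed."""
    return m * (1.0 - np.exp(-c * (x + d)))

p_applied = np.array([0.0, 10.0, 20.0, 40.0, 80.0])   # levels of applied P
yield_obs = np.array([1.2, 4.8, 7.1, 9.0, 9.8])       # invented yields

popt, _ = curve_fit(mitscherlich, p_applied, yield_obs, p0=(10.0, 0.05, 5.0))
m, c, d = popt
print(f"m = {m:.2f}, c = {c:.4f}, d = {d:.2f}")
# The fitted curve crosses the horizontal axis at x = -d, the
# extrapolation indicating P derived from soil and seed.
```

The extrapolation to the horizontal axis in Fig. 2 corresponds to the fitted value of d.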

With data such as these, why is there any need to compare individual means? Within each lime treatment, there is a highly significant regression. We can therefore assume that any change in the independent variable will produce a change in the dependent variable, the magnitude of which depends on the fitted relationship, with a level of confidence that depends on the correlation coefficient. That leaves the question of distinguishing the response curves. To do this we set up the null hypothesis that they are not different, in which case a common curve adequately describes the results. We then compare the residual sums of squares to assess the improvement obtained by fitting separate curves. Table 2 shows that the null hypothesis cannot be sustained; the curves are different. However, with only five levels of applied P and with equations containing three terms, there are insufficient degrees of freedom to provide a sensitive test.

Table 2 Illustration of a procedure to test whether the lines in Fig. 1 are statistically different
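The procedure illustrated in Table 2 can be sketched as follows (Python; the yields are invented, and the Mitscherlich form and starting values are assumptions made for the example): fit a common curve to the pooled data, fit separate curves, and test the improvement with a variance ratio.

```python
# Sketch: testing whether two response curves differ by comparing the
# residual sum of squares for one common curve with that for separate
# curves (an extra-sum-of-squares F test). All data are invented.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import f as f_dist

def mitscherlich(x, m, c, d):
    return m * (1.0 - np.exp(-c * (x + d)))

x = np.array([0.0, 10.0, 20.0, 40.0, 80.0])      # levels of applied P
y_nolime = np.array([0.8, 3.1, 4.9, 6.4, 7.0])   # hypothetical yields
y_lime   = np.array([1.5, 5.2, 7.4, 9.1, 9.9])

def rss(xdata, ydata):
    """Residual sum of squares after fitting one Mitscherlich curve."""
    popt, _ = curve_fit(mitscherlich, xdata, ydata,
                        p0=(8.0, 0.05, 3.0), maxfev=10000)
    return float(((ydata - mitscherlich(xdata, *popt)) ** 2).sum())

rss_separate = rss(x, y_nolime) + rss(x, y_lime)        # 6 parameters in all
rss_common = rss(np.concatenate([x, x]),
                 np.concatenate([y_nolime, y_lime]))    # 3 parameters
n, p_sep, p_com = 10, 6, 3
F = ((rss_common - rss_separate) / (p_sep - p_com)) / (rss_separate / (n - p_sep))
p_value = f_dist.sf(F, p_sep - p_com, n - p_sep)
print(f"F = {F:.2f}, p = {p_value:.4f}")
```

Note the small residual degrees of freedom (n − p_sep = 4 here), which is exactly why five levels and three-term equations give an insensitive test.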

How should one choose treatments when planning to compare responses? Is it better to have for each response curve, say, 10 individual points or five points that are the means of two observations? There are some general considerations. It is always better to explore the response over as wide a range of observations as possible, and so the more levels the better. Further, there is no need to replicate observations when the intention is not to compare individual observations but to compare response curves. In order to explore how to choose treatments more thoroughly, I wrote a small BASIC program (online resource 1).

The program first generates random errors in the observations with a normal distribution and a specified standard deviation. Each group of 20 generated values was analysed in two different ways. In one, pairs of values were averaged and the resulting 10 means were taken to represent treatments as indicated in Fig. 3a. In the other, the 20 individual values were taken to represent treatments as indicated in Fig. 3b. The improvement produced by fitting separate curves rather than a common curve was then analysed as illustrated in Table 2, and for each analysis the variance ratio and the p value were calculated. In order to generate smooth frequency distribution curves, I used 5000 sets of randomly generated groups of 20 values for each value of the standard deviation. The frequency distribution of the p values was calculated using LibreOffice Calc and the curves plotted using SigmaPlot.
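The logic of the simulation can be sketched as follows (the original program is in BASIC, online resource 1; this is a simplified illustration in Python that fits straight lines rather than Mitscherlich curves, uses 500 rather than 5000 runs, and invents the true slopes and the error standard deviation).

```python
# Simplified sketch of the simulation: the same 20 noisy observations per
# curve-pair are analysed either as individual values or as means of pairs,
# and the p value for separate-versus-common curves is found each way.
# Straight lines stand in for the response curves; all settings are invented.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)

def p_separate_vs_common(x, y, group):
    """Extra-sum-of-squares F test: two separate lines vs one common line."""
    def rss_line(xs, ys):
        coeffs = np.polyfit(xs, ys, 1)
        return float(((ys - np.polyval(coeffs, xs)) ** 2).sum())
    rss_common = rss_line(x, y)
    rss_sep = sum(rss_line(x[group == g], y[group == g]) for g in (0, 1))
    df_extra, df_resid = 2, len(x) - 4        # 2 parameters per line
    F = ((rss_common - rss_sep) / df_extra) / (rss_sep / df_resid)
    return float(f_dist.sf(F, df_extra, df_resid))

x5 = np.array([0.0, 10.0, 20.0, 40.0, 80.0])  # 5 levels per curve
x10 = np.repeat(x5, 2)                        # duplicated -> 10 obs per curve
sd = 1.0                                      # error standard deviation
p_individual, p_means = [], []
for _ in range(500):
    ya = 2.0 + 0.05 * x10 + rng.normal(0.0, sd, 10)  # curve A
    yb = 2.0 + 0.08 * x10 + rng.normal(0.0, sd, 10)  # curve B, truly different
    # Way 1: all 20 individual values
    g = np.repeat([0, 1], 10)
    p_individual.append(
        p_separate_vs_common(np.concatenate([x10, x10]),
                             np.concatenate([ya, yb]), g))
    # Way 2: the same values averaged in pairs -> 10 means
    ya_m = ya.reshape(5, 2).mean(axis=1)
    yb_m = yb.reshape(5, 2).mean(axis=1)
    g = np.repeat([0, 1], 5)
    p_means.append(
        p_separate_vs_common(np.concatenate([x5, x5]),
                             np.concatenate([ya_m, yb_m]), g))
print(f"median p, individual values: {np.median(p_individual):.3f}")
print(f"median p, means of pairs:    {np.median(p_means):.3f}")
```

The individual-value analysis leaves more residual degrees of freedom against which to test the variance ratio, which is what drives the lower p values reported below.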

Fig. 3

Two ways treatments may be chosen when comparing two curves. In both cases, 20 observations are made; in part a, there are 10 means of 2 observations; in part b, 20 individual points

Figure 4 shows that the p values are on average much lower when 20 individual values are analysed. This is mainly because there are more degrees of freedom against which to test significance. As the “experiment” becomes less precise – that is, as the standard deviation of the error increases – both tests become less sensitive and the tails of the distributions may overlap. Nevertheless, on average, a more sensitive test is obtained when 20 individual values are used. Thus, in contrast to the generally accepted wisdom, you are more likely to distinguish curves if you devote your effort to more levels rather than to more replications.

Fig. 4

Effect of the precision of an experiment as simulated by varying the values for the standard deviation (sd) of the distribution of errors. Values of p are for the significance of the difference between two response curves each of which is based on 10 observations. In one case there are 10 individual observations per curve; in the other, the 10 observations are 5 pairs of duplicates giving 5 means of two observations. The vertical line indicates p = 0.05

In summary: “In experiments with graded treatments do not make multiple comparisons of any kind; instead fit a response curve and analyse the data by regression” (Webster 2007). To which I would add, if you are not to compare means, it makes no sense to replicate observations. If you do so, you lose sensitivity.