A tale of two tails: Do Power Law and Lognormal models fit firm-size distributions in the mid-Victorian era?

Highlights
• We test the tail behaviour of mid-Victorian firm-size distributions.
• A Pareto model fits the tail of the firm-size distribution for 1851 and 1861.


Introduction
Fujita et al. in their classic textbook observe that often ''theory gives simple, sharp-edged predictions, whereas the real world throws up complicated and messy outcomes'' [1]. However, Fujita et al. also note that numerous examples occur where the situation is reversed: ''data offer a stunningly neat picture, one that is hard to reproduce in any plausible (or even implausible) theoretical model'' [1]. This unsettled theorisation may explain why a search in Physica A returns 7543 results for ''Power law'', 551 for ''Pareto distribution'', 222 for ''Zipf's Law'', 708 for ''Lognormal'' and 70 for ''Gibrat's Law'', the main fields discussed in this paper. In this literature on firm size, the key sources are the work of the American linguist George Kingsley Zipf [2] and the French engineer Robert Gibrat [3]. However, the Power law, Pareto distribution, Zipf's law, and Gibrat's law have been described on an astonishing number of occasions in the social sciences, economics, biology, linguistics, history, geography, ecology, physics, and elsewhere. This paper focuses on firm size. Firm size has previously been explored using different variables: the frequency, number or percentage of total employees, employment share, sales, income, assets or profits. The unique data examined in this paper permit analysis of both the distribution of the frequency and the distribution of the size, measured by number of total employees, of firms in England and Wales between 1851 and 1881. The relevant literature on the distribution of firm size covers almost all geographical contexts: e.g. [15] for the United Kingdom, [16] and [17] for the United States, [18] and [19] for the G7 countries, [20] for Japan, [21] for Italy, [22] for Portugal, [23] and [24] for China, or [25] for Korea. A related topic, farm size, which is also available in our dataset but is not analysed in this paper, is included in works by [26] and [27] for England and Wales, [28] for South Africa, and [29] for England and the EU. An important contribution to the discussion is the book [30], where complex stochastic processes are used to explain both Zipf's and Gibrat's laws. Advanced econometric techniques are implemented in this paper to test the tail behaviour of our unique sample. The main intention of the paper is to provide evidence of the statistical behaviour of the firm size and frequency distributions using a dataset for England and Wales in mid-Victorian times.
Currently there is a heated debate about the possibility of distinguishing whether behaviour at the tail follows a Power Law distribution or a pure Gibrat's Law, i.e. a Lognormal distribution. Malevergne et al. [31] claim that their approach using the UMPU (uniformly most powerful unbiased) test, following Del Castillo and Puig [32], ''is shown to provide a clear diagnostic, allowing us to distinguish between the power-law and the Lognormal hypothesis, even when the data set is quite small''. Clauset et al. [5] argue that ''[i]t is extremely difficult to tell the difference between log-normal and power-law behaviour . . . [and] it appears unlikely that any test would be able to tell them apart unless we had an extremely large data set''; they nevertheless provide an extended methodology for testing the Power Law and Lognormal hypotheses. Bee et al. [33] and [34] also acknowledge that ''the two different distributions are mathematically different, but only in the limit''; the tail behaviour in their empirical analysis gave approximately the same outcomes for their finite samples, but they provide a Maximum Entropy (ME) test to tackle this question. In this paper, we implement several advanced statistical methods proposed in the recent literature: maximum likelihood estimation to estimate the shape (α) and cut-off (u) parameters; the loglikelihood ratio to compare competing models; the Vuong methodology to find p-values for this comparison; the clipped sample coefficient of variation UMPU-Wilks test to contrast the Pareto with the Lognormal hypothesis on the tail; the Kolmogorov-Smirnov procedure to measure the distance between models, to find the u parameter, and to assess Goodness-of-Fit; and non-parametric bootstrapping to obtain confidence intervals, as proposed by Malevergne et al. [31], Clauset et al. [5], Alstott et al. [35], and Gillespie [36]. The evidence we provide corroborates the claim of Malevergne et al. [31] that ''the test can be derived with extremely high accuracy (even for very small samples)''.
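As a concrete illustration of how this toolkit fits together, the following is a minimal sketch using the powerlaw package of Alstott et al. [35]. The synthetic sample is a stand-in for the census data, which are not reproduced here, and all parameter values are arbitrary.

```python
import numpy as np
import powerlaw

rng = np.random.default_rng(1851)
# Placeholder sample: a lognormal body spliced with a Pareto-like upper tail.
data = np.concatenate([
    rng.lognormal(mean=0.5, sigma=1.0, size=20_000),
    (rng.pareto(a=1.5, size=2_000) + 1.0) * 10.0,
])

fit = powerlaw.Fit(data)      # estimates the cut-off u (xmin) by KS minimisation
print(f"u (cut-off)   = {fit.xmin:.2f}")
print(f"alpha (MLE)   = {fit.power_law.alpha:.3f}")
print(f"KS distance D = {fit.D:.4f}")

# Loglikelihood ratio with Vuong's normalised statistic and its p-value.
R, p = fit.distribution_compare('power_law', 'lognormal', normalized_ratio=True)
print(f"R = {R:+.3f}, p = {p:.3f}   (positive R favours the Pareto tail)")
```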
We are interested in which of the Power Law or the Lognormal better describes the distributions of our data, but to aid interpretation it is important to clarify why firm size should follow these distributions at all. Traditionally, for examination of the whole distribution, focus was given to the simple distinction of whether a firm's decision-making is multiplicative or additive. When a decision is made on the next period of workforce hiring, it is either based on ''I need ten more or twenty fewer workers'', or on ''I need an increase of 10%, or a decrease of 20%''. The former was considered to generate a normal distribution, while the latter was believed to generate a Lognormal distribution; a minimal simulation of this contrast is sketched below. As a result, many authors explain the outcomes of the Lognormal simply as a proportional process that acts recursively over time. For example, Gan et al. [37], using Monte Carlo experiments, showed that no explanatory economic theory is required. Batty [38] argues that this result from complex systems is self-organising, and Corominas-Murtra and Sole [39] establish that Zipf is a common statistical feature when a system goes from order to disorder or vice versa. Arshad et al. [40] give a wider comparison of these simple explanations based on economic shocks, human capital accumulation, or central place theory. At the same time, Gabaix [41] and [42] presents an explanation for the growth model of cities. Nevertheless, more recently Saichev et al. [30] and [43] have shown that simply adding a well-behaved noise to the well-known equation embodying Gibrat's law can switch the Lognormal behaviour to a Power Law for large values in the tail. That is, no additional theory is needed to explain the generation of the Lognormal distribution, as commonly argued, nor indeed of the Power Law.
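The following sketch simulates the two hiring rules with illustrative parameters only (not calibrated to the census data): additive shocks leave firm sizes roughly normal, while the same shocks applied multiplicatively leave the logarithm of firm size roughly normal, i.e. sizes roughly Lognormal.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
n_firms, n_periods = 100_000, 200

additive = np.full(n_firms, 100.0)        # "ten more / twenty fewer workers"
multiplicative = np.full(n_firms, 100.0)  # "10% more / 20% fewer workers"
for _ in range(n_periods):
    shock = rng.normal(0.0, 0.01, n_firms)
    additive += 100.0 * shock             # additive shocks -> roughly normal
    multiplicative *= 1.0 + shock         # multiplicative shocks -> roughly lognormal

print(f"skew(additive sizes)        = {skew(additive):+.3f}   # ~ 0: normal")
print(f"skew(multiplicative sizes)  = {skew(multiplicative):+.3f}   # > 0: right-skewed")
print(f"skew(log multiplicative)    = {skew(np.log(multiplicative)):+.3f}   # ~ 0: lognormal")
```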

Methodology, data, frequency and size distributions
The data for this paper are derived from the original manuscripts of the England and Wales population census. The censuses have been transcribed and encoded into a database recently made available at the UKDA [44]: The Integrated Census Microdata (I-CeM). From this source, as well as additional census data used to infill gaps in I-CeM, employers and the size of their workforce have been extracted and made available as the 'British Business Census of Entrepreneurs' database. This provides the starting point here. The 1851-81 population census questions included data on workforce size. Unfortunately, the question was discontinued after 1881, so our analysis is limited to four censuses: 1851-81. The data are particularly valuable since they are exhaustive, especially for the largest firms: noted as essential by Fujiwara et al. [46] to discriminate between firm size distributions. However, whilst the data for 1851, 1861 and 1881 are almost fully complete, 1871 has transcriber deficiencies that mean that total business numbers are not reliably available (about 20% of data are missing). Hence, whilst the 1871 analysis is included here for completeness, and it generally confirms the same distribution patterns, we do not always attempt to fully interpret this year.
There are two possible distributions that can be built from the employers' answers in the censuses. One is the frequency distribution of firm size: the number of firms for a given workforce size. For instance, in 1851 there were 181,900 firms and 390 different sizes. Their distribution, starting from the left (largest), has 159 sizes (40.77%) of frequency one (159 sizes appear just once), 52 (13.33%) of frequency two (52 sizes appear twice), and 23 (5.9%) of frequency three (23 sizes appear thrice), and so on, ending to the right with one size (0.26%) of frequency 38,111 (the size of two employees) and one (0.26%) of frequency 47,231 (the size of one employee). This is similar to the number of journal citations used in the data by Redner [14], as well as many other examples. The second distribution is the firm size distribution: the workforce size for a given firm. For instance, in 1851 there are 1,185,602 employees and 181,900 firms, and the distribution, starting from the left, has 47,231 firms (25.97%) with one employee, 38,111 (20.95%) with two employees, and so on, ending to the right with one firm with 4287 employees and one with 6000 employees. As a starting point, it is possible to plot both together. Fig. 1, which replicates Figure 1 of Redner [14] for our firm data, uses a log-log scale to represent the relation between frequency and size. These plots represent frequency as a function of size. However, the function is not one-to-one, because for some low frequencies there are many sizes with the same given frequency, as can be seen from the large cluster of data points at low frequencies and relatively large sizes. This is amplified in our data because of bunching: contrary to Redner's data, employers often reported their workforces in round numbers of tens, hundreds and thousands. This explains the higher variance and spread in frequency at large firm sizes. Note also that we use the actual firm size for each firm: firms are not pre-grouped into size classes. This overcomes one of the criticisms of many analyses of firm size: what Bee et al. [33] term ''the widespread practice of binning the data'' (grouping into size categories). Binning loses information and makes statistical inferences less reliable, especially in the tails, which Fujiwara et al. [46] observe are critical: where bins are large, the number of observations is small over a wide size range.
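To fix terms, the following sketch builds both distributions from a hypothetical list of workforce sizes, one entry per firm; the toy numbers are not census values.

```python
from collections import Counter

# Hypothetical workforce sizes, one entry per firm (toy stand-in data).
workforce = [1, 1, 2, 2, 2, 3, 10, 10, 150, 6000]

# Firm size distribution: the number of firms for each workforce size.
size_dist = Counter(workforce)             # {size: number of firms}

# Frequency distribution: the number of sizes occurring with each frequency.
freq_dist = Counter(size_dist.values())    # {frequency: number of sizes}

print(size_dist)   # Counter({2: 3, 1: 2, 10: 2, 3: 1, 150: 1, 6000: 1})
print(freq_dist)   # Counter({1: 3, 2: 2, 3: 1}): three sizes occur once, ...
```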

Table 1
The ten most frequent firm sizes in England and Wales for each census year. Source: Censuses 1851-81.
Note: F1/Fi denotes the ratio of the frequency of the most frequent firm size to the frequency of each size. Grey cells, in 1851, show that the function converting rank to size (and its inverse) is not necessarily monotonic.
Tables 1 and 2 show the most frequent and the largest firm sizes for 1851-1881. First notice that, in grey, the function converting rank to size (and its inverse) is not necessarily monotonic. Also, in 1851 the ratio of the frequency of the most frequent firm size to each frequency ranges from one to fifteen, but in the other years it ranges from approximately one to eight. If we plot the frequency and the size distributions for values of less than forty for each distribution, then in Fig. 2 more than 40% of the frequency distribution is made up of sizes that occur exactly once, with a large drop for frequencies of two and above. Also, in Fig. 3, over 20% of the firms have size one, but the drop across subsequent sizes is smooth and not as sharp as for frequencies. However, for both distributions what matters most for the mathematics of the curve is the upper tail, where the high values appear. Fig. 4 shows the whole distribution of firm size frequency. The upper tail contains the 159 sizes of frequency one, including the four largest firms (2010, 2180, 4287, and 6000 employees). Hence, the distribution does not have a typical scale or size but spans from frequency one to frequency almost 50,000, as described by Newman [6]: ''not all things we measure are peaked around a typical value. Some vary over an enormous dynamic range, sometimes many orders of magnitude''. For the US population, Newman calculates that the ratio of the largest to the smallest population is at least 150,000. This compares with our ratio of almost 50,000 between the most frequent and least frequent firm size across the approximately 500 different frequency categories of firms. For the size distribution, the corresponding figure is approximately 8000, as the size range spans from one to 8000 employees if we expand the analysis to all the distributions. Table 3 compares in detail the summary statistics of the two distributions in each of the four censuses. While the frequency distribution has only around 500 datapoints, the size distribution has about 180,000. As a result, in the frequency distribution the arithmetic mean is of a different order of magnitude from the geometric and harmonic means and the median; a toy illustration of this separation is given below. In contrast, in the size distribution all are similar and close together. All statistics for maximum, range, interquartile range, standard deviation (and variance) and standard error of the mean of the frequency distribution are greater than for firm size. While the size distribution has greater skewness and kurtosis, both distributions exhibit similar coefficients of variation. Table 4 shows the correlations among the frequency and size distributions for each of the four censuses. At first sight, there is a high correlation within each type of distribution: the four frequency distributions are correlated at about 0.99, while the four size distributions are correlated at about 0.95. The correlations across types are also high, at 0.7 and above.
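The separation of the means in the frequency distribution is a direct consequence of its heavy tail. A toy check, with illustrative numbers rather than census values:

```python
import numpy as np
from scipy.stats import gmean, hmean

# Illustrative frequencies: a couple of huge values and many small ones.
frequencies = np.array([47_231.0, 38_111.0, 23.0, 3.0, 2.0, 1.0, 1.0, 1.0])

print(f"arithmetic mean = {frequencies.mean():>10,.1f}")  # dragged up by the tail
print(f"geometric mean  = {gmean(frequencies):>10,.1f}")
print(f"harmonic mean   = {hmean(frequencies):>10,.1f}")
print(f"median          = {np.median(frequencies):>10,.1f}")
```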

Empirical analysis
A relationship y = f(x) follows a Power Law if, after a unit increase of x, y increases by a power. The Power Law is an intermediate rate of increase between the linear and the exponential. If one takes logs of both the dependent and the independent variable, there is a straight-line relationship. The exponential, a model with a higher rate of growth, requires only the logs of the dependent variable to obtain a straight line. Thus, finding linear behaviour after taking logs was one of the holy grails for identifying the type of growth, but as we will see this does not suffice as a test. Indeed, a straight line is only a necessary, not a sufficient, sign of a Power Law. In our case, the Lognormal presents an exponential decline while the Power Law presents just a power decay; consequently, the Lognormal goes to zero faster than the Power Law. Nevertheless, it is also well known that for certain parameter values the Lognormal can present pseudo-linear patterns over the meaningful intervals of analysis. As shown by Malevergne et al. [31], a Lognormal with standard deviation in the interval [2, 3] ''is close to linear over almost four decades'' of city sizes over time. This is observed in our data; a simulation of this pseudo-linearity is sketched below. Thus it is of crucial importance to use other, more advanced methods to distinguish between the Power Law and the Lognormal in the dataset. Consequently, we have two competing behaviours: the Pareto (a Power Law with a cut-off, the x_min or u parameter, defining the tail as shown below) and the Lognormal distribution. (The jump between Power Law, Pareto, and Zipf may be confusing; to clarify, as will become clear below, the Power Law includes the Pareto and the Pareto includes the Zipf.) The probability density function of the Pareto distribution as presented in equation 2.2 of Clauset et al. [5] is p(x) = ((α − 1)/x_min)(x/x_min)^(−α) for x ≥ x_min, while equation (21) of Malevergne et al. [31] uses p(x) = α_0 u^(α_0) / x^(1+α_0) for x ≥ u. It is immediate that α = α_0 + 1 and that x_min = u. Moreover, the Power Law, Pareto, and Zipf are related by the inclusion Power Law ⊃ Pareto ⊃ Zipf. This means that when we say Pareto we are also saying ''Power Law with a cut-off, the x_min or u parameter''; the terms are interchangeable, but also subtly different. The title of the paper uses Power Law because it is the more general terminology, but the tail behaviour is restricted to the Pareto because we are thinking of a ''Power Law with a slowly varying term equal to x_min''. At the same time, when we explore Zipf behaviour we are thinking of a ''Pareto with an exponent α_0 equal to 1'' or a ''Power Law with a cut-off and an exponent equal to 1''.
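The pseudo-linearity is easy to reproduce. The sketch below fits a straight line to the log-log CCDF of a simulated Lognormal with σ = 2.5 (arbitrary parameters) and reports an R² close to one, even though the data contain no Power Law at all:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.sort(rng.lognormal(mean=0.0, sigma=2.5, size=100_000))
ccdf = 1.0 - np.arange(1, x.size + 1) / x.size     # empirical CCDF at each x

mask = (x > 1.0) & (ccdf > 0.0)                    # several decades of the upper tail
lx, ly = np.log10(x[mask]), np.log10(ccdf[mask])
slope, intercept = np.polyfit(lx, ly, 1)
r2 = 1.0 - np.var(ly - (slope * lx + intercept)) / np.var(ly)
print(f"log-log slope = {slope:.2f}, R^2 = {r2:.4f}   # nearly linear, yet lognormal")
```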

Stats
From this general equation, one can impose the additional constraint that L(x) = x_min, so that the Pareto distribution is defined. Finally, if the parameter α_0 in Eq. (3.1) is equal to 1, a Zipf distribution is defined. Notice, importantly, that the coefficient in the CCDF is α_0, as defined in Eq. (3.1), so that Zipf's law is fulfilled when, following Malevergne et al., α_0 = 1, but, following Clauset et al., when α = 2. Henceforth, we call the cut-off parameter u, as in Malevergne et al., and we use Clauset et al.'s α, so that Zipf's law is attained for α = 2. Sometimes this relation is first presented, as we do in Figs. 5 and 7, as a linear relation between the rank, r, and firm size, x, in a log-log plot. But it is important to admit that a linear relation is only a necessary condition for Power Law behaviour, not a sufficient one, as mentioned before. Hence, a linear regression does not corroborate a Power Law, because the Lognormal, the exponential (see [5]) and other distributions can manifest themselves as linear in a rank/size log-log plot. The following equation, Equation (2) of Hoon et al. [25], further describes this relation: log x_r = c − β log r, which can be interpreted as ''for any two companies in the ranking, the difference in their sizes is proportional to their difference in ranks and β is the proportionality constant''. In other words, the elasticity of the relation between rank and firm size is constant.
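A minimal version of the rank/size regression behind Figs. 5 and 7, run on simulated Pareto data with toy parameters; note again that a good linear fit here is necessary but not sufficient evidence of a Power Law.

```python
import numpy as np

rng = np.random.default_rng(7)
sizes = np.sort(rng.pareto(a=1.5, size=5_000) + 1.0)[::-1]  # toy sizes, descending
ranks = np.arange(1, sizes.size + 1)

# The slope of log size on log rank estimates the constant elasticity beta.
neg_beta, log_c = np.polyfit(np.log(ranks), np.log(sizes), 1)
print(f"estimated elasticity beta = {-neg_beta:.3f}")
```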
The Lognormal model has two parameters, µ and σ. According to Johnson et al. [50], x is Lognormally distributed with parameters µ and σ if the logarithm of x is normally distributed with those parameters. The probability density function is f(x) = (1/(xσ√(2π))) exp(−(ln x − µ)²/(2σ²)), x > 0. Limpert et al. [51] and [52] argue that ''[s]kewed distributions are particularly common when mean values are low, variances are large, and values cannot be negative''. For the difference between Lognormal and normal variability, they say ''[a] major difference . . . is that the effects can be additive or multiplicative, thus leading to normal or lognormal distributions''. Limpert et al. [51], as discussed in the Introduction, show how the Lognormal distribution is generated from a sequence of multiplicative random effects, x × c and x / c, where x is a given value of the distribution and c any constant, equivalent to the additive random effects, x + c and x − c, that generate a normal distribution. They add that, just as the normal distribution obeys an additive 68-95-99.7 rule (the arithmetic mean plus and minus one, two, or three standard deviations), so the Lognormal distribution obeys a multiplicative 68-95-99.7 rule, under iterative multiplication and division by the multiplicative standard deviation; a numerical check is sketched below.
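The check below uses illustrative parameters µ = 1 and σ = 0.5: the interval obtained by multiplying and dividing the multiplicative mean µ* = e^µ by powers of the multiplicative standard deviation σ* = e^σ covers the expected shares of a simulated Lognormal sample.

```python
import numpy as np

mu, sigma = 1.0, 0.5                       # illustrative lognormal parameters
mu_star, sigma_star = np.exp(mu), np.exp(sigma)

rng = np.random.default_rng(3)
x = rng.lognormal(mu, sigma, size=1_000_000)

for k, label in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lo, hi = mu_star / sigma_star**k, mu_star * sigma_star**k
    cover = np.mean((x >= lo) & (x <= hi))
    print(f"{label:>6}: [{lo:6.2f}, {hi:6.2f}] covers {cover:.4f} of the sample")
```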

The u cut-off parameter
The first step in testing the behaviour at the tails is to establish at what point each tail starts. To the left of the u parameter the behaviour is assumed to be Lognormal. To the right there is competing behaviour between the Pareto and the Lognormal. To establish the u parameter we use the Clauset et al. methodology described in [10] and [5]: they choose the u that minimises the distance between the Power Law model and the empirical data. To measure the distance they use the maximum distance between the cumulative distribution functions, D = max_{x ≥ u} |S(x) − P(x)|, where S(x) is the CDF of the data for the observations with value at least u, and P(x) is the CDF of the Power Law model that best fits the data in the region x ≥ u; u is then the value that minimises D [5]. Clauset et al. use the max or infinity metric, d_∞, i.e. the so-called Chebyshev distance in the space of functions. They suggest that this Kolmogorov-Smirnov (KS) statistic is well behaved for this model. They even show that this method performs better than the Bayesian information criterion (BIC), another alternative they implement. Malevergne et al. [31] propose a maximum likelihood estimation (MLE) method maximising the likelihood of a piecewise function that is Lognormal below u and Pareto at and above u. We tested our results against this methodology, with similar u estimates. Bee et al. [33] add an ME method, that is, the best approximating density in a non-parametric setup. Generalising, what is needed is a point estimation method to infer the most statistically plausible estimate that accommodates the Lognormal model to the left and the Pareto to the right of the estimate. After calculating u, it is direct to use another MLE estimation to fit the Pareto tail and find the α parameter; a sketch of this search is given below. Finally, using a non-parametric bootstrapping procedure, i.e. by repeatedly subsampling from the empirical distributions (5000 simulations for each of the tails), we can find 95% confidence intervals for each described parameter. Figs. 6 and 8 show the fitted CCDF distributions, with a dashed line going from the estimated u throughout the tail with a slope equal to the estimated α, for all the distributions. We use different methods to implement each statistic, including code by Clauset [53] for MATLAB [54], Alstott et al. [35] for Python [55], and Gillespie [36] and Shalizi [56] for R [57], and our own code to implement the methods not readily available. Our results for the number of firms and the total employees at the u parameter for the eight distributions under analysis are presented in Table 5, and the 95% confidence intervals in Table 6 and Figs. 9 and 10. It can be seen from the results that both reported methods, Clauset et al.'s and Gillespie's, provide the same outcomes: for the small-sample frequency distributions, the û cut-off includes the whole distribution (except that for 1851 it starts at a û of three for the frequency of firms), while for the large-sample size distributions the cut-off is evenly at a û of ten employees. The α estimations of the shape parameter are also given. Table 6 also shows the number of observations within the tail, Total obs − rank(û), and the percentage of firms or employees respectively in the frequency and the size tails, which ranges from 53% to 100% for the frequency tail and from 60% to 77% for the size tail.
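A stripped-down sketch of the KS-minimisation search for u, using the continuous-Pareto MLE and brute force over candidate cut-offs; toy data, and none of the optimisations of the published packages:

```python
import numpy as np

def fit_tail(data):
    """Return (KS, u, alpha) minimising the KS distance over candidate cut-offs."""
    data = np.sort(np.asarray(data, dtype=float))
    best = (np.inf, None, None)
    for u in np.unique(data[:-1]):                # each observed value as candidate u
        tail = data[data >= u]
        alpha = 1.0 + tail.size / np.sum(np.log(tail / u))   # continuous Pareto MLE
        emp = np.arange(1, tail.size + 1) / tail.size        # empirical CDF
        model = 1.0 - (u / tail) ** (alpha - 1.0)            # fitted Pareto CDF
        ks = np.max(np.abs(emp - model))
        if ks < best[0]:
            best = (ks, u, alpha)
    return best

rng = np.random.default_rng(5)
data = (rng.pareto(a=1.5, size=3_000) + 1.0) * 10.0          # toy tail-only sample
ks, u, alpha = fit_tail(data)
print(f"u = {u:.2f}, alpha = {alpha:.3f}, KS = {ks:.4f}")
```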

A proper Pareto test
Hypothesis testing, following Casella and Berger [58], is an inferential method in which a statement about a population parameter is assessed in order to decide, based on a sample of the population, which of two complementary hypotheses is true: the null hypothesis, H, and the alternative hypothesis, K. A test is a function of the sample, T(x_1, x_2, . . ., x_n) = T(X), which specifies a region of the parameter space Θ where the null hypothesis is accepted, Θ_H, and a rejection region where the alternative is accepted, Θ_K.

Table 6
Point estimation and 95% confidence intervals of the Kolmogorov-Smirnov statistic used to estimate the cut-off parameter u, and maximum likelihood estimation of the shape parameter α, using Gillespie's R package [36], with the number of observations in the tails, Total obs − rank(û).

Clauset et al. [5] present a proper Pareto test, claiming that ''[m]ost previous empirical studies of ostensibly power-law [they use power-law but in practice they refer to Pareto] distributed data have not attempted to test the power-law hypothesis quantitatively''. They introduce a null hypothesis of a Pareto distribution versus an alternative hypothesis that a Pareto distribution is not plausible. The method is straightforward. First, they fit the empirical data to a Pareto model using MLE and then calculate the same KS statistic used to pin down the u parameter. They then generate a vast number of synthetic Pareto distributions, using the parameters obtained in the first step, by inverse-transform sampling of the form x = u(1 − rrn)^(−1/(α−1)), where rrn is a random real number uniformly distributed in the interval [0, 1) and, for discrete data, the floor function ⌊ ⌋ is applied; they calculate the KS statistic for each of these synthetic models. The proportion of these values larger than that of the empirical data is the p-value; a sketch of this procedure is given below. Fig. 11 depicts the p-value of this test of the null hypothesis that the upper tail of our frequency distributions is Pareto, using the methodology of Clauset et al. [5]. The test is favourable to the null for all the years except 1871.
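A sketch of the Monte Carlo goodness-of-fit p-value under simplifying assumptions: u is held fixed rather than re-estimated for each synthetic sample, and 1000 rather than 5000 replicates are drawn; the data are a toy stand-in.

```python
import numpy as np

def ks_pareto(tail, u, alpha):
    """KS distance between a sample (all >= u) and a fitted Pareto CDF."""
    tail = np.sort(tail)
    emp = np.arange(1, tail.size + 1) / tail.size
    return np.max(np.abs(emp - (1.0 - (u / tail) ** (alpha - 1.0))))

rng = np.random.default_rng(11)
u, n = 10.0, 2_000
tail = u * (1.0 - rng.random(n)) ** (-1.0 / 0.5)         # toy tail, alpha = 1.5
alpha = 1.0 + n / np.sum(np.log(tail / u))               # MLE on the empirical tail
d_emp = ks_pareto(tail, u, alpha)

exceed = 0
for _ in range(1_000):
    # Inverse-transform sampling from the fitted Pareto: x = u(1 - r)^(-1/(alpha-1)).
    synth = u * (1.0 - rng.random(n)) ** (-1.0 / (alpha - 1.0))
    a = 1.0 + n / np.sum(np.log(synth / u))              # refit alpha per sample
    exceed += ks_pareto(synth, u, a) > d_emp
print(f"goodness-of-fit p-value = {exceed / 1_000:.3f}   # small p rejects the Pareto")
```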

Pareto or Lognormal?
When making a decision (see, for example, Lehmann and Romano [59]), it is necessary to distinguish Type I error (rejecting H when H is true) from Type II error (accepting H when K is true). The power function β(θ), where θ is the parameter tested, contains the information needed to assess these. The ideal is to have β(θ) = 0 for values of the parameter θ ∈ Θ_H and β(θ) = 1 for values of the parameter θ ∈ Θ_K. A test is assessed according to the degree to which it accomplishes this ideal.
Normally, the Type I error is fixed at a given level of significance, equal to 0.01, 0.05 or 0.1. Moreover, a test is uniformly most powerful (UMP) if its power function is closer to one than that of any other test of the same H and K hypotheses, for every value of the parameter in the K subspace. This means that, after setting the Type I error at a given level of significance, we try to find the test that performs best, among those of the same class, at avoiding Type II errors. Finally, when this is not achievable, it can sometimes be demonstrated that an unbiased UMP, or UMPU, test exists, where unbiased simply means that the probability of rejecting H is never lower under the alternative than under the null, i.e. the power never falls below the level of significance anywhere in the alternative subspace, which is a generally plausible requirement.
After introducing this approach, we can present the H and K hypotheses we are confronting, as summarised by Malevergne et al. [31]: ''H: Pareto distribution for values of x larger than some threshold u'' and ''K: Lognormal distribution also for values of x above the same threshold u'' [31]. To test these hypotheses, Malevergne et al. show that a UMP test is not achievable, while they suggest as a UMPU test the maximum likelihood ratio, that is, ''with insertion in the maximum likelihood ratio of the maximum likelihood estimates of the unknown parameters instead of the true values'' [31]. In mathematical statistics, this test is called the Wilks test [31]. Malevergne et al. show that a statistic equivalent to this UMPU-Wilks test is the clipped sample coefficient of variation ĉ = min(1, c), where c is the sample coefficient of variation, that is, the ratio of the sample standard deviation to the sample mean, c = s/x̄ with s² = (1/n) Σ (x_i − x̄)² (see [60], where the confidence interval under the null of Lognormal behaviour expands disproportionately at the tail of the distribution, completely compromising the ability of the test to discern between the vying hypotheses). According to Malevergne et al., the clipped sample coefficient of variation can be deemed a sufficient statistic for the optimal UMPU test. The critical function of this test is a function of the sample, T(x_1, x_2, . . ., x_n) = T(X), that, as shown by [31], rejects the null whenever ĉ < h, where h is the critical value chosen ''such that the probability that the inequality ĉ < h holds under the null hypothesis [Pareto distribution] is equal to the (small) preliminary chosen [level of significance]'' [31]. To derive h, Malevergne et al. suggest two methods: a saddle point method, as shown by Del Castillo and Puig [32] and Gatto and Jammalamadaka [61], and a simpler Monte Carlo approximation. Malevergne et al. refer to Del Castillo and Puig [32], who showed that for the problem of testing the null hypothesis that the upper tail is exponential against the alternative of a truncated normal, the UMPU-Wilks test is, again, the clipped sample coefficient of variation. Finally, they describe how to perform the simpler Monte Carlo method to find h. For this, a large enough number M of samples of the standard exponential distribution with unit parameter has to be generated (they suggest at least 10,000). Then, for each sample, ĉ is calculated and compared with that of the sample under scrutiny. As argued in [31], the ''fraction of exceedances provides a good statistical estimate of the corresponding p-value of the null hypothesis [Pareto distribution]''. A sketch of this Monte Carlo procedure is given below.
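The sketch below follows the Monte Carlo route to the p-value under stated assumptions: a toy Pareto tail, a known u, and M = 10,000 as suggested; the implementation details of Malevergne et al. [31] may differ.

```python
import numpy as np

def clipped_cv(sample):
    """Clipped sample coefficient of variation, c_hat = min(1, s / mean)."""
    return min(1.0, sample.std() / sample.mean())

rng = np.random.default_rng(31)
u, n = 10.0, 500
tail = u * (1.0 - rng.random(n)) ** (-2.0)      # toy Pareto tail (alpha = 1.5)

# Under H (Pareto tail) the log-exceedances log(x/u) are exponential, with
# coefficient of variation 1; under K (Lognormal tail) they are truncated
# normal, with coefficient of variation below 1.
c_hat = clipped_cv(np.log(tail / u))

M = 10_000
c_null = np.array([clipped_cv(rng.exponential(1.0, n)) for _ in range(M)])
p_value = np.mean(c_null <= c_hat)              # fraction of null draws as extreme
print(f"c_hat = {c_hat:.3f}, p = {p_value:.3f}   # small p rejects the Pareto")
```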
Yet another statistic is the logarithm R of the ratio of the likelihoods of the data under two competing distributions, presented by Clauset et al. [5]. If the outcome of the test is positive then the first model has pre-eminence over the second; if the outcome is negative the second should prevail. But it is also necessary to correct for random fluctuations. To do so, the ratio R is standardised by its standard deviation σ using the method first presented by Vuong [62]. The method gives a p-value that tells ''whether the observed sign of R is statistically significant'' [5]. If the p-value is small, say below a significance level of 0.01, the observed sign is unlikely to be a chance result of fluctuations.

Table 7
Loglikelihood ratio test comparing the Pareto hypothesis with an exponential, a stretched exponential, and a Lognormal. The test results provided are the normalised ratio and the p-value calculated from it. A small p-value means the direction of change is not caused only by random variation; a large p-value means that the statistic cannot discern between the pair of distributions. Computed with the Alstott et al. package [35].

The hypotheses tested by Clauset et al. [5] are the following. H: Pareto distribution for values of x larger than some threshold u; against K1: exponential distribution for values of x above the same threshold; K2: stretched exponential distribution for values of x above the same threshold; and K3: Lognormal distribution for values of x above the same threshold. The results comparing the Pareto distribution with the exponential (the absolute minimum alternative, because the usual definition of ''heavy tail'' is against the ''light-tailed'' exponential behaviour [63]), the stretched exponential, and the Lognormal are given in Table 7. All the distributions are confirmed as heavy-tailed, because the likelihood ratios are positive and the p-values smaller than 0.01. For the stretched exponential, the frequency distributions for 1851 and 1871 are positive and significant, but the other two frequency distributions, for 1861 and 1881, are not, even though they are only marginally non-significant. The four size distributions compared with the stretched exponential are best fitted by the Pareto distribution, with strongly significant p-values. In the comparison between Pareto and Lognormal for the size distributions, the year 1851 is significant at the 0.05 level and the year 1861 at any level, ruling out the Lognormal model and confirming the Pareto behaviour. For 1871 the result is marginally inconclusive, and for 1881 it is clearly inconclusive. With these results it is possible to answer the main question of this paper: at least for the years 1851 and 1861, the size distributions are better fitted by a Pareto tail than by a Lognormal one, defining the tails as all the observations above the estimated u cut-off parameter. The test is inconclusive for the frequency distributions, with high p-values and alternating signs. Thus it is not possible to rule out either the Lognormal or the Pareto distribution with the methods employed for the frequency distributions.
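The comparisons in Table 7 can be reproduced with the distribution_compare routine of the powerlaw package [35]; a sketch on stand-in data, with candidate names following the package's conventions:

```python
import numpy as np
import powerlaw

rng = np.random.default_rng(44)
data = (rng.pareto(a=1.5, size=10_000) + 1.0) * 10.0   # stand-in tail sample

fit = powerlaw.Fit(data)
for rival in ("exponential", "stretched_exponential", "lognormal"):
    R, p = fit.distribution_compare("power_law", rival, normalized_ratio=True)
    verdict = "Pareto favoured" if (R > 0 and p < 0.01) else "inconclusive"
    print(f"power_law vs {rival:<22} R = {R:+7.2f}, p = {p:.4f} -> {verdict}")
```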

To Zipf or not to Zipf?
Having constructed, in Table 6, the 95% confidence intervals for the α parameters of the two tails, size and frequency, it is straightforward to test the null hypothesis that the model follows Zipf behaviour. This is relevant because, as noted earlier, Zipf behaviour is ''Pareto behaviour with an exponent equal to 1'' or a ''Power Law with a cut-off and an exponent equal to 1''. The answer is an absolute no: none of the intervals includes a value compatible with Zipf behaviour in the tail. So a rigorous and simple hypothesis test permits us to be conclusive in this respect: the two tails are not Zipfian at the 0.05 level, as implied by the confidence intervals displayed in Table 6.

A comparison with modern times
We compare our data for the latest year (1881) with figures taken from Robson and Gallagher [15] for the UK in 1987, as shown in Table 8. This is the closest year of modern data to the historical censuses of 1851-81. In the table, we compare the number of firms, the number of employees, and the percentage of employment (the percentage of employees in a given bin out of the total employees in the economy) for each size category used by Robson and Gallagher for 1987 with our data for 1881. In this table we have to use bins to match those in the Robson and Gallagher data, although we acknowledge that bins distort the data, as noted earlier. Unfortunately, we cannot perform here the same tests developed in the previous sections of the paper, because the full dataset of Robson and Gallagher [15] is not archived. If the data re-emerged, this would be a promising area for future research. Also, it needs to be stressed that UK data are not equivalent to our data for England and Wales. Nevertheless, the comparisons are valuable. There is an increase of 399.7% (5.8% annually) in the number of firms, from 181,125 to 905,155, and of 574.1% (6.2% annually) in the number of employees, from 2,142,404 to 14,433,000. In the twenty bin size categories, fifteen show a decrease in percentage employment, with a minimum of −63.9% in the 500-999 bin, while five show an increase, with a maximum of infinity in the over-10,000 bin, because there are no firms in this bin for 1881. There are only 179 very large firms of over 10,000 employees in 1987, and they account for a huge 22% of employment; in our historical data the largest 179 firms had 12.3% of employment (calculated from our individual firm data), roughly half the modern figure. Although there is variation between bin categories, the dominant pattern is of positive and high percentage increases in the two largest bins, and decreases in all but one of the small and medium-sized categories. Although constrained by differences in the bases of the data over time and in the bin categories, the lengthening and fattening of the tail is a continuation of the trend we observe in 1851-81, and in line with the expectations of the Pareto and Lognormal distributions: the tail increases massively in fatness and length over time. This is also in line with the economic literature, which has suggested that significant concentration into larger firms probably began in the mid-Victorian period, after the 1870s and 1880s, and continued thereafter (e.g. Hannah [64]).

Discussion and conclusions
This paper explores two distributions of firm data (size and frequency) during the mid-Victorian era to test the Power Law against the Lognormal hypothesis at the tails: that is, a tale of two tails. We show that the Pareto distribution fits the tails of the size data better than a Lognormal one for at least the years 1851 and 1861, and with marginally non-significant values in 1871, but for the frequency distributions the test is inconclusive. Thus, the main contribution of this paper to the literature is to demonstrate that the new dataset of firm sizes in the mid-Victorian era is heavy-tailed, with the tail belonging to the Power Laws rather than the Lognormal distribution. Comparison with more modern data for 1987, although constrained by the available data and categories, confirms that the tails have grown larger over time, whilst the main part of the distribution, for small firms, has shrunk. We have also been able to rule out Zipf behaviour in the tails of our size and frequency distributions. Some of the techniques revisited are the UMPU-Wilks clipped sample coefficient of variation for distinguishing between Power Law and Lognormal behaviour, MLE for point estimation of the u and α parameters, KS to measure the distance between the actual data and the fitted model, and loglikelihood ratios to compare distributions, with the Vuong procedure to obtain the p-values for this comparison.
We use the original firm size data at the individual level, which are exhaustive, thus meeting Fujiwara et al.'s [46] condition that the data be complete enough to allow identification of tail effects. And we avoid grouping (binning), which Bee et al. [33] argue is one of the most overlooked limitations of many previous analyses of firm size. Despite Fujita et al. [1] questioning how to distinguish between distributions (''Is this a solution to the riddle? Our view is that it is ingenious but not entirely satisfactory''), we are now in a position to answer for the distributions in the tail. Using the advances in the literature by Clauset et al. [5] and Malevergne et al. [31], we have shown how rigorous assessments of firm size tail behaviour can be used to test between these two tales. With these procedures it is possible to answer the main question of this paper: at least for the years 1851 and 1861, the size distributions are better fitted by a Pareto tail than by a Lognormal one, defining the tails as all the observations above the estimated u cut-off parameter. This outcome is conclusive and stands as a new addition to the profuse literature on this issue. We are convinced that these procedures recently discussed in the literature permit a proper test of the tale of the tails: are they Pareto or Lognormal above the cut-off? We expect this discussion to carry on in the literature as more tails, current and past, are questioned on their behaviour. At least, we deliver the answer for these newly available mid-Victorian firm size and frequency distributions.

Fig. 4.
Fig. 4. Diagram of the frequency of firms in 1851-1881. (Upper left quadrant 1851, upper right quadrant 1861, lower left quadrant 1871, lower right quadrant 1881.)

Fig. 5 .
Fig. 5. Rank/Frequency plots in decimal logarithmic scales of the frequency of firms by rank of firm size frequency in 1851-1881. (Upper left quadrant 1851, upper right quadrant 1861, lower left quadrant 1871, lower right quadrant 1881.)

Table 2
The ten largest firm sizes in England and Wales for each census year. Source: Censuses 1851-81.

Table 4
Correlation matrix of the frequency and size distributions.

Table 5
Kolmogorov-Smirnov procedure to estimate the cut-off parameter u and maximum likelihood estimation of the shape parameter α, using the Clauset et al. MATLAB package [53] and Gillespie's R package [36], with the percentage of frequency and size of firms in the tail.