Inferring the drivers of language change using spatial models

Discovering and quantifying the drivers of language change is a major challenge. Hypotheses about causal factors proliferate, but are difficult to rigorously test. Here we ask a simple question: can 20th Century changes in English English be explained as a consequence of spatial diffusion, or have other processes created bias in favour of certain linguistic forms? Using two of the most comprehensive spatial datasets available, which measure the state of English at the beginning and end of the 20th century, we calibrate a simple spatial model so that, initialised with the early state, it evolves into the later. Our calibrations reveal that while some changes can be explained by diffusion alone, others are clearly the result of substantial asymmetries between variants. We discuss the origins of these asymmetries and, as a by-product, we generate a full spatio-temporal prediction for the spatial evolution of English features over the 20th Century, and a prediction of the future.


Modelling language evolution
Modelling the collective behaviour of systems of sentient agents, from flocks of birds [1,2], to economies [3], cities [4] and languages [5,6,7], is attractive, but not easy. To quote Emanuel Derman [8], an early adopter of financial modelling, In physics there may one day be a Theory of Everything; in finance and the social sciences, you're lucky if there is a usable theory of anything.
The problems which arise when trying to model social systems depend to some extent on the system in question, but share a good deal in common. In many cases we have only one realisation of the change process which we want to understand, and the data available about the change may be sparse. Because we cannot observe the thought processes and motivations of the agents, we caricature them with a simple model. However, many different models may fit observations similarly well. This problem recedes if we have more data and more realisations, which allow us to apply Statistical Learning methods [9], such as cross validation, to rigorously select the "best model" by maximizing its ability to predict unseen data. If this is not possible, because we have only one realisation of a particular change process, then another approach is to start with some simple default theory, or null model [10], and then seek to determine whether it is sufficient to explain observations. We may also consider an alternative model with incrementally increased complexity and minimal additional assumptions, the aim being to understand what is missing from the null model, while avoiding unjustified assumptions about what should replace it.
The broad question which interests us is: what are the processes which drive the evolution of languages, and how should they be represented in a coarse grained spatial model? A longer term aim is to be in possession of sets stochastic equations, derived from simple assumptions about speaker-level behaviour, which describe how language evolves over space and time, and amongst different social groups. Such equations should calibrate to available data on the language state up until the present, then allow us to understand historic changes, and make future predictions. Here we take a step toward addressing these aims, by fitting a coarse grained spatial model to two large scale spatial datasets which capture the states of various English language features at the beginning and end of the 20th Century.
The mathematical modelling of language evolution, often using methods of statistical physics, is now an established field. Many aspects of language change have been studied using non-spatial models [11,5,12,13,14], and an important focus has been to understand the replacement of one linguistic feature by another, typically following the ubiquitous temporal S-curve [15,16]. Such changes can be driven by numerous mechanisms, including those which favour particular variants (e.g. regularization [17], social biases [18], or exogenous factors [16]), by stochastic effects [19,20], and by biases driven by linguistic variation over the age spectrum [12,13,21,22,23]. In this work we allow speakers a learning bias toward individual linguistic features, without explicitly modelling any particular mechanisms which might generate this. We seek only to determine whether some form of bias existed, given the observed changes.
Spatial modelling of language change is a topic of growing interest [24,25,26,27,28,7,29,30,31]. Whereas non-spatial linguistic data is plentiful, detailed spatial datasets are rarer. Their construction typically requires either large scale collaborative effort [32,33,34], or viral success online [35,36,37]. Spatial modelling presents additional challenges in the form of greater analytical and computational complexity, particularly when matching models to real world spatial language distributions. There is also long standing interest within the quantitative linguistics community in the statistical analysis of spatial, social and temporal variations within real linguistic domains [34,38,39].
Recent spatial models [7,29,30,31], which accurately model geography and population distributions, suggest that the shapes of linguistic domains and the locations of cities and towns can have a partially predictable influence on the evolution of language. In fact, the importance of geography has been well known to dialectologists for a long time [18,40]. From a mathematical perspective, this influence arises from the impact of ge-ography on the shapes of boundaries (isoglosses) between alternative linguistic features, which behave like two dimensional bubbles. The existence of this surface tension effect has recently been tested using historical English dialect data [31]. We therefore believe that space and spatial processes must be part of models which seek to understand language change processes. Here we provide, to our knowledge, the first detailed spatial model which matches the evolution of real linguistic features within an extended spatial domain (England in the 20th Century), accounting for realistic migration patterns, and plausible learning behaviour.

Data and existing theories 2.1 Linguistic variables and variants
For the purpose of academic study, languages may be broken down into distinct components: single units of sounds (phones), rules for combining sounds (phonology), words (the lexicon), rules for word construction (morphology), and rules for sentence construction (syntax). A language may be viewed as a complete specification of all its components, and language change as the process of progressively exchanging components for alternatives. We refer to a language feature for which there are a set of alternatives as a variable, and to the alternatives as variants. These might be different words for the same object or idea, different sounds playing the same role in certain groups of words, or alternative syntactic rules. As an example, the word for the season after summer historically had variants autumn, backend and fall in England [32]. The fundamental quantities which we build models of are the relative usage frequencies of variants at different locations (or regions) in space. Simultaneous spatial variations in many linguistic variables can create distinctive regional dialects.

The Survey of English Dialects and the English Dialect App
We model the spatial evolution of a set of linguistic variables over the 20th Century in England, the largest and most densely populated nation within the British mainland. As initial and final conditions, we use data for the same variables from two surveys: the Survey of English Dialects (SED) [41] and the English Dialects App (EDA) [35,36].
The SED was carried out in the 1950s at 313 localities. With 986 respondents, it is a small survey by modern standards. The localities were selected to provide a relatively evenly-spaced sample across England, but included almost no speakers from urban areas. This was a deliberate choice; the survey sought out the most conservative demographic (typically farm-labourers born in the 1870s and 80s) to capture the network of "traditional" dialects before they disappeared. Methodologically, the SED was rigorous, making few allowances for speed or convenience at the expense of data volume and quality: fieldworkers visited respondents in person and recorded their responses to over a thousand questions in narrow phonetic transcription (and later also on tape). Despite some criticisms arising from systematic differences in how fieldworkers asked questions or transcribed responses [42,43], the SED is a rich and relatively trustworthy source of data and has provided the material for more than half a century's worth of secondary analysis.
The EDA represents a new generation of dialect data collection methods, using digital technologies to reach a large number of respondents very cheaply. Over 50,000 speakers answered 26 questions about their usage through a smartphone app in 2016, and all but one question duplicated variables surveyed in the SED. The EDA exemplifies a different set of trade-offs between data quality, quantity and speed. It is possible to mine the SED data for patterns across many responses and examine variables which were not specifically targeted, whereas the EDA required speakers to decide between predefined categories, limiting analysis to 26 variables and predetermined sets of variants. Instead, the EDA's major advantage is its 45,287 speakers in 39,590 locations within the region covered by the SED. It is also more representative, covering urban areas and all demographics, although, because it was carried out via smartphones, it is somewhat skewed towards younger and more affluent speakers.
In spite of methodological differences, we argue that comparison of the SED and EDA is a viable way to explore language change in 20th century English English. Laboratory perception tests [36] suggest that speakers are largely able to discriminate between the variants of the phonetic variables elicited in the EDA. Spatial patterns in the results of the EDA, when compared to the SED, show that levelling (loss of variants) and isogloss (linguistic boundary) movements are in line with our prior understanding of the mechanisms of language change. However, because typical SED respondents were old, while EDA respondent were younger, we are not simply mapping the 60 or so years of language change between the median dates of the two language surveys. If we assume that speakers' linguistic norms are set early in life and change little in adulthood (the underlying assumption of much historical linguistic [44] and sociolinguistic [45] work on language change), the relevant comparison is instead speakers' dates of birth: the median date of birth is the 3rd of May 1881 for SED respondents, and the 20th of July 1983 for EDA respondents, implying that we are looking at around 100 years of change.
Because urban areas are under-represented in the SED, we must consider whether city varieties are likely to have differed substantially from the varieties in the regions surrounding them. In that case SED maps would be missing islands of highly divergent usage in places where spatial models [7,29,30] suggest that population density gradients should preserve those distinctions or encourage the spread of urban variants. However, the rapid urbanisation of the industrial revolution had started only a couple of generations before the SED speakers were born; less than the three generations required for new dialect formation, according to standard models [46], and certainly not long enough for urban varieties to have substantially diverged from their regional inputs. This is borne out by evidence. For example, Coates [47], surveying research on traditional Bristol English, notes: "what can be shown to distinguish Bristolian at all linguistic levels from other dialects of the region is relatively little". Clearly city varieties at the time of the SED will have had distinguishing features, just as varieties at any point in space differ in some ways from those nearby; but we can assume that they tended to agree in most respects, and so their omission is not likely to be any more problematic than a situation where any other location happened to be lacking in samples.

Current hypotheses about linguistic changes
The overwhelming story of change in English English dialects over the 20th and early 21st centuries is one of loss of diversity. In the words of Britain [48]: There has been such considerable and ongoing dialect attrition that the language use reported across the country by Ellis's survey of 1889 seems, in many cases and in many places, quite alien to that spoken just over one hundred years later.
In place of smaller traditional dialects we find distinctions at larger, regional levels [49,50,51], and much research has focused on these regiolects and the processes of regional dialect levelling [52,53,54,55,49] which generated them. Reductions in geographical linguistic differences are not only due to convergence to one of several local variants but also to geographically widespread adoption of common innovations [50]. These changes are part of a near-universal pattern across the traditional varieties of Europe [56,57,58].
It is believed that interactions between individuals can lead to a decline in linguistic variation via accommodation, where conversation partners adjust their speech to better match each other, and by child learners acquiring accommodated forms [59,60,61,62,63]. However, in certain social contexts, children may also learn variants directly from mobile outsiders [64]. The fact that accommodation is mediated through interactions between speakers has lead some linguists to conclude that the primary drivers in the decline of linguistic diversity are travel, commuting and migration [59,65,49,55,66,50,67]. Other potential drivers discussed include changes in social network structure [59,68,65,49,50,55,66], the age structure of the community [69], the influence of mass media [58,70], normative attitudes and education [71,72,58] and relatedly the salience and stereotype status of particular variants [73,55], identity factors [74], the informalisation of public life [58] and socio-economic forces [55]. Purely linguistic internal factors such as structural regularity, functional economy, or naturalness may determine which variants win out in the levelling process [59,67,49,55] or they may not be relevant at all [71,50]. Arguments have also been made for the central importance of idealogical factors, in particular strong normative attitudes towards the standard, alongside mobility and contact [74,72,75]; parallel to these arguments, it has been suggested that strong alignment of speaker identities with the local community may be enough to check the levelling process and so preserve a distinct local variety [58,51].
There are clearly numerous potential drivers of linguistic change, and it is beyond the scope of this work to quantify their relative importance. It may be that driving processes which operate in different ways at small scales (at the level of motivations, interactions and contacts between individuals) yield similar or identical terms in evolution equations for coarse grained population averages, making the task of inferring the importance of individual effects impossible using such models. However, macroscopic models have been used to infer different classes of driving mechanism from non-spatial linguistic time series, specifically exogenous (population level) vs endogenous (individual level) drivers [16]. We take a similar approach here, but our two classes are spatial processes (migration, movement), and processes which introduce asymmetry with respect to variants (ideology, internal linguistic effects, social prestige, normative bias toward a standard etc.). We do this by defining coarse grained evolution equations which account for both movement and biased copying. We then fit our model to the initial and final conditions supplied by the SED and EDA data, to determine the relative importance of each process. Since movement is often viewed as the primary driver of change, we view variant symmetric dynamics as our null model, and determine the extent to which it can explain the observed changes, before breaking variant symmetry by allowing for biasing factors.

The model
We consider a spatial domain divided into L cells, each containing (approximately) N speakers. The centroids of these cells are written r 1 , r 2 , . . . , r L . Consider a linguistic variable with q ∈ {1, 2, . . .} variants, and let the relative frequencies with which these variants are used within cell r define the frequency vector f(r) = (f 1 (r), . . . , f q (r)) T ∈ ∆ q , where ∆ q is the q-dimensional simplex and T denotes transpose. We have suppressed time dependence for brevity. The cells we use in our analysis are Middle Layer Super Output Areas (MSOAs), a set of L = 7, 201 geographical polygons, each with a similar number of residents, used for reporting census data in England and Wales [76]. The mean MSOA population in 2011 was 7, 787, and the modal area of English MSOAs is approximately 0.29km 2 , corresponding to a few tens of streets within a densely populated city. Interactions with areas outside the English borders, for which we lack survey data, are neglected. We justify this simplification on the basis that population densities are low at the Scottish and Welsh borders, the English make up the great majority (≈ 87%) of the British mainland population, and Wales and Scotland possess distinctive cultural and linguistic identities. We assume that much longer range interactions, for example from American TV, cinema, and music, will be subsumed into biases in the learning function (see section 3.3) which determines the adoption probabilities of different variants by young speakers.

The language community
We refer to the language that a speaker is exposed to as their linguistic environment, and assume this environment consists predominantly of voices from their own and nearby cells. We allow for biasing factors by weighting the influence of variants using a set of biases which affect the learning process (see section 3.3). We capture the local environment of a speaker in cell r using the community influence matrix, W , where W (r, r ′ ) gives the influence that speakers from r ′ have on speakers from r. The matrix is stochastic, meaning that it is square with unit row sums. We define the community frequency vector of a variable as follows f(r) where denotes a definition. We assume that local influence strengths are mediated by two factors: physical separation, and population distribution. A definition which incorporates both is where we call σ the interaction range. To illustrate the competing effects of interaction range and population distribution we consider the one dimensional small-cell limit of the influence matrix, which takes the form of a kernel giving the influence of location x on location x 0W where ρ(x) is population density at x, inversely proportional to cell size at this position. Population density is automatically accounted for in definition (2) because higher density areas have more cells. Figure 1 shows the spatial distribution of influences on three speakers who live in or between two cities with radii and separation similar to Birmingham (population ≈ 10 6 ) and Coventry (population ≈ 3 × 10 5 ). Consider a speaker in central Coventry (x 0 = 15). Most of the influence on them comes from within the city, because there are relatively few people to interact with outside. This makes their effective interaction range smaller. A speaker in central Birmingham (x 0 = −15) will have a wider range of influence because their city is larger. For a speaker in between the two cities (x 0 = 0), influence comes from both, with the larger city dominating (see blue curve in Figure 1). In the limit of very large cities (or other regions of approximately uniform population density) the influence distribution becomes normal so that ≈ 63% of influence comes from within σkm. We assume that speakers' linguistic states are fixed in their youth (see section 3.3), so the interaction range should be chosen to capture the community state observed by a younger speaker. An empirical measure of such separations is provided by school travel distances; secondary education has been compulsory since 1918, so we would expect children to regularly interact with others from their school catchment. In the early 21st century secondary pupils travelled, on average, ≈ 5.5km to school [77]. We will call this the mean catchment radius, denoted µ R . The number of state schools has declined somewhat over the 20th century; in 1951 there were 5,900 vs. 4,072 in 2010 [78]. This implies that average catchment radii have increased, with 1951 radii being ≈ 83% early 20th century radii. Scaling 2010 travel distances by this factor yields an average travel distance of ≈ 4.5km. Suppose we model the distribution of displacements between home and school as a bivariate normal distribution then the mean home-school distance is µ R = π 2 ω and the mean distance between two homes is 2ω = 8 π µ R ≈ 1.6µ R . This suggests a typical distance between interacting speakers is between 5 and 10km. Since σ functions as an upper bound in cities, we take our interaction range as the upper limit of this interval, so σ = 10km ≈ 6miles.
We assume that speakers from a given cell have similar environments, so that f(r) approximates the perceptions of all speakers in zone r. The assumption that all the speakers in a cell access the same primary linguistic data is a simplification of reality. We know that within many geographical areas there are subgroups whose language differences may depend on class, sex or ethnicity [18]. We simplify in this way because we are interested in national level spatial variations for which we have detailed geographical data, but no consistent breakdown by social subgroup, so cannot infer any effects of subgroups. We implicitly assume that social subgroups can induce biases on variants.

Migration
Migratory processes may be internal, where individuals move between locations in a single language area (typically a country), or external, where they arrive from another country or language area. During the 20th Century, foreign-born people have made up a small and slowly changing percentage of the English population; 4.2% in 1951, for example, rising to 6.7% in 1991 [79]. The cumulative total number of migrants to Britain from the early 1800s to 1945 is estimated as 2.34 million [80]. Since numbers are small, we do not explicitly model external migration, instead viewing it as one of many possible factors which may induce biases. To model internal migration, we need to know how often people move, and how far. The English Longitudinal Survey of Ageing (2007), showed that for individuals aged 50-89, the average number of different residences occupied for at least six months during their lifetime was 5.6 for those born 1918-1927, rising to 6.42 for those born 1948-1957 [81], giving an average of approximately one move every ten years. Inter-county migration distances may be extracted from birthplace and place of residence recorded in censuses, providing data back to the mid nineteenth century [82]. Such data is, however, spatially coarse; it provides no information about short range migrations. Modern lifestyle survey data can help fill the gap. Analysis [83] of large scale research polls by Acxion Ltd (2005)(2006)(2007), which recorded current and previous postcode, generated ≈ 1.25 × 10 5 migration distances, having mean 25.77km, median 2.89km and standard deviation 63.91km. This implies a migration distribution concentrated at short distances, with sub-exponential decline at large distances. This large distance migration behaviour is often modelled with a power, or "gravity" law [83,84].
Models of migration must be consistent with migration distance statistics, and constrained so that total flows in and out of cells match observed values. This can be achieved [85] by writing the expected number of people who leave their address in cell r and move to a new address in r ′ as where O(r) is the total number of movers whose previous address was in r, D(r ′ ) is the total number of people whose new address is in r ′ , and h(|r − r ′ |) is a distance function which, typically, decreases for longer moves. The conditions imply that A mover might shift to a new address in the same cell, and the quantity m(r, r) gives the expected number of people who do this within cell r. The sets of constants A {A(r k )} L k=1 and B {B(r k )} L k=1 are found by generating a sequence of sets A 0 , B 0 , A 1 , B 1 , . . . where A 0 is the set for which A(r) = 1 for all r, B k is computed from A k using (9), and A k+1 is computed from B k using (8). The sequence is then iterated to convergence.
We assume that our cell populations are in equilibrium and without loss of generality assume there is one mover per cell, so O(r) = D(r) = 1, and m(r, r ′ ) is a stochastic matrix with elements which represent transition probabilities for a single mover. The matrix can be rescaled to capture realistic migration rates. We define the distance function where the first term models short range moves with typical range a, and the second term is the long distance fat-tail of the distribution, with gravity exponent ρ. The parameter ω ∈ [0, 1] interpolates between entirely short range, and pure power law distributions.
Having calculated m(r, r ′ ) for our system, the cumulative distribution of the migration Figure 2: Cumulative distribution of migration distances generated using distance function (10) with ω = 0.75, a = 3km and ρ = 1.5. The mean, median and standard deviation are 31.3km , 3.6km and 63.4km respectively. For comparison, the corresponding statistics computed from Acxion survey data [83] are 26.3km, 3.6km, and 63.7km. Horizontal dashed lines show the fraction of moves less then ten miles (71%) and fifty miles (87%).
distance X is calculated as where H is the Heaviside step function and R(r) is the radius of a circle with identical area to cell r. The second term in the square brackets accounts for all intra-cell moves, assumed to be of distance R(r)/ √ 2, following [86]. We select distance function parameters to match the short range (median) and long range (standard deviation) statistics of the inter-and intra-cell distance distribution computed in [83] for MSOAs, using the Acxion data. This was achieved by parameter grid search to minimize the sum of the relative errors in the median and standard deviation between the model and the Acxion data. The resulting distribution ( Figure 2) is our estimate for migration distances at the turn of the 21st century. Its shape reflects two kinds of migratory events: frequent short moves, perhaps to change accommodation, and rarer long distance moves, leaving behind friends, work and other social contacts, where the distance travelled becomes relatively less significant.
We now consider whether this distribution may be used to approximate migration during in earlier periods. Increased move frequency over the 20th century does not imply that the distribution of distances has also changed. Although we lack detailed information about historical short range moves, we can compare our distribution to statistics of longer migrations extracted [87] from the National Marriage and Fertility survey (1959)(1960), which sampled more than 2000 couples aged 16-60. The survey showed that 16% of moves were of 10-49 miles and 15% were 50 miles or more. Our corresponding percentages are 16% and 13% respectively ( Figure 3). The similarity of these estimates, and the sample error associated with historical percentages (estimated at ±1%), suggests that the statistics of long distance internal migrations have not changed dramatically. Because people are most likely to migrate in their mid twenties [81], many of the moves in the 1959-1960 marriage survey may have been made much earlier in the century.
We also consider whether our assumption that each cell is in equilibrium is reasonable. Using estimates of net inter-census population changes calculated by Friedlander [82] for counties in England and Wales, we have computed the yearly percentage population change by county, averaged over 1851-1951 ( Figure 3). The results show south east England experienced the greatest inflow of migrants, and that the root mean squared net flow rate over all counties is ≈ 0.45%. By comparing this figure with the overall yearly migration rate of ≈ 10% we see that, on average, in cells that are gaining, for every 20 people who leave, 21 will arrive. Since net changes in population are therefore typically at least an order of magnitude smaller than than absolute population flows, we assume that non-equilibrium effects can be ignored.
Long range migration in principle allows variants to jump between locations within the spatial domain. This jumping process is the subject of the linguistic gravity model [88,18], which attempts to quantify the spread of variants between a finite set of population centres, viewed as points in space. In contrast, the cells in our model cover the entire spatial domain, and population centres appear as densely packed clusters. Other than cell size, there is no intrinsic difference between city and rural cells. The effects of allowing city speakers to behave differently to their rural counterparts, in a spatial model similar to ours, was explored in [29], along with connections to other classical models of spatial linguistic spread including the wave model and hierarchical diffusion [89,40].

Language learning and evolution
We assume an iterative model of language evolution, in which dying adults are replaced by new speakers who select a variant using the probability mass function p(r) g f (r) (12) where g : ∆ q → ∆ q is the learning function. An alternative interpretation, which yields dynamics which have identical deterministic component, but no stochasticity, is that the components of p(r) give the relative frequencies with which new speakers use different variants. We could, if we wished, interpolate between these two interpretations by assuming a mixture of the two behaviours. In either case equation (12) captures the learning process which converts infants into adult speakers. It is an approximation of the cumulative effect of countless small iterations made within a changing learning environment. Underlying this model is the assumption that linguistic behaviour is mainly acquired in childhood, and changes little in later life. Learning to speak requires the assimilation of a variety of structures, from sets of individual sounds and sound patterns, through words and morphological rules, to syntax. Learning processes differ between structures, and we do not aim to capture these processes in detail. Rather we propose a simple but plausible learning function which maps current speech patterns to learned forms. We derive evolution equations by considering the case which generates maximum stochasticity: where new speakers select a single variant. We will then consider the importance of this stochasticity. A simple learning model, analogous to neutral evolution in genetics [90,91,92,5,14], is that variants are selected with a probability equal to their current frequency. This has two problems as a model of language evolution. First, as we will show below, linguistic communities would take an extraordinarily long time to settle on one feature. Second, the geographical boundaries between language features (isoglosses) revealed by linguistic surveys, appear to require some form of social conformity [7,29,31] meaning that speakers preferentially select the most common variants with a probability which exceeds their relative frequency. Such a selection strategy is optimal if advantage can be gained by matching the speech of others. We can incorporate both conformity and asymmetric bias toward different variants by defining the learning function where [f] k denotes the kth component of the vector f ∈ ∆ q . The variables h k > 0 are biases, and β ≥ 1 is the conformity number. If β > 1 then g(f) increases the frequencies of already popular variants, and reduces the frequencies of less popular ones. The biases h 1 , . . . , h q , which we write as a vector h, introduce variant asymmetry, with the selection probability of the kth variant an increasing function of h k . However, g(f ) is invariant under a constant re-scaling of all biases so it is their ratios which determine the extent to which variants are favoured.
To derive equations for the evolution of frequencies, we denote the counts of speakers using the variants of a given linguistic variable within cell r at discrete time step t as We model renewal (birth and death), and migration. At each time step, n R adults die and are replaced with young speakers, and n M migrate. The renewal and migration rates are We set λ M = 0.1, corresponding to one migration per decade, and λ R = 0.02 corresponding to a typical period of 50 years between reaching linguistic maturity, and leaving the language community. We justify this latter estimate on the basis that linguistic maturity occurs in the late teens [93], and that UK life expectancy in the mid 20th century was in the late sixties [94]. The total number of speakers who die or migrate in each cell is n n R + n M . These speakers are selected uniformly at random from the existing population, implying that both lifetimes and times between migrations are geometrically distributed with parameters λ R and λ M , respectively. If each migrator were to select their new cell using the symmetric transition probability matrix m(r, r ′ ), then the expected number of migrators received by each cell would be n M , and the expectation of the mean linguistic state of the migrators incident on cell r would be We model migration by replacing our n M migrators with n M speakers with states selected according to the probability vectorf(r, t). We write the variant counts of these speakers X M (r, t) ∼ multinomial(n M ,f(r, t)). The variant counts of the speakers who replace the dead is a random variable, X R (r, t) ∼ multinomial(n R , p(r)). The variant counts of the speakers who die or migrate is a random variable, Y(r, t), drawn from the multi-hypergeometric distribution [95] with parameters (n, X(r, t)). The change in variant counts between t and t + 1 is then where ∆X(r, t) X(r, t + 1) − X(r, t). Dividing by N, we have where ∆Z(r) ∆f (r, t) − E[∆f (r, t)] is a zero mean noise term. The variance of the kth component of this term, using the properties of the multinomial and multi-hypergeometric distributions, and suppressing r dependence for brevity, is where λ λ R + λ M . The magnitude of the stochastic element of cell dynamics therefore scales as N −1/2 . For reasons set out below, we assume ∆Z(r) ≈ 0, in the current paper.

The importance of noise
To explore the significance of the noise term in (21), we consider a small, well connected community where linguistic evolution is neutral [96,90,91], meaning that new speakers adopt variants with probabilities equal to their relative frequencies, so p(r) = f(r).
In the case of a binary variable, the relative frequency of variant 1 admits the diffusion approximation [97] where W t is a standard Brownian motion [98]. Equation (23) is known as the Wright-Fisher diffusion [97] and describes the evolution of the relative frequency of one of two variants (alleles) of a gene in a haploid organism. Notice that the evolution is driven entirely by noise in this case. Let T be the time taken for the cell population to settle on one variant (f = 0 or 1). This is the fixation time. Starting from the initial condition f = 1 2 , we have To be concrete, let us consider the case where N is the average population of a single cell (N = 7, 787) in our model, equivalent to a 0.25km 2 block of city streets. Using λ R = 0.02 (as above) we have E[T ] = 269, 877 years. From this we see that even in a relatively tiny population, the time for one variant to become dominant by noise alone is comparable to the time for which humans have existed as a species. We therefore neglect the stochastic element of our dynamics on the basis that it is inadequate to explain changes over a plausible time scale. Many of the linguistic variants which interest us have disappeared over the course of a single century and only the deterministic terms in (21) are capable of producing such a change.
Having decided to neglect the stochastic term in our model, we now consider whether other forms of stochasticity could explain the observed changes. We have assumed our small community is "well connected", meaning that all learners are exposed the same inputs. Suppose instead that children learn variants from only a small subset of the community. This effectively subdivides the population into smaller groups (e.g. families) within which variants are reproduced by neutral selection, and between which individuals are exchanged (e.g. by marriage). Such processes are studied in models of neutral genetic evolution in subdivided populations, such Wright's island model [99], and Kimura's stepping stone model [96,91]. At low exchange rates between groups, the time to fixation increases relative to the fully mixed (single group) case. At sufficiently high mixing rate the population becomes effectively panmictic, and fixation time statistics match the fully mixed case. Subdividing the population therefore does not change our decision to neglect the stochastic term in our model. One way for unbiased random copying to be a more powerful driver of change is via the topology and dynamics of the social contact network [92]. For example, if it is strongly clustered then certain individuals or groups become particularly influential. Another possibility is that variants spontaneously gain linguistic momentum [21,12,100] starting from small stochastic fluctuations. We do not rule out either possibility, but emphasize that stochasticity as we have modelled it is inconsequential in large populations. This does not mean that we think language change is deterministic. Rather, unpredictable driving forces must result from factors which are not part of our model. Whether their origin is stochastic momentum-like effects, social network dynamics, internal linguistic biases or societal changes outside the linguistic system, they all exert an effective bias (real or apparent) on language change, but in a direction that cannot be known in advance. We can view these factors as random variables whose values are realised as history unfolds. Having observed our linguistic system's historical state, we can then ask: how strong must these biases have been, acting in combination with migration, mixing and social conformity, to produce today's language distributions? Having established which variants require bias in our simple model, we can then turn to understanding their origin.

Interpreting model parameters
Our deterministic evolution equation models four processes, two spatial, two not. The two right hand terms are the replacement term, with rate constant λ R , and the migration term having rate λ M . Mundane mobility is represented by community matrix W (r, r ′ ) in our model, which is used to compute f(r, t) and migration by the matrix m(r, r ′ ), used to computef(r, t); the effects of these are modulated by conformity β. Together, these represent a purely spatial, accommodationand diffusion-driven model which does not differentiate between variants. Factors which might create bias (see section 2.3) are modelled with the bias vector h. If dialect change has mostly been driven by mobility, as many scholars have suggested, we would expect h ≈ 1. However, if linguistic factors have been crucial in selecting variants, if the standard ideology has driven levelling, or if local identity factors are important in explaining differences between conservative and innovative regions, then h will play a more substantial role. One possibility is that bias drives the selection of the levelling target, but that migration and other spatial factors determine the speed and course of levelling. This would be reflected in the model by high but spatially uniform h. In such a case, bias might reflect a linguistic constraint derived from a property common to the linguistic system at all localities, or a near-universal social pressure (such as normative pressure from universal schooling). Alternatively, observed regional differences in levelling in the EDA data might not be explained by migration and spatial factors alone and must be modelled with a strongly spatially varying bias. This would imply that local differences in language ideologies play an important role, or that interactions between dialect features create internal pressures varying between dialects.

Interaction between conformity and migration
In the absence of bias, when β > 1, then the replacement and migration terms in (25) represent two competing processes. The replacement term moves the community toward a single local variant, whereas the migration term pulls local forms toward the average linguistic state in a much wider linguistic area. Before comparing our model to data, we illustrate how it operates.

Survival of an isolated domain
We consider a binary variable, and let f(t) = (f 1 (t), f 2 (t)) T be its relative frequency vector within a dialect area D. Let r be the escape distance: the expectation of the distance from a randomly chosen point in D, to the boundary of D, in a randomly selected direction. The fraction of migrators who escape D is then approximately 1 − P(X < r) ǫ(r) where P is the migration measure defined in (11). By the symmetry of the migration matrix, this is equal to the fraction of speakers who enter D. Writing φ 1 for the relative frequency of variant 1 outside D, then the average state of the migrators incident on D (including internal movements) will bē The language state of D then evolves (assuming h = 1) as follows The first term generates conformity to the majority variant within D. The second generates levelling toward the external language state. Suppose that most speakers in D use variant 1, which distinguishes them from the nation as a whole, where variant 2 is used (so φ 1 ≈ 0). The equilibria (∆f 1 = 0) of (27) determine the conditions under which D will retain its distinctive feature. Informally, we look for the values of f 1 for which the rates of conformity and levelling are in balance |local conformity| = |national levelling|.
When φ 1 = 0, this condition may be written An upper bound on the migration rate which allows variant 1 to persist within D is obtained by considering solutions in the infinite conformity limit β → ∞ where learners always select the majority local variant. In this case the left hand side of (29) is a unit step at f 1 = 1 2 , and there is a stable equilibrium f * This is an intuitively reasonable condition; it says that a local distinctive feature can in principle survive provided the rate of inward migration does not exceed the rate at which young indigenous speakers replace dying adults. We estimate that during the 20th century, λ M ≈ 0.1. There is some flexibility in the way we interpret λ R . If we consider speakers to be members of the speech community from birth, then λ −1 R is simply life expectancy. However, if speakers are only influential once older, and cease to be so when very old, then λ −1 R is the average of this time interval. As noted in section 3.3, we have set λ R = 0.02. What determines domain stability is not however the absolute values of the migration and replacement rates. It is their ratio For given ρ, the stability of D can be computed for any value of β and r by checking that the dynamics (27) has a stable fixed point f * 1 > φ 1 . The phase diagram in Figure 4 shows that for lower migration-replacement ratios smaller domains can persist for lower values of conformity. In systems where the migration-replacement ratio grows slowly, such as in the English dialect area in previous centuries, we would expect smaller dialect areas to disappear as the phase boundary shifted.

Competition between migration and conformity in the English dialect area
We consider a hypothetical period of 500 years, beginning with a domain in a completely randomised state. This state is constructed by setting the initial frequency f 1 (r, 0) of each cell to be an independent uniform random variable on [0, 1]. We allow the migration rate to be a linearly increasing function of time, starting with zero migration and reaching late 20th century levels by end of the period Within a few decades, recognisable dialect domains emerge ( Figure 5). These are analogous to domains in ferromagnetic materials which form because neighbouring atoms tend to align their magnetic directions. Domain boundaries evolve as if they feel a form of surface tension [101] which penalises curvature. In physical systems, domains gradually expand as their boundaries straighten. The analogy between these magnetic domain boundaries, and isoglosses, was explored in detail in [7,29], including the effects of non-uniform population distribution. It was shown in [7] that concentrated human settlements (villages, towns, cities) balance the surface tension effect by inducing isogloss curvature, implying that stable isogloss shapes are partially predictable from population distributions. At low migration levels, competition between isogloss smoothing, induced by surface tension, and curvature, induced by population distribution (as well as other natural barriers) leads to stable geographical distributions. In Figure 5 the map obtained after 200 years would be stable in the absence of increasing migration. This stability is revealed in Figure 6, which shows how the system evolves starting from an initial condition which is an exact copy of that in Figure 5, but under conditions of consistently low migration (ρ = 1 2 ). This shows that the pattern reached after 200 years remains stable for the next 300 years, and includes a number of isolated dialect areas with small escape radii. Returning to Figure 5 we see that as migration increases these small dialect areas disappear, being replaced with the variant which is in the majority further afield. Eventually the system reaches a state observed in both the SED and EDA surveys, which consists of a major isogloss bisecting England. The stable north east dialect centred on Tyneside is also seen in survey data. The major Wash-Severn isogloss is particularly stable because neither variant is in the majority in the country as a whole, and because it bisects the country via one of the shortest routes. A detailed mathematical explanation of these effects, which are related to continuum percolation, is given in [7,29]. The north east dialect is stable in this simulation because of its status as a dense but geographically isolated population centre.

Methodology
We now explore how our model predicts the evolution of linguistic variables which have been recorded in both the SED and the EDA. These are listed in Table 1. For each variable we evolve our model T = 100 years forward in time using its SED state as our initial condition. To generate this initial condition from raw survey data, we set variant frequencies in each model cell (MSOA) to match the variant frequencies from the nearest SED survey location. If multiple variants were recorded in this location, they are assumed to all have equal frequency. This process yields a tessellation of England into single and sharedvariant regions. Relative frequencies in each cell are then replaced with average frequencies over its 20 nearest neighbours (mean separation 5.4 km). This amounts to a form of smooth interpolation between relative frequencies observed at SED survey points, which have typical separation ≈ 20km (calculated as square root of mean land area per location). The resulting distribution has smooth transitions with widths typically 10-30km, consistent with English transition zones discussed in Ref. [18]. While we would not expect our results to be sensitive to the number of nearest neighbours used in the smoothing (interpolation) step, we have avoided smoothing to a level where frequency values at SED locations would be changed substantially in transition regions (e.g. if the smoothing range approached the typical separation of survey locations).
We consider both bias free (h = 1) and biased evolution. We allow for the possibility that biases may be different in different parts of the country by introducing a spatially varying field h(r) = (h 1 (r), . . . , h q (r)) T (33) where h k (r) is the bias on variant k in cell r. We do not allow the bias field to depend on time because we have only initial and final conditions, and cannot therefore calibrate time dependence. We wish to fit h(r) so that the final state of our simulation, initialised with an SED variable, closely matches the corresponding EDA data. This is achieved by iterating the following learning step, which increases or decreases local biases on variants which are respectively under-or over-represented in the model, when compared to the EDA h n+1 (r) = η(f EDA (r) − f n (r, T )) + θ(1 − h n (r)).
Here h n (r) is the nth iteration of our bias estimate and f n (r, T ) is the frequency distribution obtained using this estimate. The learning rate, η, controls how rapidly adjustments are made, and the reversion rate, θ, is a regularization parameter that prevents runaway bias increases, and maintains h = 1 as the "no-bias" reference point so that calibrations for different variables can be compared. We use the initial condition h 0 (r) = 1.
To avoid over-fitting we introduce a smoothing factor, σ s , which interpolates between independent variation in bias between cells (σ s = 0) and constant bias in all cells (σ x → ∞). The quantity h n+1 (r) is the new bias estimate before spatial smoothing (indicated by the hat symbol). Having applied the learning step (34), we apply a smoothing step using 5. Do you pronounce the "r" in "arm"? rhoticity 1. no /r/, 2. /r/ 6. How do you pronounce the word "three"? realisation of word-initial /T/ This two-step iterative estimation method is continued until the bias field stabilizes. The value of σ s is determined by optimising the trade off between model complexity and model error. We measure complexity, K N , as the average ℓ 1 norm of the difference between the bias in each cell, and the average, h N (r), of its N nearest neighbours This measure approximates the magnitude of the second spatial derivative of the bias field, averaged over the whole system. The number of neighbour cells (N) is chosen to be large enough to smooth out inhomogeneity in the pattern of cell centroids, but small enough so that shorter range fluctuations in the bias field are not ignored. We take N = 20, corresponding to an average neighbour distance of 5.4 km, on the assumption that repeated application of the smoothing step will ensure that fluctuations on shorter scales are small. Complexity will be high for fields with large variations in bias over short distances throughout the system, and low for fields which are approximately constant over large geographical areas. Higher learning rates will generate bigger local changes at each step, but these are counteracted by increased smoothing range. Reducing the learning rate to η ′ = η/k, k ∈ N, is approximately equivalent to applying k smoothing steps between each learning step, or increasing the smoothing parameter to σ ′ s = √ kσ s , so we can explore the behaviour of our calibration method by fixing η and varying σ s . Figure  7 shows the relationship between model complexity and model error using variable 7 as an example. For high smoothing parameter (σ s = 30km) the error approaches that of the maximally parsimonious model, where bias is constant in space. Reducing σ s improves the model error, until we reach an optimal region (σ s ≈ 6km), the "knee" point, with low model errors and minimal complexity. Further increasing complexity has minimal effect on error. We take this knee as our heuristic for determining the optimal level of complexity because this is the simplest explanation that fits the data (Occam's Razor [103]). Exploration of the error vs. complexity trade-off for other questions reveals similarly located knee points. For consistency we use σ s = 6km for all questions (results with σ s = 10km are shown in supplementary material [104]). For comparison, we also calibrate with very high smoothing σ s = 150km in the case β = 2, which produces an approximately constant bias field. We select learning parameters η = 0.15, θ = 0.03 to allow convergence within a feasible time frame, while maintaining stability in the iteration.
We measure the difference between modelled and observed variant distributions at time T using the total variation distance [105] between the variant frequency distributions in each cell, averaged over all cells where · 1 is the ℓ 1 norm. The factor (2L) −1 ensures that the maximum error is 1. In this extreme case no speaker in any cell uses a variant observed in that cell in the survey.   (34) and (35). Very high smoothing parameter σ s = 150km leads to approximately constant spatial bias.
The model is implemented in vectorized Python using sparse arrays which accelerate the calculation of the community frequency vector. The number of bias field iterations is set between 25 and 35, sufficient for the error (37) to stabilize. To calibrate a single question using a single set of model parameters takes approximately 5-10 minutes on a modest laptop (1.7 GHz, 4 cores). For much larger systems than ours, there is potential for the migration matrix to create a bottleneck, requiring further optimization.

Bias free results
We begin by considering the bias free case (h = 1) where the only factors driving evolution are migration, daily movement and conformity. Figure 8 (a) shows the range of model errors in this case. In approximately 25% of cases the model generates a final distribution with error (< 0.2) comparable with errors obtained by fitting a bias field ( Figure  8 (b)). These are variants for which spatial process and conformity play a substantial role in explaining their evolution. Many exhibit one of two phenomena seen in two dimensional coarsening systems studied by physicists [101]: stripe states, and shrinking droplets.

Stripe states
In the bias free case, provided β > 1, our model behaves like a subcritical coarse grained two-dimensional ferromagnet [101,106]. The condition β > 1 allows interfaces to form between single variant domains, analogous to domains of uniform spin alignment in magnetic systems. These interfaces feel a form of surface tension, and in finite systems, stable configurations often emerge where interfaces connect opposite system boundaries [106], creating stripes. Variant [a] of variable 3 in Table 1 illustrates this phenomenon (Figure 9 (a)). Here, we see that the geographical boundary between those who pronounce the vowel in the BATH lexical set [107] (last, staff, brass, . . . ) in the same way as vowel as  the TRAP set (tap, back, badge, . . . ), from those who do not, has been stable over the 20th Century. Note that lexical sets (intoduced by John C. Wells [107]) are groups of words defined by some common phonological feature. The variant [a] (the vowel in northern "bath") dominates in the northern region, where the vowels in TRAP and BATH do not differ. This same stability is predicted by the model, and derives from the surface tension effect which causes interfaces to find the shortest routes across the system, which often begin in boundary indentations [7] (the Wash in Figures 9 (a) and (b)). One important difference between our model and simple magnetic systems is migration, which induces effective long-range interactions equivalent to local biases in favour of the national majority variant. Because the TRAP-BATH split divides the country into approximately equal parts, these biases are close to zero, stabilizing the boundary.
Another example of this effect, shown in Figure 9 (b), is the distribution of the variant [U] of variable 4. This is the northern English form of the stressed vowel in "butter" which belongs to the Wells' lexical set STRUT (cup, suck, . . . ). Speakers from the north typically use the same vowel sound for the lexical set FOOT (put, bush). Although the no-bias model captures the broad overall pattern of FOOT-STRUT, we can see from the EDA distribution that the northern form is in fact receding, which indicates a bias in favour of the southern pronunciation, [2]. Further evidence of such a bias, is that the model predicts levelling of both variants, when in fact the southern form remains dominant in the EDA.
The effect of increasing the conformity number is to make single variant regions increasingly pure, boundaries sharper, and more strongly affected by variations in population distribution. In this paper, for every variable, we have run the model with β ∈ {1, 2, 3}. Mean errors in the biased and no-bias model are summarized in Table 2. For the no-bias model we compute average errors excluding cases where these exceed a given threshold, C, allowing us to consider only cases where the model provides a plausible description of changes. In the biased model we find that conformity generates a better fit than no conformity (a paired t-test with null hypothesis of equal mean error between β = 1 and β = 2 yields p-value 0.0016). There are small differences ( 0.01) between the cases β = 2 and β = 3. In the no-bias model, when errors are within one standard deviation of the biased model error, the conformity model performs marginally better than the conformity free case. If we raise the error threshold to two standard deviations the results are equivocal. Table 2 therefore provides only limited evidence that β > 1 in the no-bias case. We suggest that β > 1 is more plausible because in the long term, linguistic boundaries could not spontaneously emerge or be maintained if β = 1. Moreover, without conformity the unbiased dynamics simply mixes variants, dissolving boundaries, but leaving total variant frequencies over the system as a whole unchanged, meaning that levelling is impossible. Previous modelling work [31] also indicates that the SED distributions are more likely to have been generated by a model with conformity. Finally, we note that the substantial linguistic changes observed during our period of interest may be driven by a wide range of factors, of which conformity is just one, makings its effects harder to discern. A complete set of maps for all variants of all variables for the SED, EDA, no-bias, and bias models (bias fields included) for β = 2, is given in the supplemental material for this paper. For the remainder of the paper we focus on the case β = 2, and consider spatial varying bias (σ s = 6km), approximately constant bias (σ s = 150km) and no bias. Here ε bias and ε denote errors in the bias and no-bias models respectively. The shorthand ε|ε < C denotes mean error conditional on ε < C. Counts give the numbers of questions satisfying the condition. Note that C = 0.17 is one standard deviation above the mean error in the β = 2 bias model, and C = 0.22 is two standard deviations above. For comparison, the mean error with β = 2 and spatially constant bias is 0.15 ± 0.09.

Shrinking droplets
Another phenomenon seen in ferromagnetic interfaces, and in our model, is the shrinking droplet. In two dimensions a droplet is simply a closed curve, evolving under surface tension, which acts to shorten its length. As a result, droplets become more circular and eventually shrink to nothing. In our model, droplets of variants which are in the minority are further reduced by migration. As an example, consider the form of the 3sg feminine possessive pronoun, hers or hern (variable 15, Figure 10) undergo near-complete levelling (disappearance) without bias.

Bias driven evolution
The evolution of many variables in Table 1 may be understood as the result of combined spatial and bias effects. The primary effect of bias is to lend certain variants a special status which may have either linguistic (internal) or social (external) origin. In some cases, a bias appears to accelerate primarily spatial processes. Figure 11 shows the average bias per cell applied to the most biased variant of each variable, using β = 2 (similar results are obtained with β = 3). Variables for which the no-bias model produces an error < 0.2 are starred and, with two exceptions, these variables have the smallest maximum bias per cell. Bias magnitude therefore provides an indication of the extent to which the pure spatial model fails to explain observed changes between the SED and EDA. For the more extreme of the two exceptions (variable 6, Figure 12), the no-bias model correctly predicts the main 20th century change: disappearance of the stopped pronunciation of "three" from the West-Country, but fails to predict that the highly localised Essex pronunciation "free", although it remained a minority variant, would spread across the country, requiring a substantial bias in its favour, to prevent complete levelling.

Special status variants
Consider variable 10, the word for a small piece of wood stuck under the skin. Minority variants like shiver and sliver (see Figure 13) are correctly predicted by the no-bias model to undergo near-total levelling. Such patterns are most typical of variables with numerous variants, many of which were historically restricted to small regions. In contrast, the present day distribution of splinter cannot be explained without bias ( Figure 14). In  the SED this was the dominant southern variant, and the calibrated bias field indicates that it must have attained a special status as a standard variant in order to spread north. However, this progress was halted in far north east England, where spelk persists. Our spatially invariant bias field (Figure 14 (b)) indicates that this effect can be partially, but not completely explained by the region's geographical isolation. According to the spatially varying bias field, spelk not splinter is the variant with special status in the north east ( Figure 14 (a)), without which it would be dominated by splinter, except in the major population centre of the region (Newcastle). A similar effect is observed in variable 12, with the preservation of weak-vowel merger in the north east, when the rest of England has moved to the standard variant (no merger). The no bias model predicts the initial expansion of splinter due to its dominance in the densely populated south. However, its progress north is arrested by repulsion effects [7] created by the major urban conurbations of Manchester, Sheffield and Leeds ( Figure 15). We find that these cities play an important role in the pronunciation of "thawing" (variable 7, Figure 18). An important question is: how should the bias fields be interpreted? In the strictest sense the fields are merely model fitting parameters which measure how much asymmetry needs to be injected into the model to explain the observed changes. Where the variant symmetric model adequately explains the observed changes without the need for additional bias, we expect to see a calibrated bias field close to one. This does not mean that in reality there is no linguistic or social bias, only that conformity and migration alone are sufficient to explain the changes. For example, in Figures 14 and 15 because the no bias model predicts the first part of the northerly expansion of splinter, minimal bias is needed in the south. The major obstacles to further expansion are the cities of Manchester and Sheffield, which consequently have high bias field. This does not imply that surrounding areas necessarily have lower bias, only that it is not needed to explain observations. The bias field is therefore most informative when viewed in conjunction with the predictions of the no-bias model (a complete catalogue is in the supplemental material [104]).
In several cases the no-bias model would fail to predict levelling at all, or would even predict levelling in the opposite direction to that observed. Here the direction of levelling is driven by bias: some variant-specific external or internal factor such as normative bias or markedness must be responsible for the rise in the successful variant. An example of the first type is variable 19 (Figure 16 and to a lesser extent [aU]) were found in complex domains with none in the majority, the no-bias model incorrectly predicts a great deal of mixing without much change in relative proportions (i.e. without levelling). A powerful bias term ( Figure 11) is needed to predict the observed near-complete levelling to [aU] (see Figure 17). In a different linguistic domain, we consider the initial element of the 3sg.m. reflexive pronoun (variable 14). Users of the standard variant, him(self), represented a small minority in the SED and the no-bias model predicts that this variant would disappear; instead, the opposite happened, with substantial levelling towards him(self) in the EDA data. In the model, this requires strong bias in favour of him(self) (see Figure 11).

Isogloss positioning
In certain cases, although the direction of change is driven by positive bias throughout the spatial domain, the model correctly predicts the new location of an isogloss as a product of coastline and population density. In Figure 18, the model captures the movement northward of the intrusive r isogloss so that its current position divides the country in a line stretching from the Humber estuary to the Ribble estuary (the result of coastline in-  dentations) with irregularly shaped differentiation within the northern area (the result of population density patterns). Here the spatially varying and spatially constant bias field produce very similar results (Figure 18 (a) and (b)) indicating that isogloss positioning is influenced in a partially predictable way by geography and population distribution, as predicted in refs. [7] and [29]. The same isogloss appears (spuriously) in the bias free evolution of variable 10 ( Figure 13).

Accelerating spatially driven changes
For some variables, a pure spatial model would predict the direction of change correctly, but at a slower pace than observed: here the bias term is needed to increase the rate of change. An example is the changes affecting the circumstances in which the rhotic consonant /r/ is pronounced (variable 5): the no-bias model correctly predicts that the rhotic-non-rhotic isogloss in the south is retreating westwards and that rhoticity is becoming less categorical within the rhotic regions, but greatly underestimates this trend; bias in favour of non-rhoticity is required to explain how advanced the change is (see Figure  19). Another example is the levelling of the 3sg.f. possessive pronoun (variable 15). The no-bias model correctly predicts that the traditional southern and Midlands variant hern is levelled in favour of standard hers to the extent that no clear isogloss is left, but incorrectly predicts remaining hotspots of traditional usage in high population density areas (Birmingham, Portsmouth/Southampton, Bristol). Bias towards the standard variant accelerates the change in the model, so that these hotspots too are levelled out. A more striking example is variable 8 (the word for autumn) where the no bias model predicts that the autumn-backend isogloss will move ≈ 100km north, whereas in reality backend has all but disappeared, being used by only ≈ 20% of the population in the far north.

What does bias represent?
In some cases bias plausibly reflects the existence of a phonetically natural direction of change (the pronunciation of "l" in shelf shifting from clear, through dark, to vocalised [l], for example, but not the reverse). In principle, interaction between different dialect features could even result in such internal bias being spatially variable, although there are no absolutely convincing examples of this in these results. In rather more cases, however, it seems most likely that bias represents external (non-linguistic) factors and in particular the standard ideology. We can see this wherever bias favours a long-standing standard variant. Examples include the 3sg.m. reflexive pronoun (variable 14), rhoticity (variable 5), and the 3sg.f. possessive pronoun (variable 15). Another example is the decrease in "h dropping" (variable 16) between the SED and EDA (see Figure 20); because this is a phonologically unnatural change which undoes a simplification of the pronunciation system, the bias in the model in favour of [h] may reflect the normative pressure of the standard ([h] has gained the status of standard variant). These results are inconsistent with an account which centres on mobility and convergence as necessary and sufficient explanatory factors for levelling. Instead, other, variant-specific factors-primarily external factors such as the standard ideology-are required.   A final pattern worth discussing is that of innovation diffusion. In four variables, we can plausibly label the changes we see between the SED and EDA as the expansion of a recent innovation: the [f] variant of /T/ (variable 6); the [P] variant of coda /t/ (variable 22); the [ë] and [U] variants of coda /l/ (variable 9); and intrusive r (variable 7). Any account of the dynamics of language change-including non-spatial accounts-must offer an explanation for how innovations take hold. An innovation, by definition, begins as an extreme minority variant, and so we are faced with a challenge to explain why such variants are not immediately levelled in favour of the overwhelming majority. Possible explanations include internal linguistic reasons why particular variants might be favoured (although such explanations must also answer the question of why the change had not already taken place), and social explanations such as age vectors and incrementation or divergence for the purpose of group identity formation [21,100,108]. This is a larger question and not one that this paper seeks to answer; in our models, this too is simply subsumed under the bias term.

Time series
So far we have presented only the initial and final states generated by our model, but we can also provide a complete reconstruction of the history of each variant over the 20th Century. For example, Figure 21 shows our predictions for the evolution of intrusive r in "thawing". In principle such time series can be compared to mid-century data to validate our reconstruction of the historical changes which have taken place. Such reconstructions may have utility in historical linguistics, which aims to understand the sequences of linguistic changes which have led to the current state of the world's languages. If we were to make an initial guess at a historical language state, a model similar to ours, adapted to allow for sequential changes in variants, could be calibrated to later states to provide a predicted history. This could then be compared to predictions made by more traditional linguistic methods. In principle it would be possible to impose additional constraints on the model, set by further data, at intervening points in its evolution. Where changes are still in progress, the model might also be used to predict the future. As an example, we have calibrated a constant bias field for variable 4 (FOOT-STRUT), giving model error 0.106. We then used this field to evolve the model into the future, using the EDA as initial condition. The results are shown in Figure 22, up to the year 2066. Constant bias was used to reduce the chance of over fitting.

Conclusion
We have developed an explicit spatial model of language evolution, accounting for mundane daily mobility, realistic migration patterns, spatial distribution of population, and social conformity. One motivation is that many scholars examining diffusion and levelling in English English over the last century have argued for the primacy of movement (both "routine" mobility and migration) in driving change (cf. section 2.3). This-tacitly or explicitly-implies that ideological and normative factors and internal linguistic pressures play only secondary roles: that widespread levelling would have taken place even without the strong standard ideology that characterises English language attitudes, and that this ideological context served only to hasten or to determine the precise direction of processes which were the product of more fundamental social-technological changes. In scientific terms, we can think of the extreme form of this view, where all change is driven by purely spatial processes (diffusion, long range mixing, interface motion), as a null model -the variant symmetric form of our model. We have shown that although the evolution of some variants can be explained in this way, the majority of historical changes could not have occurred without certain variants gaining special status. In some cases this bias merely accelerates or further drives changes which would have partially occurred by spatial effects alone, requiring a relatively low level of asymmetry. In other cases, strong biases are required either to drive minority variants such as him(self) to national dominance, or to preserve distinctive local forms such as spelk. In most cases the origins of these biases appear more likely to be external social factors such as normative standards, rather than internal linguistic factors. We have also seen that, even when bias is needed to explain observations, spatial effects often play an important role. Our conclusion is therefore that spatial process are important, but insufficient to explain the majority of the changes we have seen. A complete catalogue of results for all variants, including bias fields may be found in the supplementary material [104].
Beyond assessing the importance of spatial processes to language change, this mod-elling exercise represents a step towards the construction of minimalist spatial models which can be used to reconstruct more general historical linguistic changes, and potentially predict future change. In providing a complete hypothetical time series for a long period of time, we are offering explicit predictions which can be compared to data collected in England at any point through the century. Since the bias field is not an observable quantity, it is only these intermediate states that are directly falsifiable. We hope that such comparisons may be used to test, improve or refute our model in future.

Code and data availability
No new data were created or analysed in this study. Burridge is happy to be contacted with reasonable requests for additional maps, or source code.