Q-learning-based migration leading to spontaneous emergence of segregation

Understanding population segregation and aggregation is a critical topic in social science. However, the mechanisms behind segregation are not well understood, especially in the context of conflicting profits. Here, in the context of evolutionary game theory, we study segregation by extending the prisoner’s dilemma game to mobile populations. In the extended model, individuals’ types are distinguished by their strategies, which may change adaptively according to their associated payoffs. In addition, individuals’ migration decisions are determined by the Q-learning algorithm. On the one hand, we find that such a simple extension allows the formation of three different types of spontaneous segregation: (a) environmentally selective segregation; (b) exclusionary segregation; and (c) subgroup segregation. On the other hand, adaptive migration enhances network reciprocity and enables the dominance of cooperation in a dense population. The formation of these types of segregation and the enhanced network reciprocity are related to individuals’ peer preference and profit preference. Our findings shed light on the importance of adaptive migration in self-organization processes and contribute to the understanding of segregation formation processes in evolving populations.


Introduction
Segregation, a fundamental social phenomenon that has drawn the attention of sociologists and economists [1,2], is related to individuals' religion, language, wealth, skin color, and social relationships. Schelling [3][4][5] proposed a well-known segregation model in the 1970s based on a two-dimensional square lattice consisting of two types of individuals distinguished by color, e.g. white and black. The model depicts the dynamics of segregation and finds that individuals of the same type congregate, while those of different types gradually separate, as shown in figure 1(a). This helps reveal the formation process of urban areas, homogeneous communities, racial discrimination, and residence [2,6].
Personal preferences and movement patterns are key components of the Schelling model. The former is usually configured as a utility function with a tolerance parameter that is defined as the critical number of neighboring peers, and the latter is a random movement that satisfies personal preferences. More precisely, people prefer locations in which they are surrounded by more of their peers. Given a binary utility function, individuals feel 'happy' (getting one utility) if they are in a location in which the number of similar individuals exceeds a certain threshold. Otherwise, 'unhappy' individuals (getting zero utility) try to satisfy their preference by moving to a randomly selected neighboring void site. For further research on segregation, many modified variants of Schelling's model have been presented [7][8][9][10][11]. However, quantitative analysis of segregation was initially nearly impossible, since researchers could distinguish segregation only by visualizing the distribution of individuals. Inspired by methods from physics, researchers later proposed several metrics that quantify the degree of segregation in a population [12][13][14][15].
Human behavior is considered to be driven by profit preferences. In reality, for example, in order to get more benefits, the rich gather together and merge industries, so their work and social circles are separated from those of the poor, and the resulting economic gap further increases the separation of the poor and the rich [16,17]. Previous research on segregation has concentrated on peer preferences, whereas there has been little research on segregation caused by profit preferences. The conflict of personal profit preferences leads to the emergence of social dilemmas [18,19], in which cooperative behavior in favor of collective interests is threatened by self-interested acts of defection and may disappear. Evolutionary game theory is a powerful paradigm for exploring the evolution of individual conflicts of interest [20][21][22][23][24][25], based on which research has found that mechanisms such as rewards [26][27][28], punishments [29][30][31], exit rights [32,33], network reciprocity [22,23,34], and migration [35][36][37] can help establish and maintain cooperation. In addition, explaining the evolution of cooperation based on human behavior patterns has yielded fruitful results. Aside from the traditional tit-for-tat and win-stay-lose-shift strategies [18,38], machine learning is also a viable way to explore human behavior [39][40][41][42][43][44][45][46][47][48]. Several studies have shown that individual behavioral patterns can be interpreted by the Bush-Mosteller model [42][43][44] and Q-learning [43,45,46] in reinforcement learning, since these algorithms include a trial-and-error process as well as behavioral stimulus signals from the environment, and they follow the same logic that real people use to determine the best strategy [47,48].
Regarding movement patterns in the context of a social dilemma, random diffusion is insufficient to characterize individual strategic migration motivated by profit [49,50]. The success-driven rule [35] is a widely adopted migration pattern in which individuals prefer the migration destination with the highest expected payoff. However, it has two limitations. First, when making decisions in complex situations, individuals consider a variety of factors: they are concerned not only with how much they will gain or lose by migrating, but also, driven by peer preferences, with staying in places with more peers, whereas success-driven migration bases decisions only on the highest payoff. Second, the success-driven rule requires adequate information about the expected payoffs of the destinations. In reality, the information accessible to individuals is limited, so the success-driven rule cannot work without it. In contrast, the trial-and-error nature of reinforcement learning allows individuals to learn from past experience and explore optimal strategies with limited information. Furthermore, Q-learning can construct a strategy space by combining states and actions, and it can reflect individuals' comprehensive strategies regarding peer and profit preferences [43,45,46], which helps us explore the evolution of cooperation in segregation.
In this work, we investigate segregation via an evolutionary model in which individuals are either cooperators or defectors. Individuals who want to maximize their own profit (or utility) in interactions with their neighbors can not only move to a more advantageous site via adaptive migration but also adjust their strategies by the best-take-over rule. We discovered three types of spontaneous segregation in the self-organizing movements of evolving populations: (a) environmentally selective, (b) exclusionary, and (c) subgroup. The rest of this article is organized as follows. In section 2, we explain our evolutionary model after describing the concept of the prisoner's dilemma game (PDG) and the Q-learning algorithm. In section 3, we present detailed Monte Carlo (MC) simulation results and analyze them. In section 4, we summarize our findings and discuss conclusions.

PDG
Evolutionary game theory provides a basic framework for understanding the interaction between individuals in social dilemmas by constructing game models and strategy updating rules [18][19][20][21][22]. The PDG is a pairwise paradigm in evolutionary games in which both individuals can interact with each other using either cooperation C (denoted as a vector $I = (1, 0)^{\mathrm{T}}$) or defection D (denoted as a vector $I = (0, 1)^{\mathrm{T}}$). Mutual cooperation yields the reward R for each of them, whereas mutual defection leads to the punishment P. When a cooperator meets a defector, the former receives the sucker's payoff S and the latter receives the temptation to defect, T. In general, the four payoff elements satisfy the conditions T > R > P > S and 2R > T + S, implying that mutual cooperation maximizes collective benefit, whereas unilateral defection maximizes individual benefit. Thus, the Nash equilibrium of the PDG is mutual defection. To simplify the model without losing generality, we use the so-called weak PDG with fixed R = 1, T = b, and S = P = 0, so the payoff matrix is as follows:

$$M = \begin{pmatrix} R & S \\ T & P \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ b & 0 \end{pmatrix}, \qquad (1)$$

so that an individual, say i, with strategy $I_i$ can have a pairwise interaction with the neighboring individual j with strategy $I_j$ and obtain the payoff

$$\Pi_{ij} = I_i^{\mathrm{T}} M I_j. \qquad (2)$$
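As a minimal illustration of the weak PDG described above, the following Python sketch evaluates the pairwise payoff from the strategy vectors and the values R = 1, T = b, S = P = 0. The function names `payoff_matrix` and `pairwise_payoff` are hypothetical and chosen only for this example; the bilinear form "own strategy times matrix times opponent strategy" is assumed to be how the payoff is evaluated.

```python
# Strategy vectors as given in the text: C = (1, 0)^T, D = (0, 1)^T.
C = (1, 0)
D = (0, 1)


def payoff_matrix(b):
    """Weak PDG payoff matrix with R = 1, T = b, S = P = 0.

    Rows index the focal individual's strategy, columns the opponent's.
    """
    return ((1.0, 0.0),
            (b, 0.0))


def pairwise_payoff(own, other, b):
    """Payoff to `own` against `other`, assumed to be own^T M other."""
    m = payoff_matrix(b)
    return sum(own[x] * m[x][y] * other[y]
               for x in range(2) for y in range(2))


if __name__ == "__main__":
    b = 1.4
    for name, (s1, s2) in {"C vs C": (C, C), "C vs D": (C, D),
                           "D vs C": (D, C), "D vs D": (D, D)}.items():
        print(name, pairwise_payoff(s1, s2, b))
```

Only a defector facing a cooperator earns b, and only mutual cooperation earns 1; every other pairing earns nothing, which is why only cooperative neighbors contribute to an individual's payoff in this setting.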

Q-learning algorithm
Q-learning is a classical algorithm in reinforcement learning, allowing individuals to make optimal decisions according to the state-action value (the so-called Q-value) [45]. More specifically, each individual chooses an action $a(t) \in \{a_1, a_2, \ldots, a_n\}$ based on the maximum Q-value $Q_{s,a}(t)$ of the current state $s(t) \in \{s_1, s_2, \ldots, s_m\}$ at time t, and obtains a reward $r_{s,a}(t)$. The state is then transferred to the state at the next time step, s(t + 1). The Q-value estimates the future benefit of taking an action in a given state and is recorded in the Q-table, indexed by state (row) and action (column):

$$Q(t) = \begin{pmatrix} Q_{s_1,a_1}(t) & \cdots & Q_{s_1,a_n}(t) \\ \vdots & \ddots & \vdots \\ Q_{s_m,a_1}(t) & \cdots & Q_{s_m,a_n}(t) \end{pmatrix}. \qquad (3)$$

The optimal decision is usually difficult to solve for, especially with high-dimensional state and action sets, but the value iteration method can be used to approximate the value of each action by combining the action, state, and reward [46]. Thus, each element of the Q-table can be estimated by equation (4):

$$Q_{s,a}(t+1) = (1-\alpha)\, Q_{s,a}(t) + \alpha \left[ r_{s,a}(t) + \gamma \max_{a'} Q_{s',a'}(t) \right], \qquad (4)$$

where α ∈ (0, 1] is the learning rate, γ ∈ [0, 1) is the discount factor used to measure the importance of future rewards, and $\max_{a'} \{Q_{s',a'}(t)\}$ is the maximum Q-value in the next state s′, which represents an estimate of the maximum state-action value in the future state.
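The update can be illustrated with a short Python sketch of the standard tabular Q-learning rule, which is assumed here to be what equation (4) expresses. The function name `q_update`, the list-of-lists table layout, and the example reward are hypothetical; the default values α = 0.1 and γ = 0.9 are the ones used later in the simulations.

```python
def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    q_table[s][a] is the Q-value of action `a` in state `s`.
    The new value blends the old estimate with the received reward plus the
    discounted best Q-value available in the next state `s_next`.
    """
    best_next = max(q_table[s_next])
    q_table[s][a] = ((1 - alpha) * q_table[s][a]
                     + alpha * (reward + gamma * best_next))
    return q_table[s][a]


if __name__ == "__main__":
    # Hypothetical table with 4 states (rows) and 5 actions (columns).
    q = [[0.0] * 5 for _ in range(4)]
    q[1][2] = 1.0  # pretend one action in state 1 already looks valuable
    print(q_update(q, s=0, a=3, reward=2.0, s_next=1))
```

A small α makes each update a gentle correction of the old estimate, while γ close to 1 lets the value of good future states propagate back to the actions that lead to them.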

Evolutionary segregation model
In the proposed model, n individuals are placed on the K-neighborhood lattice grid (unless otherwise specified, we use the von Neumann neighborhood and set K = 4) with N = L × L sites and periodic boundary, where each site is either occupied by one individual or empty, and the population density is defined as ρ = n/N (n < N). MC simulation is used to implement the model. Initially, individuals are distributed on the grid randomly and hold either C or D with equal probability. In each time step, individuals go through three stages in sequence: migration, interaction, and updating.
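The initial configuration can be sketched in Python as follows. This is only an illustration under stated assumptions: the site labels `EMPTY`, `COOP`, and `DEFECT`, the function names, and the rounding of n = ρN to an integer are all hypothetical choices, not details taken from the model description.

```python
import random

EMPTY, COOP, DEFECT = 0, 1, 2  # hypothetical site labels


def init_lattice(L, rho, rng):
    """Place n = int(rho * L * L) individuals on random sites of an L x L grid.

    Each placed individual is a cooperator or a defector with equal
    probability; all remaining sites stay empty.
    """
    n = int(rho * L * L)
    sites = [(x, y) for x in range(L) for y in range(L)]
    grid = [[EMPTY] * L for _ in range(L)]
    for x, y in rng.sample(sites, n):
        grid[x][y] = COOP if rng.random() < 0.5 else DEFECT
    return grid


def von_neumann_neighbors(x, y, L):
    """Four nearest-neighbor sites (K = 4) with periodic boundaries."""
    return [((x - 1) % L, y), ((x + 1) % L, y),
            (x, (y - 1) % L), (x, (y + 1) % L)]


if __name__ == "__main__":
    grid = init_lattice(10, 0.5, random.Random(0))
    occupied = sum(site != EMPTY for row in grid for site in row)
    print(occupied, von_neumann_neighbors(0, 0, 10))
```

The modulo operation implements the periodic boundary, so a site on the edge of the grid has the same number of neighbors as one in the interior.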

Table 1. States of an individual, defined by its peer and profit preferences.

| State | Peer preference | Profit preference | Description of state |
|-------|-----------------|-------------------|----------------------|
| S1 | n_s ⩾ n_dis | n_C ⩾ n_ϕ | This location satisfies the peer preferences and is profitable. |
| S2 | n_s ⩾ n_dis | n_C < n_ϕ | This location satisfies the peer preferences but provides meager profits, and the loss due to migration is slight. |
| S3 | n_s < n_dis | n_C ⩾ n_ϕ | This location does not satisfy the peer preferences but is profitable. |
| S4 | n_s < n_dis | n_C < n_ϕ | This location does not satisfy the peer preferences, provides meager profits, and the loss due to migration is slight. |

Here n_s (n_dis) is the number of neighbors with the same (different) strategy as the focal individual; n_C and n_ϕ denote the numbers of cooperative neighbors and adjacent void sites, respectively.
In the migration stage, each individual can adopt a migration action and adjust its spatial location through Q-learning (adaptive migration). Specifically, we define the four states of an individual from two aspects, as follows. (a) Peer preferences are characterized by comparing the numbers of similar and dissimilar individuals in the neighborhood [4,11]; it is worth noting that the types of individuals are differentiated by their own strategies, unlike in the previous Schelling model. (b) Profit preferences are characterized by comparing the numbers of cooperative neighbors and adjacent void sites: more void sites provide individuals with more directions for migration, but migration may cause individuals to leave a more profitable environment, especially when there are more cooperators than void sites, since only cooperators can bring benefits. Therefore, on the one hand, we define the individuals' states by comparing the number of neighboring peers (i.e. the neighbors with the same strategy) with the number of neighbors holding a different strategy. On the other hand, we also take the numbers of cooperative neighbors and void sites into consideration. For example, one possible state for a cooperator is that the current location does not satisfy all preferences: the number of cooperative neighbors is less than that of defectors, and the number of void sites is also less than the number of cooperative neighbors, as shown in table 1. The action set A consists of K + 1 migration actions (i.e. an individual can select one of the K nearest-neighbor sites as its migration direction or stay at the current site). The available action set Ã (Ã ⊆ A) consists of moving to one of the surrounding void sites or staying at the current location, as shown in figure 1(b), since individuals cannot migrate to an adjacent site that is already occupied.
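The mapping from neighborhood counts to the four states of table 1 can be sketched in Python as follows. The function name `classify_state` and its argument names are hypothetical; the two comparisons are the ones listed in the table (same-strategy versus different-strategy neighbors, and cooperative neighbors versus void sites).

```python
def classify_state(is_cooperator, n_coop, n_defect, n_void):
    """Return 'S1'..'S4' for a focal individual from its neighborhood counts.

    n_coop, n_defect, n_void: numbers of cooperative neighbors, defecting
    neighbors, and adjacent void sites.
    """
    n_same = n_coop if is_cooperator else n_defect
    n_diff = n_defect if is_cooperator else n_coop
    peer_ok = n_same >= n_diff     # peer preference satisfied
    profit_ok = n_coop >= n_void   # location counted as profitable
    if peer_ok and profit_ok:
        return "S1"
    if peer_ok:
        return "S2"
    if profit_ok:
        return "S3"
    return "S4"


if __name__ == "__main__":
    # A cooperator with 3 cooperative neighbors and 1 defecting neighbor.
    print(classify_state(True, n_coop=3, n_defect=1, n_void=0))
    # A defector surrounded only by void sites.
    print(classify_state(False, n_coop=0, n_defect=0, n_void=4))
```

Because the profit comparison always counts cooperative neighbors, a defector and a cooperator at the same site can share the profit condition while differing in the peer condition.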
We consider the 'ϵ-greedy' method [46], in which each individual takes the action with the highest Q-value with probability 1 − ϵ, or chooses one at random with probability ϵ.
In the interaction stage, each individual plays a pairwise PDG with each of its current neighbors. In general, a focal individual i obtains the cumulative payoff

$$\Pi_i = \sum_{j \in \Omega(i)} I_i^{\mathrm{T}} M I_j,$$

where M is the PDG payoff matrix and Ω(i) is the set of i's neighbors; an isolated individual surrounded only by void sites obtains nothing. In the update stage, an individual's strategy updating is governed by the best-take-over rule, in which it adopts the strategy of the individual with the highest cumulative payoff among its neighbors and itself. Individuals' evaluations of migration behavior depend on the payoffs from the interaction, because individuals are motivated to migrate to locations that satisfy all their preferences and bring benefits from the interaction. Therefore, we used payoffs from the game to update the associated Q-value according to equation (4).
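The action selection and the strategy update described here can be sketched in Python. Several details are assumptions made only for this illustration and are labeled in the comments: the random ϵ-branch draws uniformly from the available actions, the greedy branch is restricted to the available actions, and ties in the best-take-over rule keep the focal individual's own strategy. The function names are hypothetical.

```python
import random


def epsilon_greedy(q_row, available, epsilon, rng):
    """Choose an action index from `available` for one state.

    q_row: Q-values of all K + 1 actions in the current state.
    available: indices of actions that are currently feasible
               (moves to void sites, plus staying put).
    Assumption: both branches only ever return an available action.
    """
    if rng.random() < epsilon:
        return rng.choice(available)              # exploratory move
    return max(available, key=lambda a: q_row[a])  # greedy move


def best_take_over(own_strategy, own_payoff, neighbors):
    """Return the strategy with the highest payoff among self and neighbors.

    neighbors: list of (strategy, cumulative_payoff) pairs.
    Assumption: on ties the focal individual keeps its own strategy, and
    among tied neighbors the first one encountered wins.
    """
    best_strategy, best_payoff = own_strategy, own_payoff
    for strategy, payoff in neighbors:
        if payoff > best_payoff:
            best_strategy, best_payoff = strategy, payoff
    return best_strategy


if __name__ == "__main__":
    rng = random.Random(1)
    q_row = [0.1, 0.5, 0.9, 0.2, 0.0]  # hypothetical Q-values, action 4 = stay
    print(epsilon_greedy(q_row, available=[0, 1, 4], epsilon=0.02, rng=rng))
    print(best_take_over("C", 2.0, [("D", 2.8), ("C", 3.0)]))
```

In the first call, action 2 has the largest Q-value but is not available (its target site is occupied), so the greedy choice falls back to the best feasible action.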
We set the grid size to L = 100-500 in an asynchronous MC simulation, and the reinforcement learning parameters were set to α = 0.1, γ = 0.9, and ϵ = 0.02. To guarantee stable results, the final results were obtained by averaging over the last 5000 of 1 × 10^6 MC time steps.

Results
To investigate segregation in migration, we concentrated on how population density ρ and dilemma strength b affect the formation of segregation in social dilemmas. Inspired by previous works [12][13][14][15], we defined the mixed index (MI) to quantify the degree of segregation. In its definition, ϕ represents the void site, $L_{i-j}$ (i, j ∈ {C, D, ϕ}) is the number of i-j strategy links, and $F_C$ is the fraction of cooperators. $MI_C$ ($MI_D$) indicates the connection degree of dissimilar individuals at the boundary of C (D) clusters, so MI is the weighted average of the values for C and D clusters and measures the overall level of mixture. In particular, MI → 1 indicates that dissimilar individuals are evenly intertwined, while MI → 0 indicates complete segregation in the population. We present the b-ρ phase diagram in figure 2 and cross-sectional diagrams in figure 3. The best-take-over rule means that any strategy with a higher payoff is always imitated, so the spread of cooperation depends on the strategy changes at the C-D strategy links. Taking into account all possible payoff combinations of such C and D individuals, the critical conditions for defectors to further exploit cooperative neighbors are b = 1, 4/3, 3/2, and 2 [36], as shown in figure 2. Compared to random migration and success-driven migration, adaptive migration can promote cooperation, and in particular cooperation can dominate at low dilemma strength (b < 4/3), see figure A1. For population density ρ, as expected, network reciprocity does not work when the population is too sparse (ρ ≲ 0.205 and ρ ≲ 0.265 for low and moderate dilemma strength, respectively). Cooperation emerges as population density rises slightly, and discontinuous phase transitions (from the C phase to the C + D phase) occur, as shown in figure 3(a).
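The link counts that enter the mixed index can be tallied with a short Python sketch. It covers only the counting step on a von Neumann lattice with periodic boundaries; it does not reproduce the MI formula itself. The site labels "C", "D", and "E" (for a void site) and the function name `count_links` are hypothetical choices for this example.

```python
from collections import Counter


def count_links(grid):
    """Count nearest-neighbor links by the unordered pair of site types.

    grid: square list of lists with entries "C", "D", or "E" (void).
    Returns a Counter with keys such as "C-D", "C-E", or "C-C".
    Looking only at the right and lower neighbor of every site visits each
    link exactly once on a periodic lattice with side length >= 3.
    """
    L = len(grid)
    links = Counter()
    for x in range(L):
        for y in range(L):
            for nx, ny in (((x + 1) % L, y), (x, (y + 1) % L)):
                pair = "-".join(sorted((grid[x][y], grid[nx][ny])))
                links[pair] += 1
    return links


if __name__ == "__main__":
    grid = [["C", "C", "E"],
            ["C", "D", "E"],
            ["E", "E", "E"]]
    for pair, count in sorted(count_links(grid).items()):
        print(pair, count)
```

Links between unlike sites (here "C-D", "C-E", and "D-E") are the ones that describe cluster boundaries, which is the information the mixed index is built from.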
As ρ increases, pure cooperation (the C phase) appears first at low dilemma strength, followed by C-D co-existence (the C + D phase), whereas cooperation and defection always co-exist at moderate dilemma strength. As b increases, a discontinuous phase transition occurs, which is related to the previously mentioned critical conditions for b, see figure A1. Here, we are more interested in the unusual segregation in the C + D phase, since in the pure strategy phases C and D the system is occupied by only one strategy, which corresponds to environmentally selective segregation. Interestingly, we found that the level of mixture varies within the C + D phase: C and D are completely separated in the region of lower b and ρ, but moderate b or ρ results in complete population mixing, and the degree of mixing increases as population density rises, see figure 3(b). We also investigated the case near the boundary of the discontinuous phase transition. The establishment of cooperative clusters is essential for the emergence of cooperation during migration, but population densities or initial values of $F_C$ that are too low are not conducive to the formation of cooperative clusters (see figure A3). However, once such a cluster is established through self-organization, cooperation can spread in the population from the surviving cooperative cluster (see figure 7), which differs from the previously reported bistability phenomenon that occurs on discontinuous boundaries [51].
To understand the impact of migration on segregation, we first considered two migration patterns, i.e. adaptive migration and random migration, in a dense population (ρ = 0.85). We present snapshots of the system in figures 4(a) and (b). Compared to the dispersed cooperative clusters formed under random migration, in which the sizes of C, D, and ϕ (void site) clusters follow a power-law distribution (see figure A5), adaptive migration changes the distribution characteristics: cooperators are divided into numerous larger C clusters that intertwine with defectors to form a block-like structure, leaving the empty sites to converge. The individuals are more aggregated under adaptive migration, with few empty sites between individuals despite the high population density. This implies that adaptive migration facilitates this ordered segregation.
To further clarify the two types of segregation in the C + D phase, we show the steady state of two typical scenarios for b = 1.2 and 1.4 in figures 4(c) and (d). Regardless of the value of b, cooperators prefer to agglomerate and can even form huge clusters in sparsely populated situations. At lower b, defectors have no advantage over a tight cooperative cluster and are therefore excluded, leading to exclusionary segregation. As dilemma strength increases, defectors are able to invade the C cluster and spread rapidly. In the process, defectors divide the huge cooperative cluster into regular block-like subgroups, a phenomenon we refer to as subgroup segregation. Furthermore, we considered other lattice networks with K = 8, 12, and 20. The number of feasible payoff combinations increases with the number of neighbors, so the critical conditions change with the network topology. However, this has no influence on the evolutionary route of cooperation, and there are still three phases in the system. Similar block-like subgroups also still exist in the population, and their shape is related to the interaction range of individuals. Detailed results are presented in figures A6 and A7 of the appendix.
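The way the critical dilemma strengths depend on the neighborhood size can be illustrated with a simplified Python enumeration. It rests on assumptions that are not spelled out in the text: a cooperator with k cooperative neighbors earns k, a defector with m cooperative neighbors earns m·b, both k and m range over 1, ..., K, and only thresholds within 1 ⩽ b ⩽ 2 are kept. A threshold is any b at which the two payoffs tie, i.e. b = k/m. The function name `critical_b` is hypothetical.

```python
from fractions import Fraction


def critical_b(K, b_min=1, b_max=2):
    """List the values b = k/m at which a defector's payoff m*b equals a
    cooperator's payoff k, for k, m in 1..K, kept within [b_min, b_max].

    This is a simplified enumeration under the assumptions stated in the
    lead-in, not a full analysis of the update dynamics.
    """
    thresholds = set()
    for k in range(1, K + 1):        # cooperative neighbors of the cooperator
        for m in range(1, K + 1):    # cooperative neighbors of the defector
            b = Fraction(k, m)
            if b_min <= b <= b_max:
                thresholds.add(b)
    return sorted(thresholds)


if __name__ == "__main__":
    print([str(b) for b in critical_b(4)])  # ['1', '4/3', '3/2', '2']
    print(len(critical_b(8)))
```

For K = 4 the enumeration returns exactly the four values quoted for the von Neumann neighborhood, and for larger K the list becomes denser, in line with the statement that more neighbors allow more payoff combinations.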
The formation of subgroup segregation is the most perplexing of the three segregation states. To understand how this segregation emerges during migration and how adaptive migration works, we present the evolutionary dynamics and snapshots in figures 5 and 6, respectively. To clarify why adaptive migration enhances network reciprocity, we divided the evolutionary dynamics into three periods, following the concept reported in [34]. These periods are: (a) the enduring (END) period, in which the global cooperation fraction decreases as isolated cooperators are invaded by defectors; (b) the expanding (EXP) period, in which the global cooperation fraction increases since the formed cooperative clusters help cooperators expand their territories [22,23,34]; and (c) the stability period, in which $F_C$ gradually stabilizes and the population gradually gathers. The state transitions and the distribution of destination states of individuals in the different migration periods are used to describe the characteristics of adaptive migration. In the END period, both cooperators and defectors have high mobility to explore locations that can satisfy either peer or profit preferences. Because the initial random distribution makes it difficult for cooperators to encounter peers, most of them disappear during exploration. As a few cooperators find locations that satisfy their peer preferences, some tiny cooperative clusters form, and their exploration behavior rapidly decreases, as shown in figures 5(b) and 6(b). During the EXP period, individuals' migration decisions are dominated by profit preferences, causing most individuals to rapidly approach cooperative clusters where their preferences can be satisfied, which facilitates the development of cooperative clusters, as shown in figure 6(c).
In contrast to success-driven migration, where all individuals become static once $F_C$ reaches its maximum value (see figure A1(b)), the population gathers further under adaptive migration, driven by peer preferences (see figure 5(d)). This peer-preference-driven migration causes dissatisfied individuals to constantly adjust their location during the stability period, allowing the entire population to aggregate and form an intra-population block-like structure, as shown in figure 6(e).
It is clear that the rise of cooperation depends on the formation of cooperative clusters at the end of the END period. Under the migration mechanism, movement is beneficial to the aggregation of cooperators, but too much exploration may cause the cooperative clusters to break up. To investigate the effect of adaptive migration on cooperative clusters further, we placed a tiny cooperative cluster consisting of only four cooperators in a sea of defectors, as shown in figure 7(a). According to the critical conditions for the invasion of defectors mentioned above [36], such a tiny cooperative cluster represents a critical condition for the survival of cooperators. Similar to the evolutionary process in figure 5, the rapid reduction of exploratory migration by cooperators is due to the satisfaction of their preferences, which not only preserves the stability of the cooperative cluster but also triggers an outbreak of cooperation and eventually leads to the aggregation of the population. This means that under moderate dilemma strengths, a seemingly negligible tiny cooperative cluster can become a core attractor for population aggregation under adaptive migration. Such tiny cooperative clusters may form as a result of the self-organization of cooperators or of behavioral noise [37].

Conclusion and discussion
In this work, we proposed an evolutionary segregation model combined with adaptive migration on an incompletely populated network; here, each individual could decide whether to migrate and choose an available direction as its migration action according to the Q-learning algorithm. The strategies adopted by individuals could evolve by the best-take-over rule. We found that the interplay of individual interaction and adaptive migration spontaneously produces rich segregation patterns in the population, ranging from complete separation to highly ordered mixing: environmentally selective segregation, in which the environment favors one type of individual and leads to the extinction of the other; and, in the phase where dissimilar individuals co-exist, both exclusionary segregation and subgroup segregation.
The proposed model shows that adaptive migration promotes the evolution of cooperation and that individuals can balance peer and profit preferences under adaptive migration. During the enduring period, individuals migrate driven primarily by peer preferences, which allows cooperators to encounter peers and form clusters quickly. During the expanding period, individuals migrate closer to cooperative clusters, driven primarily by profit preferences, which promotes the expansion of cooperative clusters. During the stability period, individuals continuously adjust their location, driven primarily by peer preferences, which facilitates population aggregation and the formation of intra-population block-like structures. This migration process drives orderly reorganization and segregation in the population's self-organization. Several studies have indicated that people prefer to live with others who share similar personal preferences and socio-economic status, because similar people are more likely to form reciprocal relationships [2,6,52]. Accordingly, the segregation of black and white residential areas, as well as the formation of various social communities, appears to be explained by reciprocity [3,8,53], which also suggests that real individuals balance peer and profit preferences.
Furthermore, we noted that a higher dilemma strength exacerbates the conflict of interests between cooperators and defectors. This is primarily why individuals cannot form a completely homogeneous giant cluster in subgroup segregation, which is consistent with Zubrinsky's observation that severe racial discrimination and other constraints on residential opportunities impede urban integration [54]. The proposed model considered only cooperation and defection; more patterns of individual interaction should be considered in the future. Moreover, we believe that exploring segregation based on individual interaction patterns is one of the most effective approaches. As we observed the process of separation and recombination among dissimilar individuals, in which cooperative clusters acted as attractors during evolution, the characteristics of the population were similar to those of some real dynamic systems, such as the radial growth of cities and the reorganization of urban blocks [53,55]. Given this, another open question is the role of attractors in evolutionary dynamics, which could help us further understand self-organization processes in society. Although many real-world systems are heterogeneous, the proposed model cannot directly explore heterogeneous populations because of the simple structure of the Q-learning algorithm. Combining heterogeneous networks with deep reinforcement learning is a promising direction for further study.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).

Appendix
... figure A3). The stable cooperation fractions are either 0 or 1 and are sensitive to the initial distribution of strategies. In particular, if cooperative clusters can form through self-organization during the enduring period, the population turns to the full C state; otherwise it turns to the pure D state. For b = 1.334 and 1.449, cooperation can coexist stably with defection as long as $F_C^{\mathrm{ini}} \gtrsim 0.05$. As mentioned in the main text in connection with figure 2, there are three types of segregation; we present the spatial distribution of strategies for each of them in figure A4. Complete segregation occurs when cooperators defeat all defectors at lower b, but cooperator agglomeration continues until a single C super-cluster forms. When b is too high, cooperators are defeated, and some defectors also gather due to peer preferences, see figure A4(c). To investigate the impact of topology on segregation, we examined lattice network structures with K = 8, 12, and 20; the results are shown in figures A6 and A7.
There are still three phases over the considered ranges of ρ and b, namely the C phase, the D phase, and the C + D phase, which is similar to the evolutionary results on the four-neighbor lattice shown in the main text. The types of segregation have not changed, and there is still subgroup segregation in the C + D phase, as shown in figure A7.

Figure A5. Characteristics of segregation are markedly different between adaptive migration (RL) and random migration (Random). The properties of segregation in figures 4(a) and (b) are quantified by the distributions of the cluster sizes of cooperators, $S_C$, defectors, $S_D$, and void sites, $S_\phi$. We also provide the properties of the population at random initialization (Initial). Fitted color lines follow the power-law distribution $Y = cX^{\beta}$. Initially, the sizes of C, D, and ϕ clusters exhibit a power-law distribution due to the random distribution of strategies. This property persists after random migration. However, the adoption of adaptive migration changes this characteristic: cooperators are divided into numerous medium-sized C clusters, leaving void sites to converge.
Finally, we study the effect of the reinforcement learning parameters on segregation in figure A8, focusing only on cases in the co-existence phase. In the Q-learning algorithm, the learning rate α only affects the speed of numerical iteration. Therefore, as long as the iteration time is long enough, individuals can identify the optimal migration action regardless of the value of α. The discount factor γ defines the importance of future benefits in decision making. For lower γ, individuals tend to be 'myopic' and choose to stay at their current sites, which affects the balance of individual preferences and the aggregation of individuals, since both cooperators and defectors are more likely to stay put and form loose clusters. When γ is high enough, it has no effect on the results. In addition, the formation of segregation is insensitive to the error probability ϵ when it is low. In this paper, we only consider cases with ϵ ⩽ 0.1, because the movement of individuals becomes more random for large ϵ, which is not discussed here. We assume that individuals are more concerned about future payoffs and have a lower error probability, so we set γ = 0.9 and ϵ = 0.02 in the main text without loss of generality.