How cognitive and environmental constraints influence the reliability of simulated animats in groups

Evolving in groups can either enhance or reduce an individual’s task performance. Still, we know little about the factors underlying group performance, which may be reduced to three major dimensions: (a) the individual’s ability to perform a task, (b) the dependency on environmental conditions, and (c) the perception of, and the reaction to, other group members. In our research, we investigated how these dimensions interrelate in simulated evolution experiments using adaptive agents equipped with Markov brains (“animats”). We evolved the animats to perform a spatial-navigation task under various evolutionary setups. The last generation of each evolution simulation was tested across modified conditions to evaluate and compare the animats’ reliability when faced with change. Moreover, the complexity of the evolved Markov brains was assessed based on measures of information integration. We found that, under the right conditions, specialized animats could be as reliable as animats already evolved for the modified tasks, and that reliability across varying group sizes correlated with evolved fitness in most tested evolutionary setups. Our results moreover suggest that balancing the number of individuals in a group may lead to higher reliability but also lower individual performance. Besides, high brain complexity was associated with balanced group sizes and, thus, high reliability under limited sensory capacity. However, additional sensors allowed for even higher reliability across modified environments without a need for complex, integrated Markov brains. Despite complex dependencies between the individual, the group, and the environment, our computational approach provides a way to study reliability in group behavior under controlled conditions. In all, our study revealed that balancing the group size and individual cognitive abilities prevents over-specialization and can help to evolve better reliability under unknown environmental situations.


Introduction
Intelligence is the ability to adapt to changes. According to this prevalent perspective, possessing general intelligence [1,2] not only enables one to perform a task correctly under already known conditions, but also to perform well under unexpected conditions. Further, in natural environments intelligent behavior is not only dependent on the (maybe limited) intelligence of the individual organism, but also involves interactions with the social and physical environment [3][4][5]. The ability to adapt one's behavior to the behavior of other group members is necessary to act appropriately in case of unforeseen events, not only in the animal world but also in high-reliability organizations (e.g., aircraft carriers or nuclear power plants) [6][7][8]. In the following, we use the term "reliability" to denote the ability of an organism to perform well even under slightly modified, unfamiliar circumstances.
While it seems intuitive that there is a triangular relationship between the individual, the group, and the environment [9], we discovered a lack of research on how individual behavior and group behavior are interrelated and depend on spatial attributes of the environment [10]. Several studies have investigated intelligence and knowledge on the group level, and some have modelled groups of individuals as single agents (e.g., [11][12][13][14][15]). These studies have their origins in a variety of disciplines and have in common that they seek to elucidate the dynamics between group members. However, our understanding of how an individual actor in a group evolves intelligent behavior and reliability is still limited.
Here, we are particularly interested in how an individual's sensorimotor and memory capacity, the interaction between group members, and the environment constrain this evolution. To explore these factors in a controlled experimental setup, we used a simple evolution simulation, and we tested how specific cognitive and environmental limits influence the behavior, performance, and reliability of artificial organisms evolved in groups of various sizes.
Inspired and motivated by Pinter-Wollman et al. [10], we investigated how the behavior and performance of evolved "animats" (simulated agents with cognitive abilities [16,17]) varies in different task conditions, such as changes in the proportions of static objects, dynamic objects (moving group members), and individual sensorimotor and memory architecture. Using a simulation approach enabled us to manipulate and observe three dimensions which might influence evolved task performance and reliability: the group size (influencing the density of animats present in the environment), the animats' architecture (that is, the maximal number of available sensors, motors, and memory units), and the environmental design. In this study, we explicitly distinguish between the final task performance reached in the evolution environment ("evolved fitness" (EF)) and the post-evolutionary "task fitness" (TF), which measures the performance of the evolved animats under specific modified conditions (not encountered during evolution). High task fitness across many modified conditions indicates high reliability. High evolved fitness, but low reliability could then be interpreted as a form of narrow intelligence, while high evolved fitness and high reliability would point to more general intelligence.
We used a genetic algorithm to let the animats' behavior evolve under various evolutionary setups. Specifically, the animats were controlled by Markov brains (MBs) [17], which consisted of computational units whose functions and connectivity were determined by the animats' adaptive genome. The animats' task was to navigate through a two-dimensional world composed of two rooms without colliding with other group members (see Fig 1). Each animat could achieve a maximum score of 4 points within each trial, with a small penalty (-0.075 points) for each collision and a large reward (+1.0 points) for crossing gates between rooms. After an evolution of 10,000 generations, we tested the final animats under modified task

Results
We simulated the evolution of artificial organisms ("animats") with diverse cognitive architectures (number and type of available sensors, motors, and memory units) for 10,000 generations under various conditions. See Table 1 for an overview of all evolution simulations conducted.
All animats were evolved to travel between two rooms in a two-dimensional environment, which they shared with other animats of the same type ("clones" with the same genome), except in the "single" condition (see Fig 1(A) and Table 1). Fitness selection occurred at the level of the genome (each generation consisted of a population of 100 genomes) and depended positively on the average number of times that the corresponding animats (the "phenotype") stepped through the gate between the two rooms (+1.0 points per crossing). After a successful gate crossing, the same animat did not receive another reward for 100 time steps to avoid crowding at the gate. In addition, we imposed a small penalty each time an animat collided with another animat (-0.075 points, if not stated otherwise). Throughout, fitness values are displayed as absolute numbers with a maximum value of 4 points (corresponding to the maximal number of possible gate crossings without collisions). A detailed description of the task environments and the evolutionary algorithm is provided below in the Methods section.

Table 1. Definition of simulation conditions ("evolutionary setups"). Evolutionary setups are indicated by a label G i, where the index i specifies the respective type of evolutionary setup. Differences compared to the baseline configuration (top row, G 0.50, group size of 36 animats) are highlighted in bold.

In many evolutionary setups (Table 1), high final fitness values (EF > 3, "evolved fitness") were reached. Fig 1(B) displays six heatmaps visualizing several evolved movement patterns. Animat groups with reasonable evolved fitness (EF) converged towards a "swarm"-like wall-following behavior, which is determined both by interactions with fellow animats and by interactions with the environment [4,10].
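The scoring rule described above (gate reward, 100-step refractory period, collision penalty, and a maximum of 4 points) can be illustrated with a minimal sketch. The function names, the event-list inputs, the exact refractory comparison, and the way the cap is applied are assumptions made for illustration; the fitness function actually used is described in the Methods section.

```python
GATE_REWARD = 1.0          # points per rewarded gate crossing (from the text)
COLLISION_PENALTY = 0.075  # points deducted per collision (from the text)
REFRACTORY_STEPS = 100     # no further reward for this many steps (from the text)
MAX_SCORE = 4.0            # maximal score per trial (from the text)


def trial_score(crossing_steps, collision_steps):
    """Score one animat in one trial from lists of event time steps.

    Both arguments are hypothetical inputs: the time steps at which the
    animat crossed the gate and at which it collided with another animat.
    """
    reward = 0.0
    last_rewarded = None
    for t in sorted(crossing_steps):
        # Assumption: a crossing counts only if at least REFRACTORY_STEPS
        # have passed since the previous rewarded crossing.
        if last_rewarded is None or t - last_rewarded >= REFRACTORY_STEPS:
            reward += GATE_REWARD
            last_rewarded = t
    # Assumption: the cap applies to the reward before penalties are deducted.
    reward = min(reward, MAX_SCORE)
    return reward - COLLISION_PENALTY * len(collision_steps)


def genome_fitness(trials):
    """Average trial score over all clones that share one genome."""
    scores = [trial_score(crossings, collisions) for crossings, collisions in trials]
    return sum(scores) / len(scores)
```

For example, `trial_score([10, 50, 120, 230], [5, 6])` rewards the crossings at steps 10, 120, and 230 (the one at step 50 falls inside the refractory period) and deducts two collision penalties.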

Once evolved, the best genome of each final generation was selected for post-evolutionary tests under modified conditions. Specifically, we modified the following three environmental factors: (1) the number of co-existing animats, (2) the complexity of static obstacles compared to the original two-dimensional environment (see Fig 1(A), and the Methods section for details on the environmental design), and (3) the interaction conditions between agents (see Table 2). For each test condition we assessed the "task fitness" (TF) achieved in the particular post-evolutionary test environment (to be distinguished from the animats' evolved fitness (EF) reached after 10,000 generations in its original evolutionary setup). In addition, we evaluated the animats' behavior and quantified their reliability (average task fitness across modified conditions) across varying group sizes in the original environment (R).
Finally, we quantified the complexity of the evolved MBs using two measures developed within the framework of integrated information theory (IIT) [19,20]: the integrated information (Φ^Max) and the corresponding number of concepts (#Concepts(Φ^Max)). The analysis was performed using "PyPhi", the IIT Python toolbox [22], with the standard settings according to [19]. PyPhi takes the evolved MBs as input in the form of their "transition probability matrix" (TPM). The TPM specifies how the states of the MB's computational units (e.g., motors and memory units) update, given the state of their inputs. In this study, all computational units are binary and deterministic (see Methods "Animat Architecture"). Briefly, Φ quantifies how much of the information specified by all components of a system would be lost under a partition of the system. Φ has been proposed as a measure of complexity, as it will be high for systems with many different components (functional differentiation) that are also highly integrated [19,23]. For a particular MB, we identify the subset of computational units with the maximal amount of integrated information and denote its value as Φ^Max. For this subset, we also measure the number of components ("concepts"), #Concepts(Φ^Max). A "concept" in IIT is a subsystem that has a causal role within the system, that is, a mechanism within the system. A concept causally constrains both the past and future states of the system and is irreducible to its parts. #Concepts(Φ^Max) thus captures the number of internal functions performed by the subsystem with Φ^Max. For details please refer to the original publication [19] and to [20] for an application of  [16,17,24]), the chosen measures are fairly well established [20,22,23,25] and are theoretically motivated as part of the formal framework of the integrated information theory (IIT) [19].
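To make the notion of a TPM for binary, deterministic units more concrete, the following sketch builds a state-by-node TPM for a toy network of three units. The update rule, the unit names, and the helper functions are hypothetical and do not correspond to an evolved MB. The rows are listed in little-endian order (the first unit changes fastest), which, to the best of our knowledge, is the state ordering PyPhi expects.

```python
from itertools import product


def update(state):
    """Hypothetical deterministic update rule for three binary units.

    A copies C, B computes A OR C, and C computes A AND B.
    """
    a, b, c = state
    return (c, a | c, a & b)


def all_states(n):
    """All binary states of n units in little-endian order (unit 0 flips fastest)."""
    return [tuple(reversed(bits)) for bits in product((0, 1), repeat=n)]


def state_by_node_tpm(update_fn, n):
    """Row r holds the next state of every unit, given current state number r."""
    return [list(update_fn(state)) for state in all_states(n)]


tpm = state_by_node_tpm(update, 3)
for state, row in zip(all_states(3), tpm):
    print(state, "->", tuple(row))
```

Because every unit is deterministic, each entry is either 0 or 1. A matrix of this shape, converted to a NumPy array, could then be handed to PyPhi (e.g., via `pyphi.Network`); that step is omitted here so that the sketch runs without additional packages.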
We organized the presentation of our results into four sections according to the evolutionary setups shown in Table 1: varying "group size" (Figs 2-4), "cognitive architecture" (Figs 5-7), "interaction conditions" (Figs 8-10), and "sensor configuration" (Figs 11-13). Each section contains three figures displaying (1) the fitness evolution across generations and the final evolved fitness values, (2) the task fitness, reliability, and behavioral features under modified post-evolutionary test conditions (see Table 2), and (3) a complexity analysis of the evolved MBs. Since the figures share the same construction, we briefly introduce their attributes here.

Evolved fitness: Figs 2, 5, 8 and 11 show (a) the evolution of the mean fitness ⟨F⟩ across generations and (b) the distribution of evolved fitness values (EF) of the final generation across the N = 30 evolution simulations that we performed per evolutionary setup. The shaded areas in (a) visualize the standard error of the mean (SEM). The boxplots in (b) visualize the evolved fitness per condition G i:

$$EF = F(A^{i}_{10{,}000}),$$

where $A^{i}_{10{,}000}$ is the group of animats of the final generation of evolution simulation $i \in N$ and $F(A^{i}_{10{,}000})$ its fitness value (see Methods for more details on the fitness function).

Post-evolutionary tests: Figs 3, 6, 9 and 12 show (a) the mean task fitness ⟨TF⟩ across group sizes under the test conditions listed in Table 2. Note that the condition under which a group of animats evolved is indicated by their G i label (see Table 1). ⟨TF⟩ is an average fitness across the N = 30 evolution simulations per experimental setup for a specific group size GS and (modified) condition M:

$$\langle TF \rangle = \frac{1}{N} \sum_{i \in N} TF_{GS,M}(A^{i}_{10{,}000}).$$

Next, we quantified reliability for one test dimension, across modified group sizes in the "Original" test condition. We denote this specific measure of reliability as R, computed as:

$$R = \langle TF \rangle_{GS} = \frac{1}{|GS|} \sum_{gs \in GS} TF_{gs,\mathrm{Original}}(A^{i}_{10{,}000}).$$

Note that in this case, the average is calculated across group sizes, not evolution simulations, as indicated by the subscript "GS", which stands for group size with |GS| = 21 (see above). Panel (b) shows the distribution of these reliability values (R) and their dependency on evolved fitness (EF). Finally, panel (c) shows how the animats' behavior depends on the relative group size in the "Original" test environment, evaluating the probability of an animat to stand still ("no movement"), turn, or move forward. Percentages are displayed on a scale from 0-100%.

MB complexity analysis: Figs 4, 7, 10 and 13 show two types of metrics for MB complexity: (a) the distribution of integrated information (Φ^Max) [19,20], and (b) the corresponding number of concepts (#Concepts(Φ^Max)) [19] per evolutionary setup. Φ and #Concepts(Φ^Max) are dimensionless quantities and therefore have no unit.

Fig 3. (a) Overall, only G single failed to generalize across group sizes, presumably because animats that evolved without other group members did not develop strategies to avoid collisions (compare Original to No penalty test condition, where G single performs well throughout). There is a large difference in the Blocked environment between G random, G 0.25, and G 0.50, while in other environments their task fitness is comparable, pointing to somewhat different navigation strategies. (b) On average, G random is the most reliable condition across varying group sizes, followed by G 0.50 and G 0.25. Except for G single, EF correlates with R in all groups. (c) Note that G 0.50 and G 0.25 change their behavior more with increasing animat density compared to G random.
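As a rough illustration of the two averages described above, the sketch below computes ⟨TF⟩ (an average across evolution simulations for one group size) and R (an average across group sizes for one evolution simulation). The data layout, the function names, and the numbers are hypothetical and serve only to show the direction of each average.

```python
from statistics import mean


def mean_task_fitness(task_fitness, group_size):
    """<TF> for one group size: average over all evolution simulations."""
    return mean([run[group_size] for run in task_fitness])


def reliability(run):
    """R for one evolution simulation: average TF over all tested group sizes."""
    return mean(list(run.values()))


# Hypothetical task-fitness values in the "Original" test condition:
# one dict per evolution simulation, mapping group size -> TF.
task_fitness = [
    {1: 3.0, 36: 2.0, 72: 1.0},
    {1: 4.0, 36: 3.0, 72: 2.0},
]

print(mean_task_fitness(task_fitness, 36))
print([reliability(run) for run in task_fitness])
```

In the study itself, the first average runs over N = 30 simulations and the second over |GS| = 21 group sizes; the toy data here use two simulations and three group sizes.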

Varying group size: Evolution under specialized conditions can produce reliable agents
In a first set of experiments, we compared animats that evolved within groups of different, fixed sizes (1-72 animats), using the baseline animat and environment design in all cases, see Table 1: G 1.0-single . Preliminary results, including a comparison of the reliability R of evolution conditions G 1.0-single , were presented in [18]. As shown in Fig 2(A) and reported in [18], group size during evolution does impact the animats' ability to perform the gate crossing task (see Fig 1(A)), which impacts the final evolved fitness EF.
In our spatial-navigation task, animats in condition G single (group size of 1 animat) frequently find an optimal solution within 10,000 generations. We assume that this is due to the decreased difficulty of the task in this condition, since colliding is impossible and walls (static obstacles) may still guide the animat towards the gate. Increasing the number of animats in the environment seems to make it more difficult to navigate. Animats have to develop not only the ability to cross the gate, but also to avoid collisions with other group members, which would cause a penalty [18]. Reliability R across group sizes was found to be high if the animats evolved in an environment where the density of animats was balanced (G 0.50 and G 0.25) (see Fig 3A and 3B, and [18]).

Fig 5. (a) Despite their similar capacity for memory, G smallbrain evolved higher task fitness than G no-feedback. (b) Ceiling outliers suggest that animats in G no-feedback are generally capable of performing as well as the average animat in G smallbrain, but that this is less likely. The performance of G bigbrain is comparable to G 0.50, with more distributed outcomes. https://doi.org/10.1371/journal.pone.0228879.g005

Reliability of simulated animats with group interaction. PLOS ONE | https://doi.org/10.1371/journal.pone.0228879 February 7, 2020
In our study, we included an additional comparison setup (G random), for which group size varied randomly during evolution. We hypothesized that animats evolved in this setup should achieve high reliability R in the post-evolutionary tests, since variation in group size would already be part of their evolution. As shown in Fig 2(B), the final fitness values EF for G random were comparable to those of evolutionary setups with fixed, intermediate group sizes (G 0.50 and G 0.25), though still significantly different (p < .05; see Tables A-G in S1 Text for all statistical tests).
As hypothesized, R was found to be highest for G random (see Fig 3). Notably, however, animats that evolved under specialized conditions with intermediate group sizes (G 0.50 and G 0.25) reached R values comparable to animats that already encountered variable group sizes during evolution (G random) (see Fig 3). G 0.50 and G random show similar ⟨TF⟩ values in the original environment setting, particularly for larger group sizes (> 50% relative group size) (see Fig 3(A)). Nevertheless, G random animats evolved to higher TF for smaller group sizes, leading to comparable but still significantly different average R values (p < .05) (see Fig 3(B)).

Fig 6. (a) G smallbrain shows higher ⟨TF⟩ than G no-feedback across group sizes. G bigbrain is overall comparable to the baseline condition G 0.50, but shows worse performance in the Blocked test condition and some of the modified environments for larger group sizes. (b) Reliability R correlates with EF for all setups. The lower R values of G smallbrain and G no-feedback compared to baseline can thus be explained by their already lower evolved fitness values. Note, however, that G smallbrain and G no-feedback perform better than G 0.50 across group sizes in the 4 (Messy) Rooms test conditions (see (a)). (c) For larger group sizes, G smallbrain remains static more often than G no-feedback. https://doi.org/10.1371/journal.pone.0228879.g006
While R quantifies reliability across modified group sizes in the Original test condition, the other post-evolutionary tests (see Table 2) may reveal further differences between evolutionary setups. For example, the Blocked test condition (in which animats cannot overlap) suggests a difference in strategy between G 0.50, G 0.25, and G random (see Fig 3(A)): G 0.50 and G 0.25 are more severely affected by this deviation from baseline settings in which animats can overlap, albeit under a penalty. While animats evolved in G random also experienced large group sizes with a higher likelihood of a penalty during evolution, G 0.50 and G 0.25 animats consistently faced only intermediate probabilities of colliding with other animats, which may have led to less effective strategies for avoiding collisions. In addition to varying group sizes, we also tested the final generation of animats in four environments with different wall arrangements (see Fig 3(A), bottom row). ⟨TF⟩ decreased to similarly low levels in all conditions, but least for evolutionary setups with larger group sizes. Note also that G random demonstrated relatively low ⟨TF⟩ under modified wall arrangements. Thus, high reliability across one dimension (here, modified group sizes as evaluated by R) does not necessarily transfer to other dimensions (e.g., modified wall arrangements).

Fig 8. Fitness Evolution and distribution of the final evolved fitness.
The animats in conditions without a penalty (G blocked/no-penalty and G no-penalty ) evolved to relatively high fitness levels. In particular, G no-penalty evolved like G single , which can be explained by the fact that animats in both of these conditions were not impacted at all by other animats. Similarly, G blocked seemed equivalent to the baseline setup G 0.50 , while G blocked/no-penalty evolved to slightly higher fitness values, comparable to G random . https://doi.org/10.1371/journal.pone.0228879.g008

In terms of their behavior (see Fig 3(C)), animats in G random were less idle and showed fewer turns and more steps forward in comparison with animats in G 0.50, particularly for large group sizes. This suggests that the movement in G random is more fluid overall (see also Table 3). By contrast, the specialized animats display larger differences in behavior across group sizes. Please refer to [18] for a more detailed discussion of behavioral differences across evolutionary setups with fixed group sizes G 1.0-single.

Fig 4 shows the distribution of Φ^Max and #Concepts(Φ^Max) [19,20] as a measure of the complexity of the evolved MBs across evolutionary setups with different group sizes (G single-1.0 and G random). While the evolutionary setups with the highest R values (G random and G 0.50) do show the highest average values of Φ^Max and the largest number of concepts (internal mechanisms), differences between conditions generally do not reach statistical significance (p ≥ .05) due to the large variance in the complexity values (see Tables C and D in S1 Text). We assume that more data (simulation experiments per evolutionary setup) would be required to estimate the means precisely enough to verify the observed trend. In our predecessor study [18], a correlation of high evolved fitness EF and reliability R with high brain complexity was found using a simplified measure of brain complexity based on anatomical connectivity only. The integrated information measures employed here are sensitive to the causal interactions within the MBs and thus also capture functional aspects [19,20]. In the present data, significant pair-wise differences could be found between G single and the most reliable setups (G random and G 0.50). As explained above, the task environment experienced by animats in G single is less demanding than for setups with larger group sizes. Our observations are thus in line with [20], which demonstrated higher Φ^Max and #Concepts(Φ^Max) for animats evolved in more complex environments.

Fig 9. Post-evolutionary tests under modified conditions. (a) There was a significant difference between conditions in which interactions with other agents played a role for fitness evolution (G 0.50, G random, G blocked, G blocked/no-penalty) and those conditions in which it did not (G single and G no-penalty) (see text). (b) With a collision penalty imposed, G no-penalty showed similarly low reliability as G single, whereas G blocked showed similarly high reliability as G 0.50. G blocked/no-penalty retained some reliability under collision penalty even though animats were evolved without it. (c) Similarities between G 0.50 and G blocked, as well as G single and G no-penalty, were also reflected in the animats' behavior. The behavior of animats in G blocked/no-penalty was more reactive to changing group size than G no-penalty. https://doi.org/10.1371/journal.pone.0228879.g009

Fig 10. Distribution of brain complexity measures. In evolutionary setups where crossing each other was not possible (G blocked and G blocked/no-penalty), the brain complexity was comparable to the complexity of G 0.50. By contrast, animats in setups where the reaction to fellow animats had no reasonable effect on their performance (G single and G no-penalty) showed lower brain complexity. Still, there was high variance in the data of brain complexity. https://doi.org/10.1371/journal.pone.0228879.g010

Varying cognitive architecture: Brain size and memory dependencies
In a second set of experiments, we used the same environmental setup as for G 0.50 in all tested conditions, but varied the number of available computational units in the animats' MBs. In the baseline design G 0.50 , it is possible for the motor units to act as additional memory units (see Methods section). In one condition, G no-feedback , the ability of the motor units to provide feedback was disabled, which reduced the absolute capacity for memory from six to four binary units. Moreover, we designed animats with similarly small memory capacity but with feedback motors as a reference group (G smallbrain ). Those animats had the original type of motors with the possibility of evolving feedback loops, but only two memory units instead of four. Finally, we included a condition with larger MBs with eight memory units and motor feedback (G bigbrain ).
We observed that evolved fitness EF and reliability R across group sizes in the original environment decreased for animats with fewer memory units (see Figs 5 and 6). However, while animats in G smallbrain still evolved to reasonably high fitness and reliability, G no-feedback was lacking in both. This observation indicates that motor feedback facilitates evolution in our task environment. One reason could be that motor feedback allows the animats to utilize information about past movements directly (e.g., like the sensation of one's legs). One behavioral difference between G no-feedback and G smallbrain was the reduced movement of the animats in G smallbrain (see Fig 6(C)). Furthermore, the state transition analysis shows that the motor units of animats in G smallbrain tend to change their behavior more often, while animats in G no-feedback stay in the same state more often (see Table 4). Notably, G no-feedback and, particularly, G smallbrain performed better than G 0.50 in the 4 Rooms and 4 Messy Rooms test conditions (see Fig 6(A), bottom row).

Fig 11. The average evolved fitness showed that animats in evolutionary setups without specific sensors for other animats (G no-agent and G w=a) achieved no reasonable fitness. By contrast, animats in G 3sides outperformed G 0.50 and G random, but also had more outliers with lower fitness and performed worse than the baseline condition G 0.50 in early generations (up to ~10,000 generations). https://doi.org/10.1371/journal.pone.0228879.g011
By contrast, more memory units (G bigbrain) do not improve the fitness evolution or the task fitness TF in any of the tested conditions (see Figs 5 and 6). While G bigbrain achieves similar results compared to the baseline setup G 0.50, differences can be observed in the Blocked and Small Gate test conditions, as well as in 4 (Messy) Rooms for large group sizes (see Fig 6(A)). In principle, more computational units should allow for better performance. However, the larger space of possible solutions may also impede fitness evolution (note the larger variance for G bigbrain compared to G 0.50 in Fig 5(B) and Fig 6(B)). Here, this trade-off may explain the similar mean ⟨EF⟩ and R values for G 0.50 and G bigbrain.

Considering brain complexity, the evolutionary setups with smaller MBs (G smallbrain and G no-feedback) have significantly lower Φ^Max and fewer concepts than the baseline condition (G 0.50). Between those two conditions, G smallbrain shows significantly higher Φ^Max and more concepts than G no-feedback (see Fig 7). This correlates with the larger evolved fitness values of G smallbrain in Fig 5 and its associated higher reliability R in Fig 6. Note that calculating Φ^Max and the corresponding number of concepts was not possible for G bigbrain, since exhaustive evaluations across many systems and states are not currently feasible when using the PyPhi software package to compute measures of integrated information theory for networks of that size (>10 units) [22].

Fig 12. (a-b) The G 3sides condition had the highest ⟨TF⟩ in most test conditions, except in Blocked and Noisy Corners. In terms of R, sensing everything (G w=a) with one sensor is still better than only sensing the walls (G no-agent). (c) Setups with few sensors evolved no typical behavior (high variance of movement between the 30 different evolutions, shaded area). The G 3sides setup becomes more reactive as soon as the animat density starts to rise and thus evolved a different behavioral strategy than G 0.50 and G random. https://doi.org/10.1371/journal.pone.0228879.g012

Fig 13. Distribution of brain complexity measures. Animats in the G 3sides condition showed the lowest brain complexity of all setups despite having the highest evolved fitness and reliability. By contrast, animats with limited sensor information (G no-agent and G w=a) had lower than baseline complexity values, but also low evolved fitness (EF, see Fig 11). https://doi.org/10.1371/journal.pone.0228879.g013

Varying interaction conditions: Evolution of beneficial interaction
In our baseline configuration for the evolution simulations (G 0.50 ), individuals could occupy the same physical location but received penalties for colliding with other group members (see Methods section). We manipulated these features in the third set of simulations to evaluate how they influence both evolved fitness and reliability. Specifically, we considered three additional evolutionary setups: G no-penalty , G blocked , and G blocked/no-penalty (see Table 1 for a detailed description). G single , G random , and G 0.50 are also included in the figures for comparison.
Among the novel setups, only animats in G blocked were subject to the collision penalty during evolution. Not being able to share the same position (as in G blocked) hardly influenced the evolved fitness EF, the mean task fitness ⟨TF⟩ across post-evolutionary conditions, or the behavior of the evolved animats compared to G 0.50 (see Figs 8 and 9). Likewise, G no-penalty, where reacting to other animats had no direct effect on the fitness evolution, showed very similar EF, ⟨TF⟩, and behavior to G single, with one exception: ⟨TF⟩ decreased with increasing group size in the No Penalty test condition for G single but not for G no-penalty, which had evolved with a group size of 36 animats, as in G 0.50 (see Fig 9(A)). Note that R in Fig 9(B) was evaluated in the Original task condition with penalty, as for all other simulation sets.

Table 3. Absolute difference between the state transition probability P of G 0.50 and G random (P(G 0.50) - P(G random)). The first digit (S) describes whether anything (wall or other animat) is sensed (1) or not sensed (0), and the second digit (M) describes whether the animat moved/turned (1) or did not move/turn (0). Most notably, G random animats performed more movements even in the absence of sensor inputs than G 0.50 ("01→01").

Table 4. Absolute difference between the state transition probability P of G smallbrain and G no-feedback (P(G smallbrain) - P(G no-feedback)). The first digit (S) describes whether anything (wall or other animat) is sensed (1) or not sensed (0), and the second digit (M) describes whether the animat moved/turned (1) or did not move/turn (0). Most notably, animats in G smallbrain switched more often between sensing and moving than animats in G no-feedback ("01→10", "10→01", but "11→11").
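The comparison of state transition probabilities summarized in Tables 3 and 4 can be sketched as follows. The sketch assumes that each time step is coded as a two-digit string "SM" (sensed or not, moved or not), that P is the relative frequency of each transition between consecutive time steps, and that the tabulated value is the signed difference between two setups; the function names and the toy sequences are hypothetical.

```python
from collections import Counter


def transition_probabilities(states):
    """Relative frequency of each transition between consecutive SM states."""
    pairs = list(zip(states, states[1:]))
    if not pairs:
        raise ValueError("need at least two time steps")
    counts = Counter(f"{prev}->{nxt}" for prev, nxt in pairs)
    return {key: count / len(pairs) for key, count in counts.items()}


def probability_difference(p_a, p_b):
    """P(a) - P(b) for every transition observed in either setup."""
    keys = sorted(set(p_a) | set(p_b))
    return {key: p_a.get(key, 0.0) - p_b.get(key, 0.0) for key in keys}


# Hypothetical SM sequences for two setups (not data from the study).
setup_a = ["01", "01", "10", "01", "01"]
setup_b = ["11", "11", "11"]

p_a = transition_probabilities(setup_a)
p_b = transition_probabilities(setup_b)
print(probability_difference(p_a, p_b))
```

A positive entry means the transition occurred more often in the first setup than in the second; a transition that never occurred in a setup is treated as having probability zero there.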

Considering the post-evolutionary tests in Fig 9(A), the top row shows ⟨TF⟩ across group sizes in the Original environment (with penalty) and under varying interaction conditions: No Penalty, Blocked, and both Blocked and No Penalty (from left to right). In the bottom row of Fig 9(A), animats are evaluated under the same interaction rules under which they evolved while facing only a modified environment (position of static obstacles).
In this context, it is noticeable that G no-penalty performed relatively poorly for larger group sizes when tested in 4 (Messy) Rooms despite receiving no penalty for collisions. By contrast, in evolutionary setups with a collision penalty and/or blocking, ⟨TF⟩ increased with group size in the 4 (Messy) Rooms test conditions. Moreover, the decline in ⟨TF⟩ of G blocked/no-penalty for larger group sizes under test conditions with a collision penalty (Original and Blocked) suggests that these animats did not avoid physical interactions with their group members. However, even G blocked/no-penalty animats had an advantage compared to G no-penalty in the 4 (Messy) Rooms environment. Taken together, these observations lead us to assume that any evolutionary pressure to "pay attention" to fellow animats (through blocking or a collision penalty) could lead to the evolution of interaction strategies with possible advantages under certain (modified) conditions (e.g., using other animats for orientation or guidance).
Considering the brain complexity of animats in G blocked and G blocked/no-penalty , we can report similar values compared to G 0.50 (see Fig 10). In summary, whether animats received a penalty for crossing each other, or whether crossing was prohibited to start with, did not significantly affect their evolved fitness, reliability, behavior, or brain complexity. Likewise, the brain complexity measures and behavioral results for G no-penalty were comparable to those of G single .

Varying sensor configuration: Sensory capacity influences reliability and brain complexity
We manipulated the animats' sensor configuration (see Table 1) in a final set of evolution simulations. In addition to the baseline architecture (front wall sensor and front agent sensor), we designed animats with sensors on three sides G 3sides (front, left and right wall and agent sensors), without an agent sensor G no-agent (one front wall sensor only) and with one universal sensor G w = a (sensing wall and agent as indiscriminate obstacles). Fig 11 reveals that our task environment required the ability to sense nearby animats and to differentiate between walls and animats in order to evolve reasonable EF values. Moreover, animats equipped with sensors on more sides achieved both higher evolved fitness EF and higher reliability R across group sizes than the baseline setup G 0.50 and G random (see Fig 11 and Fig 12B).
Overall, animats in the G 3sides condition consistently outperformed the animats in other groups except in two test conditions: Blocked and Noisy Corners (see Fig 12A). This shows that animats which are equipped with more sensors do have an advantage on average, but they may still perform worse than animats with fewer sensors under special circumstances (here: Noisy Corners). We assume that the sensory signals in these specific environments might have been too different from the information patterns the animats evolved in and were thus specialized for. Nevertheless, the additional sensors led to high reliability R across group sizes as well as relatively high task fitness for most modified wall-arrangements even though the animats evolved under a specific group size and a fixed wall configuration (see Fig 12A and 12B).
While G w = a animats had only one sensor which does not discriminate between the wall and other animats, G no-agent was missing the animat sensor completely. Still, G no-agent showed better task fitness than G w = a in test conditions with small group sizes and without a penalty. Considering the evolved behavior, G w = a animats (see Fig 12(C)) were not reactive to other animats, which suggests that they did not evolve the capacity to differentiate between the animats and the walls internally, e.g., through memory. While G w = a and G no-agent moved forward at similar rates, G w = a performed proportionally more turns than G no-agent , which stood still more often.
Analyzing the brain complexity showed that animats equipped with fewer, but also with more sensors than in the baseline setup G 0.50 evolved MBs with lower complexity (see Fig 13), albeit for different reasons. Based on the very low evolved fitness for G w = a and G no-agent (see Fig 11) we conclude that their MBs did not develop the necessary structure and mechanisms to solve the task, as reflected by their low brain complexity. By contrast, animats in G 3sides achieved high EF, ⟨TF⟩, and reliability R across group sizes, but did not evolve any integrated information (Φ Max = 0) in most cases. This observation was in line with previous findings on the relation between sensory capacity and internal complexity [20] and suggested that high brain complexity in cognitive systems depends on a need for internal memory and computation, which may decrease if an animat is equipped with more sensors.

Discussion
The evolution of cooperative multi-agent systems might be the next frontier in the context of evolving artificial agents. To date, however, not much is known about conditions that give rise to cooperative behavior and the complex inter-dependencies between individual and group goals [26]. For example, there might be many factors that influence whether the individuals either bow to the group or act by egoistic rules [27]. In this study, we used animats equipped with MBs (introduced by Edlund et al. [24]) to study how group performance and its reliability under modified conditions depends on the individual, interactions between individuals, as well as specific features of the MBs' evolution.

Prior work investigating group evolution
Earlier research that implemented groups of MBs concentrated on predator-prey environments and showed that animats can (co-)evolve swarm behavior [28][29][30]. The animat design in this work was generally based on a design in Marstaller et al. [16], who evolved individual MBs with the goal of solving perceptual-categorization tasks. Another method of simulating swarm behavior is neuro-evolution, i.e., the evolution of artificial neural networks (ANN) [31][32][33]. As in Olson et al. [29], these neuro-evolution experiments produced agents which evolve in a swarm to solve a predator-prey task.
Other researchers have investigated the effect of group size in the evolution of groups of simulated agents beyond predator-prey scenarios in a more general context. They find that the behavior of the group of agents and the individual agent is dependent on the group size [34,35]. In another study which changed the group size during evolution, the authors show that it can be easier for smaller groups than larger ones to organize themselves [5].
The effect of changing swarm sizes has also been investigated in the context of natural biological systems: Brown [27] examined which factors are decisive for the individual to either join a swarm or behave egoistically. The study focused on experimenting with environmental qualities and swarm size. Brown defined optimal swarm size as the best trade-off between the advantage of balancing costs between individuals in the swarm and the disadvantage of sharing the resources (energy/food) with the whole swarm. In an earlier study, Pacala et al. [4] report that swarm size constrains information transfer and task allocation. They argue that the information exchange varies and the task allocation changes, depending on the swarm size of ant colonies. Pacala et al. [4] also argue that swarm behavior is the product of social interaction, individual interaction, and the interaction with the given environment. In a more recent work [36], we found arguments that swarm behavior arises if there is sufficient density within the swarm.

Factors that impact evolved fitness and reliability
Generally, the ability to evolve high fitness in a given evolutionary setup depends on the interplay between external and internal factors such as the complexity of the environment and the animats' architecture (see also [20]). As examples of these factors, we manipulated the group size and the animats' sensorimotor and memory capacities across evolutionary setups. Further, we evaluated how these manipulations affected fitness evolution and post-evolutionary reliability.
Different group sizes. In the specific evolutionary setup investigated here, evolved fitness EF negatively correlated with group size, which is a result of the imposed penalty for collisions with other group members (see Figs 2 and 8; animats that evolved without the risk of penalty (G single and G no-penalty ) achieved the highest ⟨EF⟩). On the other hand, animats evolved in fixed, intermediate group sizes (e.g., G 0.50 and G 0.25 ) are most reliable under changes in group size as measured by R, and, in fact, comparable to G random , in which animats experienced random group sizes during evolution (see Fig 3(B)). The optimal group size for high R in our experiments is thus larger than the optimal group size for high EF, or individual fitness. This observation suggests, more generally, that unexpected changes in group size during evolution may sometimes lead to larger group sizes than expected based on what is best for an individual within the group.
Capacity for memory. Animats with less capacity for memory (G smallbrain and G no-feedback ) evolved to lower EF values than the baseline condition G 0.50 (see Fig 5). Further, the low memory setups were less reliable under changes in group size (low R). A higher memory capacity as in G bigbrain did not provide further advantages compared to G 0.50 . Given the higher variance of G bigbrain in EF and R, we suspect that the larger search space made it more difficult for the evolutionary algorithm to converge to an optimal solution.
Sensorimotor capacity. Finally, more sensors (G 3sides ) proved advantageous for evolved fitness EF, reliability R across group sizes, and task fitness TF under almost all modified test conditions, including most modified wall arrangements (Fig 12(B)). By contrast, training animats on multiple group sizes during evolution (G random ) led to high R, but did not translate to high task performance under modified wall arrangements (Fig 3(B)). We speculate that the additional sensors allowed the animats to evolve more generalizable strategies in our two-dimensional spatial-navigation task, even though they evolved in a single static environment.
Note that we did not include a comparison condition in which animats evolved under various wall-arrangements, since it is not trivial to determine a statistically representative sample of all possible environments as part of the evolutionary simulation. For the same reason, we did not quantify average reliability across modified wall-arrangements, but provided task fitness measures for each tested wall-arrangement (Figs 3, 6, 9 and 12(A)). In addition, Table G in S1 Text lists ⟨TF⟩ values for all evolutionary setups and test environments evaluated in this study.
Overall, our findings suggest that, in general, animats that were well-equipped for dealing with their original task environment (and thus achieved high evolved fitness) also performed better under modified conditions that were never encountered during evolution. Within most evolutionary setups, reliability R was correlated with evolved fitness (see Figs 3,6,9 and 12(B), right panel). The only exceptions were G single and G no-penalty , which did not adapt to the behavior of other group members at all. The high evolved fitness in G single and G no-penalty could thus be interpreted as a form of narrow intelligence. By comparison, intermediate group sizes led to a somewhat more general form of intelligence.
Nevertheless, our findings also show that evolutionary setups that seem less adapted (lower evolved fitness) overall may still have advantages under some special modifications. For example, animats evolved in larger groups (G 1.00 and G 0.75 ) or with less memory capacity (G smallbrain and G no-feedback ) performed better than G 0.50 under most modified wall-arrangements (see Figs 3 and 6(A), bottom row; Table G in S1 Text). On the other hand, even G 3sides performed worse than the baseline (G 0.50 ) in one of the modified test environments (Noisy Corners).
Interactions between individuals in the group. In this study, we did not explicitly implement any form of direct communication between animats. Nevertheless, we found that it was necessary for animats to perceive their fellow group members and to distinguish them from static obstacles to achieve reasonable evolved fitness EF and reliability R (see Figs 11 and 12, where both G no-agent and G w = a overall show low values). Moreover, we observed that evolved interaction strategies provided advantages under certain modified conditions: Animats that evolved without a collision penalty (G no-penalty ) performed worse in some of the modified environments, even if tested without receiving a penalty (see Fig 9(A), 4 (Messy) Rooms). While animats in G no-penalty were equipped with an agent sensor, they had no incentive to interact with or "pay attention" to their fellow agents. By contrast, the task fitness in the 4 (Messy) Rooms conditions typically increased with group size for animats that evolved in groups and received either a collision penalty (e.g., G 0.25 -G 1.0 ) and/or could not pass other agents (G blocked and G blocked/no-penalty ) (see Figs 3(A) and 9(A)). This indicates that they may have used other agents for orientation or guidance, a form of implicit cooperation. Indeed, animats evolved in large groups (G 0.75 and G 1.0 ) showed higher task fitness than G 0.50 in these particular modified test environments (see Fig 3(A), bottom; Table G in S1 Text).
As we know from previous studies, swarm behavior in nature can be the result of simple reactions to local neighbors [3,37]. For example, it could be a good strategy to stay close to a group member without hitting it. Such evolved behavior may then provide additional fitness advantages under some modified conditions (as in the 4 (Messy) Rooms test condition here). The observed instances of cooperative behavior can thus be viewed as an emergent phenomenon of the evolutionary process.

Relation between brain complexity, evolved fitness, and reliability
Previous studies applying measures of integrated information to adaptive animats equipped with MBs [20,24,38] have observed that, on average, Φ Max and related measures for brain complexity increase over the course of evolution, which correlates with increasing evolved fitness EF (see Table G in S1 Text). Moreover, as demonstrated in [20], this increase depends on the complexity of the environment relative to the animats' sensor capacity: MBs that evolved in environments which require more memory and internal computation developed higher average Φ Max values and a higher number of concepts.
For the evolutionary setups with the baseline animat architecture as in G 0.50 , we found the highest values of Φ Max and #Concepts(Φ Max ) for medium group sizes G 0.50 , G blocked , and for G random . These setups were also among the most reliable across group sizes (see also [18] for similar results using a simplified measure of brain complexity). By contrast, significantly lower Φ Max values were found for G single and G no-penalty , the two setups in which task fitness during evolution did not depend on interactions with other animats. As argued above, G single and G no-penalty thus effectively evolved within a simpler task environment than G 0.50 , G blocked , and G random , which explains their lower brain complexity Φ Max .
Compared to G 0.50 , evolutionary setups with altered animat architectures showed consistently lower values of Φ Max and #Concepts(Φ Max ). Limiting the animats' sensor capacity (G no-agent and G w = a ) or the number of available memory units (G smallbrain and G no-feedback ) interfered with their capacity for successful evolution in the spatial navigation task. Their lower evolved fitness was thus accompanied by less developed MBs with lower Φ Max and fewer concepts. Given more time to evolve (more generations), both their performance and their brain complexity might still increase. By contrast, more sensors allowed for better performance (EF, TF, and R) based on high amounts of external information, which effectively decreased the need for internal complexity (memory and computations) and thus may also lead to low Φ Max , as observed here for G 3sides .
In theory, high fitness in any given environment could be achieved without information integration (Φ Max = 0) if no restrictions are imposed on the animats' architecture (e.g., by a system with a large feed-forward architecture [19]). Moreover, information integration can be high even if there is no reasonable fitness, which partially explains the large variance in the brain complexity measures (see, e.g., outliers for G no-agent in Fig 13). However, given a certain requirement for memory and context sensitivity, constraints in the number of sensors and memory elements may give rise to an empirical lower boundary on the amount of integrated information necessary to perform a given task [20,24,38,39].
In summary, for a given MB architecture, higher brain complexity seems to be related to better performance and reliability. However, future work should explore under which environmental conditions additional sensors, or more internal units, become more advantageous for the evolution of higher fitness (EF) and reliability (R).

Limitations
Our work modeled one particular, small-scale scenario of a multi-agent evolutionary setting. Future work should consider other types of environments which may strengthen the generality of our results. Moreover, further evolution or training scenarios for artificial organisms should be considered as well; here, for example, we do not use crossover in the genetic algorithm, and all animats placed in the same environment are clones. In addition, Markov Brains are just one type of computational substrate and it would be interesting to see whether other types of substrates (e.g. Artificial Neural Networks) behave differently under modified test conditions [40]. Nevertheless, the results obtained in our simulation study could also be directly compared against certain types of biological models (e.g. investigating the behavior of army ants under environmental modifications [36,37]).
While the measures that we employed to assess the complexity of the evolved MBs are theoretically motivated [19], they are also computationally very complex. This made it difficult to evaluate a larger sample size (number of evolution simulations) or to analyze the brain complexity of more generations (not only the final one). This is why alternative, approximate measures should be considered, too. For instance, the largest strongly connected component (and other graph metrics) can be used as a proxy for system integration and thus brain complexity [18]. Efficient approximations would also enable investigations into how brain complexity develops across generations as performed in [20] for slightly smaller MBs. Moreover, Φ Max , and the associated number of concepts #Concepts(Φ Max ), are causal measures that assess the degree to which the mechanisms within a MB are differentiated and integrated. Future work should also consider and explore alternative informational or dynamical measures (e.g., [41][42][43]). In this study, we concentrated on changes in task fitness and reliability under modified conditions, so the brain complexity analysis was not the subject of more in-depth investigation.

Conclusion
It is challenging to remain reliable in a dynamic and volatile world while also trying to succeed in a given task. Investigating the characteristics of this reliability, especially with regards to cooperative behavior, might also be useful to develop implications and strategies for improving the reliability of individuals within larger organizations. Despite complex dependencies between the individual, the group, and the environment, our computational approach offers a way to investigate reliability in group behavior. Here, we were particularly interested in the question of how cognitive and environmental constraints influence the reliability of simulated animats in a group. We were able to isolate essential influencing factors to better understand possible positive and negative effects of changing group size, environment design, and individual cognitive ability on reliability and task fitness under modified conditions. In particular, our study suggests that balancing the number of individuals in a group may lead to higher reliability under unforeseen changes in group size, even if the task itself would be simpler with fewer group members.
Moreover, a minimal number of sensors, the ability and incentive to distinguish static obstacles from other group members, and a minimal number of memory units were required to achieve high evolved fitness and reliability in our specific evolution simulations. If these minimal requirements were met, reliability R across group sizes was found to correlate with evolved fitness across the tested evolutionary setups. Limited sensor information forced the animats to evolve more complex brain structures, especially for intermediate group sizes, which also demonstrated the most reliable behavior across group sizes. Nevertheless, the highest task fitness across most modified conditions (varying group sizes as well as modified wall-arrangements) was observed for the evolutionary setup with additional sensors, which did not require high internal complexity. Finally, we presented data that support the evolution of implicit cooperation between animats. In all, this research asserts that task efficiency and effectiveness are not the only goals in dynamic environments; task reliability is also worth striving for.

Materials and methods
We used an evolutionary algorithm to generate simulated animats evolving in groups under various evolutionary setups (see Table 1), testing different animat architectures and evolutionary conditions to evolve animats having heterogeneous behavior, evolved fitness, and reliability. Afterwards, we conducted post-evolutionary tests to assess the reliability of the different evolutionary setups under modified conditions (see Table 2). This section explains the animat designs, the environment, the evolutionary simulations, and the experiment setup. We used MABE (Modular Agent-Based Evolver) [44] as a computational evolution framework with the same parameters as in previous work [18] (see Table in S1 Table).
We chose MBs as a simplified model of an artificial brain, since the basic idea of an MB is to emulate the recurrent connectivity structure found in real neural networks in a simple manner, while being complex enough to represent a cognitive system [16]. Furthermore, a recent study showed that MBs can be very competitive with variations of artificial neural networks and even showed higher performance in general [17]. Nevertheless, it would, in principle, also be possible to use a finite state machine [21] or artificial neural networks [32] to solve the kind of task investigated here.
Individual animats had to solve a two-dimensional spatial-navigation task in the presence of other animats (clones), thus forcing individuals to react to these other animats in order to reach a high fitness value. This task was a redesign by Fischer et al. [18] of a task environment initially developed by Koenig et al. [21]. An animat can usually differentiate between static (borders and walls) and dynamic objects (animats) in the environment through two distinct sensors. This design allowed for the evolution of social behavior based on passive interactions between animats (we observed, e.g., "waiting", or "following" behavior).

Animat architecture
The evolutionary algorithm evolves animats with MBs, which contain a set of discrete, binary computational units ("neurons"). Each unit has its own update rule, receiving inputs from and sending its output to other units. In this study, the decision system (the connectivity between units and their update- [...]

The integers in an animat's genome encode the HMGs: the number of HMGs, their lookup tables, the connected input units, and the connected output units. The MBs evolve by mutating the genome in each new generation (see [29,40]). Each locus in the genome mutated with a certain probability (point mutations). In addition, larger sections could be deleted or added to the genome [24,45] (again, all parameters are listed in Table in S1 Table). We did not use crossover or recombination (more than one parent per genome), since this would make it more difficult to trace an animat's line of descent without obvious computational advantages in the simple evolutionary setting investigated here. In principle, other optimization algorithms could be employed to develop well-performing MBs. The evolutionary algorithm used here has the advantage that both the node connectivity and the nodes' update rules can be encoded in the genome and jointly adapted through mutation and fitness selection.

All units in the animat's MB have binary states, either 1 or 0. A sensor turns 1 if an obstacle is detected and a motor switches to 1 if it is active. Two motors provide the ability to turn 90 degrees left or right, and to move forward (if both motors are in state 1). Since the units within a MB can be interconnected in a recurrent manner, they have the potential to create internal memory. We evolved animats with five different animat designs displayed in Fig 15. The baseline cognitive architecture was introduced already in [18] (one front wall sensor, one front agent sensor, four memory units, and two motors).
Here, further deviations were designed to investigate the influence of an animat's sensorimotor and memory capacities on the resulting evolved fitness and the animats' task fitness and reliability under modified post-evolutionary test conditions. The sensors had a detection range of one unit. Typically, the motor units could also feed back to the memory and motor units, thus acting as additional memory capacity, since knowledge about previous motor states is directly available for computing the next state. One animat design was included that lacked the possibility for motor feedback (G no-feedback ).

Fig 15. [...] [18]. The animats have two motor units (grey triangles), four memory units (dark grey circles) and one to six sensor units (black/red shapes). The middle row shows animats with a changed sensor architecture, from the left: The architecture with sensors on three sides, the architecture with a single sensor unit, detecting wall and animat indiscriminately, and the architecture without an animat sensor. The bottom row shows animats with changed memory architecture, from the left: The architecture with only two memory units, the architecture with eight memory units and the architecture without feedback motors (motors cannot be part of the memory network). Note that the architectures depict the maximal number of units available. Whether any given unit is actually used depends on the evolved connectivity and logic function. Animats are initialized in the first generation without connections between units.
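The update scheme described above (binary units, gates with lookup tables, two motors) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MABE implementation: the gates are deterministic, the unit indices and the `Gate`, `brain_step`, and `decode_action` names are hypothetical, outputs written to the same unit are combined with a logical OR, and the mapping of single motors to left and right turns is arbitrary.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical unit layout for the baseline design (indices are illustrative).
WALL_SENSOR, AGENT_SENSOR = 0, 1
MEMORY = (2, 3, 4, 5)
MOTOR_A, MOTOR_B = 6, 7
N_UNITS = 8


class Gate:
    """A deterministic gate: reads some units, writes others via a lookup table."""

    def __init__(self, inputs: Sequence[int], outputs: Sequence[int],
                 table: Dict[Tuple[int, ...], Tuple[int, ...]]):
        self.inputs = tuple(inputs)
        self.outputs = tuple(outputs)
        self.table = table


def brain_step(state: List[int], gates: List[Gate]) -> List[int]:
    """Compute the next binary state of all units from the current state."""
    next_state = [0] * len(state)
    for gate in gates:
        key = tuple(state[i] for i in gate.inputs)
        for unit, bit in zip(gate.outputs, gate.table[key]):
            # Assumption: several gates writing to one unit are OR-ed together.
            next_state[unit] |= bit
    return next_state


def decode_action(state: List[int]) -> str:
    """Translate the two motor bits into a movement."""
    a, b = state[MOTOR_A], state[MOTOR_B]
    if a and b:
        return "forward"      # both motors active, as described in the text
    if a:
        return "turn_left"    # assumption: which motor maps to which direction
    if b:
        return "turn_right"
    return "stay"


# A single hand-written gate: go forward unless a wall is sensed, then turn.
avoid_wall = Gate(inputs=[WALL_SENSOR], outputs=[MOTOR_A, MOTOR_B],
                  table={(0,): (1, 1), (1,): (1, 0)})

free_state = [0] * N_UNITS
blocked_state = [1] + [0] * (N_UNITS - 1)
action_free = decode_action(brain_step(free_state, [avoid_wall]))
action_blocked = decode_action(brain_step(blocked_state, [avoid_wall]))
```

In an evolved MB, the gates, their wiring, and their tables would come from the genome rather than being written by hand, and gates reading from memory or motor units would provide the recurrence that acts as internal memory.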

Design of the 2D environment
All experiments simulated a two-dimensional environment. The world has 32×32 units (see Fig 16). All animats started on one of 72 predefined, uniformly distributed, starting positions. The selection for the starting position, as well as an animat's initial orientation, was random at every new generation. The original environment (see Fig 16(A)) had two rooms, which are connected by a gate. The animats' goal was to travel between the two rooms in order to achieve a high fitness value. This design was adapted from the work of Koenig et al. [21]. All evolutionary setups evolved in the original environment. As an additional test dimension for evaluating task fitness under modified conditions, we tested all evolved MBs (the final generation) in four modified environment designs (see Fig 16(B)-16(E)). Generally, animats were allowed to inhabit the same location in the environment (albeit under penalty, see below), except in G blocked and G blocked/no-penalty .

Experiment design
We selected G 0.50 to be the baseline setup for evolution, to which we compared all other evolutionary setups. This was because G 0.50 showed the highest reliability R across group sizes. In sum, we came up with 15 different setups for the evolution of the animats (see Table 1). Using the MABE framework, we simulated each evolutionary setup 30 times. In each of these 30 evolutions, the evolutionary algorithm had 10,000 generations to converge on the final solution. A population of 100 genomes was mutated and evaluated in each generation. Each of these evaluations was repeated 30 times (30 "test runs") with random starting positions, orientation, and selection order for simulating the animats' movement serially. Random seeds were chosen using a Mersenne-Twister (mt19937) random number generator (see S2 Text for a more detailed explanation of the parameter sampling). After a genome was tested 30 times, it received a fitness score, which was computed based on the mean across the task performance of 30 single animats, with one being picked randomly from each of the 30 random test runs. In addition, in setup G random the group size varied for each of the 30 tests. The specific group size was drawn randomly from a vector ([1, 4, 7, 11, 14, 18, 22, 25, 29, 32, 36, 40, 43, 47, 50, 54, 58, 61, 65, 68, 72]). This vector simulates a uniform distribution between 1 and 72.
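The evaluation of a single genome described above can be sketched as follows. This is a minimal Python illustration, not the MABE code: `evaluate_genome`, `run_trial`, and `dummy_trial` are hypothetical names, and the dummy trial merely stands in for the actual simulation. Python's `random` module is itself based on the Mersenne Twister, which matches the generator named in the text.

```python
import random
from statistics import mean

# Group sizes used for G random, as listed in the text.
GROUP_SIZES = [1, 4, 7, 11, 14, 18, 22, 25, 29, 32, 36,
               40, 43, 47, 50, 54, 58, 61, 65, 68, 72]
N_TEST_RUNS = 30


def evaluate_genome(run_trial, rng, fixed_group_size=None):
    """Mean score of one randomly picked animat per test run.

    run_trial(group_size, rng) is a hypothetical hook that simulates one
    trial and returns the scores of all clones in the group.
    """
    picked = []
    for _ in range(N_TEST_RUNS):
        if fixed_group_size is None:
            size = rng.choice(GROUP_SIZES)   # G random: new size per test run
        else:
            size = fixed_group_size          # e.g. 36 animats for G 0.50
        scores = run_trial(size, rng)
        picked.append(rng.choice(scores))    # one animat per test run
    return mean(picked)


def dummy_trial(group_size, rng):
    """Stand-in for the real simulation: arbitrary scores between 0 and 4."""
    return [rng.uniform(0.0, 4.0) for _ in range(group_size)]


rng = random.Random(42)  # seeded Mersenne Twister generator
fitness_random = evaluate_genome(dummy_trial, rng)
fitness_fixed = evaluate_genome(dummy_trial, rng, fixed_group_size=36)
```

Picking a single animat per run, rather than averaging over the whole group, keeps the number of samples per genome constant across group sizes.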

The simulated life
The fitness function F that determines the probability of a genome being reproduced depends on two factors. First, animats A have to travel as often as possible through the gate (change the room, see Fig 16). Second, the animats need to avoid colliding with each other. Fischer et al. [18] already included the formal definitions of the fitness function as a weighted sum of the penalty for collision and the reward for crossing the gate (see Table 5 for the mathematical notation of Eqs 4 and 5): The amount of reward (+1.0 points) is higher than the amount subtracted in the case of a penalty (-0.075 points). These numbers need to be chosen carefully. If the penalty is too low or the reward is too high, animats will keep moving from one room to the other through the gate (herding effect) and ignore the penalty. On the other hand, given a high penalty and low reward, animats will evolve hardly any movement. To further reduce the herding effect around the gate, there is a refractory period of 100 timesteps after receiving a reward before the same animat can receive another reward. Since each trial has a duration T of 500 timesteps, any one animat can receive a total fitness score of at most 4 points [18].
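The reward and penalty scheme described in this paragraph can be sketched in Python. This is a hedged illustration under stated assumptions, not a reproduction of Eqs 4 and 5: it assumes that the penalty is applied once per timestep in which an animat shares its position with another animat, and that a crossing counts again once at least 100 timesteps have passed since the last rewarded crossing. The function name and its arguments are hypothetical.

```python
REWARD = 1.0        # points per rewarded gate crossing (from the text)
PENALTY = 0.075     # points subtracted per collision (from the text)
REFRACTORY = 100    # timesteps without reward after a rewarded crossing
TRIAL_LENGTH = 500  # timesteps per trial


def animat_fitness(gate_crossings, collision_steps):
    """Score of one animat in one trial.

    gate_crossings: timesteps (1..TRIAL_LENGTH) at which the gate was crossed.
    collision_steps: number of timesteps spent on a cell shared with
                     another animat (assumed unit of the penalty).
    """
    score = 0.0
    last_rewarded = None
    for t in sorted(gate_crossings):
        # Assumption: a crossing is rewarded if the refractory period has
        # fully elapsed since the previous rewarded crossing.
        if last_rewarded is None or t - last_rewarded >= REFRACTORY:
            score += REWARD
            last_rewarded = t
    return score - PENALTY * collision_steps


# Hypothetical trial: the crossing at t=100 falls into the refractory period.
example_score = animat_fitness([50, 100, 160, 300], collision_steps=2)
```

With these numbers, three of the four crossings are rewarded and two collision steps are penalized, so a single gate crossing outweighs about thirteen collision steps, which illustrates why the ratio between reward and penalty has to be chosen carefully.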
To investigate the coordination and cooperation of animats in groups, we let animats coexist in the same environment (in contrast to previous studies in this scope [16,19,24]). Currently, we have not implemented co-evolution of animats with different genomes and have only evaluated a genome by generating animats as identical clones (with the same MBs). There was no active knowledge exchange ("communication") between animats in this study. Animats had to develop the ability to distinguish which kind of sensory input to use for decision making. As specified above, sensors can only sense one position in front of (or, for G 3sides , on the side of) the animat and differentiate between static objects (walls) and dynamic objects (fellow animats), except for G w = a .
Compared to the baseline setup, we included further evolutionary setups in which animats did not receive the collision penalty and/or were not able to overlap (G no-penalty , G blocked, G blocked/no-penalty ). Those changes in the fitness function represented environmental rules which influenced the task difficulty. As a result, we were able to test the role that the imposed interaction conditions between animats played in order to achieve high task fitness under modified conditions.

Table 5. Mathematical notation as used in the fitness function F(A) and f(a).

Symbol | Description
a ∈ A | A single animat a in the set of all animats A in a trial.
f(a) | The fitness of a single animat a.
F(A) | The average fitness of all animats in A as clones of a single genome.
rand(A) | Picks a random animat a from the group A.
g(a, t_a, t_b) | Returns the number of gate-crossings between time t_a and time t_b for a single animat a, where t ∈ T and T = [1, 2, . . ., 499, 500].
c(x, y, t) | Returns the number of animats at a specific position (x, y) at time t.

https://doi.org/10.1371/journal.pone.0228879.t005

Post-evolutionary evaluation
Modified conditions. Post-evolutionary task fitness tests were designed as follows: First, we selected the 30 genomes of generation 10,000 (10k) for each of the 15 evolutionary setups (see Table 1). Second, each genome was tested across 21 conditions varying in group size in the Original test condition. To this end, we created groups of animat clones of the respective test group size for each of the 30 × 15 genomes. Test group sizes were uniformly distributed between 1 and 72. The tested group sizes were [1, 4, 7, 11, 14, 18, 22, 25, 29, 32, 36, 40, 43, 47, 50, 54, 58, 61, 65, 68, 72]. A single animat is not a group, but we treat it as one in order to simplify notation.
In addition to varying group sizes in the baseline task design (Original), we created four modified test environments, as shown in Fig 16 (Noisy Corners, Small Gate, 4 Rooms, 4 Messy Rooms). Moreover, we included three additional test conditions in which we varied the interaction conditions of the animats (No Penalty, Blocked, Blocked and no penalty). Finally, we tested each of the 30×15×21 different configurations in each of the eight test environments.
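As a rough illustration of the size of this test design, the following Python sketch enumerates every combination of evolutionary setup, genome, and test group size, and pairs each with the eight test environments. Setups and genomes are represented by placeholder indices because only their counts matter here; the tuple layout is an assumption made for illustration.

```python
from itertools import product

N_SETUPS = 15   # evolutionary setups
N_GENOMES = 30  # genomes of generation 10k per setup
GROUP_SIZES = [1, 4, 7, 11, 14, 18, 22, 25, 29, 32, 36,
               40, 43, 47, 50, 54, 58, 61, 65, 68, 72]
ENVIRONMENTS = [
    "Original", "Noisy Corners", "Small Gate", "4 Rooms", "4 Messy Rooms",
    "No Penalty", "Blocked", "Blocked and no penalty",
]

# One configuration = (setup index, genome index, test group size).
configurations = list(product(range(N_SETUPS), range(N_GENOMES), GROUP_SIZES))

# Every configuration is evaluated in every test environment.
test_runs = list(product(configurations, ENVIRONMENTS))

print(len(configurations), len(test_runs))
```

The count follows directly from the design: 30 × 15 × 21 = 9,450 configurations, each tested in 8 environments, gives 75,600 configuration-environment pairs.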
For the statistical analysis and the main reliability evaluations, we defined a quantitative reliability measure R across group sizes in the Original environment design (see Eq 3 above). The modified test environments represented four independent samples of possible environmental modifications. For this reason, they were evaluated on their own in terms of the achieved task fitness TF. The results of the remaining three test conditions with varying interaction properties mainly served to highlight differences between the evolutionary setups, rather than testing reliability per se.
Brain complexity. To evaluate the complexity of the evolved MBs, we employed two complementary measures provided by integrated information theory (IIT) [19,46]: Φ^Max and the associated number of concepts #Concepts(Φ^Max). The core of IIT's measures is an information-theoretic, probabilistic graph analysis [19] based on the state-to-state transition probabilities of the units, i.e., their update functions. Please refer to [19,20] for details on the evaluation. Very briefly, to evaluate the integrated information Φ ("big phi") for a particular set of computational units S in state S = s, the first step is to assess which subsets Y ⊆ S specify positive integrated information φ > 0 ("small phi") within the system (the set's "concepts"). φ captures how much a set of elements Y within the system in its state y constrains the prior and next states of other system subsets V_{t±1} ⊆ S. In simplified terms, φ(y_t) is the distance D between p̂(V_{t±1} | y_t) and its partitioned counterpart, minimized over partitions Ψ, where Ψ partitions p̂(V_{t±1} | y_t) into the product distribution p̂(V_{1,t±1} | y_{1,t}) × p̂(V_{2,t±1} | y_{2,t}), and D is a distance measure between two probability distributions. The ^ ("hat" symbol) above the probability function p indicates that probabilities are interventional (obtained from system perturbations) rather than observational [19,47]. V_{t±1} are chosen such that φ(y_t) is maximal. Second, Φ is measured as the minimal difference that any system partition Ψ_S makes to the overall information specified by all subsets Y with φ(y_t) > 0. For a given MB, we search across all sets of computational units S for the one with Φ^Max = max_S Φ. Φ^Max represents the highest possible integrated information the MB can achieve across all its subsets, which we used as an indicator for brain complexity [19].
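For orientation, the simplified definitions described above can be written compactly as follows. This is a sketch reconstructed from the verbal description, not a verbatim copy of the displayed equations: the partition symbol is written Ψ, the symbol \mathcal{C}(s_t) for the set of all concepts of the system in state s_t is an assumed notation, and using the same distance D in both lines is a simplification. See [19,20] for the exact formulation.

```latex
% Sketch only: reconstructed from the prose description; \mathcal{C} is assumed notation.
\begin{align}
\varphi(y_t) &= \min_{\Psi} D\Big(\hat{p}(V_{t\pm1}\mid y_t),\;
  \hat{p}(V_{1,t\pm1}\mid y_{1,t}) \times \hat{p}(V_{2,t\pm1}\mid y_{2,t})\Big) \\
\Phi(s_t) &= \min_{\Psi_S} D\Big(\mathcal{C}(s_t),\; \mathcal{C}(s_t \mid \Psi_S)\Big) \\
\Phi^{\mathrm{Max}} &= \max_{S}\, \Phi
\end{align}
```

The first line measures how irreducible a single subset Y is, the second how irreducible the system as a whole is with respect to all of its concepts, and the third selects the set of units with the highest integrated information.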
All calculations were conducted using the IIT Python package pyphi [22], which we used to calculate Φ^Max and the corresponding number of concepts. Since the employed measures are state-dependent, we evaluated Φ^Max and the number of concepts for every state an MB experienced during a lifetime (one trial) and selected the maximum value over all states, as in [20]. S1 Fig in Supporting Information illustrates that high Φ^Max requires many elements of a system to be integrated, which also means maintaining functional feedback loops within the system. In this study, we only considered the brain complexity of the final generation (10k) because of the computational cost of the pyphi calculations.
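The state-dependent evaluation described above amounts to computing a measure once per distinct state visited during a trial and keeping the maximum. The Python sketch below illustrates only this aggregation step. The function names are hypothetical, and `toy_measure` is a stand-in that merely counts active units; in the study the per-state values come from pyphi, which is deliberately not called here.

```python
def max_over_visited_states(trajectory, measure):
    """Evaluate `measure` once per distinct visited state and return the maximum.

    trajectory: sequence of brain states (tuples of 0/1) over one trial.
    measure:    callable mapping a state to a number (stand-in for a
                state-dependent IIT quantity).
    Returns (maximum value, dictionary of per-state values).
    """
    per_state = {state: measure(state) for state in set(trajectory)}
    return max(per_state.values()), per_state


def toy_measure(state):
    """Placeholder only: counts active units. This is NOT integrated information."""
    return sum(state)


# Invented trajectory of a 3-unit brain, for illustration only.
trajectory = [(0, 0, 1), (1, 0, 1), (0, 0, 1), (1, 1, 1)]
best, per_state = max_over_visited_states(trajectory, toy_measure)
print(best, len(per_state))
```

Evaluating each distinct state only once matters in practice, because a trial of 500 time steps typically revisits the same states many times and each evaluation is expensive.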
Statistics. The evolved fitness values EF, the reliability R, and the IIT brain complexity measures were statistically evaluated across all evolutionary setups using a Kruskal-Wallis test, which showed a significant difference of the observed statistics between all groups taken together. Further, we used the Mann-Whitney U test to evaluate the difference between pairs of evolutionary setups. Tables A-G in S1 Text list all statistical tests that are discussed in the Results and Discussion sections.
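To illustrate what the two rank-based tests measure, the following pure-Python sketch computes the Kruskal-Wallis H statistic (without tie correction) and the Mann-Whitney U statistic from mid-ranks. It shows the test statistics only, computes no p-values, and is not the analysis code of the study; in practice a statistics library would normally be used. The sample values are invented toy numbers.

```python
def midranks(values):
    """Rank values from 1 to N, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic over several groups (no tie correction applied)."""
    pooled = [value for group in groups for value in group]
    ranks = midranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def mann_whitney_u(x, y):
    """U statistic of sample x: pairs where x exceeds y (ties count 0.5)."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Toy samples standing in for three evolutionary setups.
setup_a, setup_b, setup_c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
h = kruskal_wallis_h([setup_a, setup_b, setup_c])
u_ab = mann_whitney_u(setup_a, setup_b)
u_ba = mann_whitney_u(setup_b, setup_a)
print(h, u_ab, u_ba)
```

The Kruskal-Wallis statistic answers whether any setup differs from the others, while the pairwise U statistics locate which setups differ, mirroring the two-stage procedure described above.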
Supporting information
S1 Fig. Brain wiring diagram. (a) Best animat in evolution #4 under condition G random with an evolved fitness EF = 3.1 and Φ^Max = 0. The network structure shows only a few feedback loops, which cannot produce integrated information. (b) Best animat in evolution #1 under condition G random with an evolved fitness EF = 2.9 and Φ^Max = 7.77. The network structure shows many more connections, which integrate the network states and make them interdependent. (TIFF)
S1 Table.