Survival games for humans and machines

Our main findings are as follows: (1) the Terrain agent performs above human level, while the Grid agent performs below human level; (2) the smell, touch, and interoception models contribute significantly to the performance of the Grid agent; (3) the memory model contributes significantly to the performance of the Grid agent; and (4) the performance of the Grid agent is relatively stable under three quite different reward signals, including one that rewards survival and nothing else.


Introduction
Comparing the performance of humans and machines at particular tasks has been a recurrent theme in philosophy, psychology, and AI. It is also a central question in artificial general intelligence (AGI) (Goertzel, 2014), given its ultimate goal of constructing machines that perform above the human level at virtually all tasks. There are AI models today that perform at the average human level or above at tasks such as chess (Campbell, Hoane, & Hsu, 2002), Jeopardy! (Ferrucci, 2012), image recognition (Dan, Ueli, Jonathan, & Jürgen, 2012; LeCun, Bengio, & Hinton, 2015), video games (Hessel et al., 2018), IQ tests (Strannegård, Amirghasemi, & Ulfsbäcker, 2013; Strannegård, Cirillo, & Ström, 2013), and the SAT test (Achiam et al., 2023). On the other hand, developing embodied AI that can operate in the real world at the human level has been a long-standing challenge, e.g., in the case of household robots (Shafi, Mohammed, Sheela, Muthumanickam, & Kumar, 2023) and autonomous driving (Yurtsever, Lambert, Carballo, & Takeda, 2020). Also in theoretical domains such as mathematics, humans continue to have the upper hand, despite recent progress in large language models (Achiam et al., 2023) and neuro-symbolic systems (Trinh, Wu, Le, He, & Luong, 2024). Furthermore, despite spectacular advances in generative AI, human creators continue to dominate the market in artistic domains such as creative writing (Chang et al., 2023), visual arts (Brooks et al., 2022; Liu et al., 2024; Zhang et al., 2022), and music (Agostinelli et al., 2023). In many of these domains, human-AI collaboration (Dellermann, Ebel, Söllner, & Leimeister, 2019) might be a powerful option.
A substantial research effort has gone into developing AI models that play games, for example vintage video games like Breakout (Hessel et al., 2018), strategic board games like go (Silver et al., 2016), and strategic video games like StarCraft II (Vinyals et al., 2019). A key paradigm behind these achievements is reinforcement learning (RL) (Sutton & Barto, 2018), in particular deep RL (Mnih et al., 2015), sometimes combined with Monte-Carlo Tree Search (Silver et al., 2016). One category of games that has been studied in this context is survival games, where the goal of the player is to survive by navigating a digital world with resources, enemies, and obstacles of various kinds. Survival games might be interesting from the perspective of theoretical biology, given that animals must survive until maturity in their natural environments to be able to reproduce. One of the first deep RL models for playing Ms. Pac-Man performed only at 13% of human level (Mnih et al., 2015). Later deep RL models for Ms. Pac-Man have, however, been able to perform at superhuman level (Van Seijen et al., 2017). On the other hand, human-level play at survival games such as Minecraft (Duncan, 2011) and Crafter (Hafner, 2021) has been a long-standing challenge for AI research (Hafner, Pasukonis, Ba, & Lillicrap, 2023; Milani et al., 2020).
In this paper we compare human and machine performance on a particular type of survival game that we call pure, where (i) the only goal of the game is survival; (ii) the player has only partial information about the environment; and (iii) the environment is randomly generated at the start. Thus, in pure survival games there is no mission like collecting diamonds, no complete map of the entire environment, and the environment is essentially new to the player. Pure survival games are intended to model the core challenges facing animals in the real world, e.g., finding food and water, circumnavigating obstacles, and avoiding predators. No previous study has focused on comparing human and machine performance on such games, as far as we are aware.
The purpose of the present paper is to address this knowledge gap and thus contribute to the research on the relation between human and machine intelligence. Specifically, we consider the following research questions: (1) To what extent can RL agents match human performance at pure survival games? (2) How do variations in the perceptual system affect the performance of the RL agents? (3) How do variations in the memory system affect the performance of the RL agents? (4) How do variations in the reward system affect the performance of the RL agents?

Survival games
Early examples of survival games include Pong from 1972, Space Invaders from 1978, and Ms. Pac-Man from 1982 (Mnih et al., 2015).
More recent examples are Minecraft from 2009 and Crafter from 2021. There is a multitude of algorithms for playing survival games, many of which are based on deep RL operating on Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) (Sutton & Barto, 2018). These deep RL algorithms can be combined with curriculum learning (Bengio, Louradour, Collobert, & Weston, 2009), which divides the learning process into stages of increasingly difficult environments; continual learning, with agents that continue to learn throughout their lifetime (Khetarpal, Riemer, Rish, & Precup, 2022); and memory, e.g., in the form of Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997).
Building versatile RL agents might require training on large classes of MDPs, e.g., randomly generated MDPs, rather than a fixed MDP for each level of the game. Randomly generated data are commonly used to increase generality and avoid overfitting, e.g., domain randomization in robotics (Chen, Hu, Jin, Li, & Wang, 2021) and data augmentation in image processing (Shorten & Khoshgoftaar, 2019). Let us give a brief overview of some survival games that have been studied in the context of AI.
Atari. Deep RL models, for example Rainbow (Hessel et al., 2018), have been used for playing classic Atari games, including Pong, Space Invaders, and Ms. Pac-Man, at superhuman level.
CoinRun. Randomly generated levels were used for training deep RL models for the survival games CoinRun and CoinRun Platforms (Cobbe, Klimov, Hesse, Kim, & Schulman, 2019). The same paper also explored randomly generated Kruskal mazes, which are solvable with the algorithm "move forward with the right hand constantly touching the wall".

Minecraft.
Minecraft is set in a procedurally generated world that the player can explore freely, but only observe partially. Minecraft can be played in creative mode, a sandbox mode with no goals other than creative play and entertainment, where the player can build objects freely. It can also be played in survival mode, where the player "mines" natural resources such as stone and iron and then "crafts" items such as armor, swords, and pickaxes. The player must also stay alive by eating, sleeping, and avoiding being killed. The goal of Minecraft in survival mode is not only to survive, but also to accomplish various other achievements such as mining, building objects, and exploring. An optional goal is to defeat the so-called Ender Dragon. Several RL-based models for Minecraft struggle to reach human-level play (Hafner et al., 2023; Milani et al., 2020). The recent Voyager model outperforms several of these algorithms in terms of the number of achievements reached (Wang et al., 2023). Voyager uses GPT-4 (Achiam et al., 2023) to produce code for controlling the Minecraft agent in an iterative process in which the prompt is refined gradually. No comparison to human performance was reported in Wang et al. (2023).

Methods
In this section we describe our experiments with human players and RL agents on two pure survival games.

Participants
The participants in our experiment were eleven students from the University of Gothenburg and Chalmers University of Technology. They were all in the age range 20-30 years and included five women and seven men. All participants had prior experience of playing survival games. They were recruited via ads posted on campus and were offered snacks in return for taking part in the experiment.

Material
Two pure survival games were used in the experiment: a third-person game called the Grid game and a first-person game called the Terrain game. Both games used worlds that were randomly generated at test time.
Grid game. The Grid game is played in randomly generated grid worlds of size 64 × 64 pixels, containing energy sources, water sources, and obstacles. A screenshot from the Grid game is shown in Fig. 1. The challenge for the player is to stay alive for 500 steps and not die earlier from lack of resources. This requires collecting at least ten units of energy and ten units of water. The agent consumes a fixed amount of energy and water at each time step.
Terrain game. The Terrain game is played in the game engine Unity. The environment consists of procedurally generated mountain-like terrain with trees, boulders, and two kinds of resources. The topography is randomly generated using Perlin noise (Perlin, 1985). The player controls a hare model that moves in the environment and consumes resources automatically when within one length unit of their location. The goal of the player is to stay alive for 1000 steps, which requires repeatedly consuming resources of both types and navigating efficiently. In particular, the player must avoid slopes that are too steep to climb, as well as unnecessarily long paths and unnecessarily steep slopes, since these consume more resources than necessary. A screenshot from the Terrain game is shown in Fig. 2.
The Terrain game takes place within the Unity game engine in an environment measuring 100 × 100 × 20 length units, where the numbers represent the width, length, and height, respectively. One length unit is taken to correspond to one meter, as that is the default interpretation of the physics engine. The game objects are scaled using the same frame of reference.

Procedure
The experiment was conducted in a public space on the Lindholmen campus of the University of Gothenburg and Chalmers University of Technology. Each participant was assigned an individual laptop computer. A short practice session was conducted before the test to familiarize the participants with the two computer games. During the experiment, each participant played one of the games for ten minutes, then switched and played the other game for ten minutes. Half of the participants started with the Grid game and the other half with the Terrain game. Performance data in the form of survival times were recorded for each participant and each round of play. In total, the participants played 68 rounds on the Grid game and 55 rounds on the Terrain game. The Grid agent and the Terrain agent, which are described in the next section, were also tested on worlds drawn from the same probability distributions as those used for the human players.

Performance measure
The score of a player p (a human or a machine) on a world w is the survival time of p in w divided by the maximum lifetime of p. For instance, the score of an agent that survived 400 steps in a Grid world would be 400/500 = 80%. For each player, we computed the average score over the worlds that it was tested on.
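For concreteness, the score computation can be written as a couple of short Python functions; the function names below are ours, and the 500-step maximum lifetime is the Grid game value given above.

```python
def survival_score(survival_steps: int, max_lifetime: int) -> float:
    """Score of a player on one world: survival time divided by the maximum lifetime."""
    return survival_steps / max_lifetime

# Example from the text: surviving 400 of 500 steps in a Grid world gives 80%.
assert survival_score(400, 500) == 0.8

def average_score(survival_times: list[int], max_lifetime: int) -> float:
    """A player's overall score: the average over the worlds it was tested on."""
    return sum(survival_score(t, max_lifetime) for t in survival_times) / len(survival_times)
```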

AI agents
In this section we describe the Grid agent and the Terrain agent.

Perception
The Grid agent is equipped with a perceptual system that models several senses.
Vision. To model vision we use a 31 × 31 pixel image. Vision is egocentric in the sense that the agent is always located at the center of the image. Each image has four Boolean channels for obstacles, walls, and the two resources. It also has a float channel representing the agent's last few positions. An example of such an image is given in Fig. 3.
Smell. The agents can perceive the smell of the two types of resources. Each smell consists of a direction vector and an intensity scalar. These signals are computed as aggregations of the smells emanating from all resource objects.
Touch. To model touch we use an egocentric 5 × 5 pixel image.
Positioning. To model magnetoreception we use an x-coordinate and a y-coordinate that specify the current position of the agent in the 64 × 64 world.
Interoception. The internal signals are the resource levels resource_i, for 1 ≤ i ≤ n, where the number of resources is n = 2 in our case.
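To make the perceptual interface concrete, the sketch below packs the modalities described above into a single observation space, assuming the Gymnasium API; the field names, shapes, dtypes, and bounds are our own illustrative choices, not necessarily the authors' implementation.

```python
# A minimal sketch of the Grid agent's observation space, assuming Gymnasium.
import numpy as np
from gymnasium import spaces

grid_observation_space = spaces.Dict({
    # Egocentric 31 x 31 image: 4 Boolean channels (obstacles, walls, two
    # resources) plus 1 float channel for the trail of recent positions.
    "vision": spaces.Box(low=0.0, high=1.0, shape=(5, 31, 31), dtype=np.float32),
    # Smell of each resource type: a 2D direction vector and an intensity scalar.
    "smell": spaces.Box(low=-np.inf, high=np.inf, shape=(2, 3), dtype=np.float32),
    # Egocentric 5 x 5 touch image.
    "touch": spaces.Box(low=0.0, high=1.0, shape=(5, 5), dtype=np.float32),
    # Positioning ("magnetoreception"): (x, y) in the 64 x 64 world.
    "position": spaces.Box(low=0.0, high=63.0, shape=(2,), dtype=np.float32),
    # Interoception: internal levels of the two resources, each in [0, 5].
    "interoception": spaces.Box(low=0.0, high=5.0, shape=(2,), dtype=np.float32),
})
```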

Actions
The Grid agent has three actions for navigating the grid worlds: (i) move forward; (ii) turn left 90 degrees; and (iii) turn right 90 degrees. There is no action for consuming resources in the present model. Instead, resources are consumed automatically when the agent is at their exact location.

Reward system
We introduce a notion of utility, with the intuitive idea that the agents strive to maximize their utility in the long run (and survive as a side effect). The utility u(t) at time t is a function of the agent's internal resource levels; note that it exhibits a law of diminishing returns, since it uses a log function. The reward signal used for training our RL agents, which we call Homeostatic (Strannegård et al., 2022), is defined in Eq. (2).
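As a hedged illustration of how such a utility and reward could look, the sketch below assumes a logarithmic utility over the internal resource levels and a reward equal to the change in utility between consecutive steps; both the functional form and the constant eps are our assumptions, not necessarily the paper's Eq. (2).

```python
import math

def utility(resource_levels: list[float], eps: float = 1e-6) -> float:
    """One plausible reading of u(t): a sum of logarithms of the internal
    resource levels, giving the diminishing returns mentioned in the text."""
    return sum(math.log(level + eps) for level in resource_levels)

def homeostatic_reward(levels_t: list[float], levels_prev: list[float]) -> float:
    """Assumed Homeostatic reward: the change in utility between consecutive
    time steps, so the agent is rewarded for moving towards homeostasis."""
    return utility(levels_t) - utility(levels_prev)

# Consuming energy when the level is low yields a larger reward than consuming
# it when the level is already high (diminishing returns).
print(homeostatic_reward([1.25, 3.0], [1.0, 3.0]))   # larger
print(homeostatic_reward([4.25, 3.0], [4.0, 3.0]))   # smaller
```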

Memory
The agents have the following types of memory:
• A Long Short-Term Memory (LSTM).
• A Trail: a 31 × 31 image channel indicating the agent's last few positions, with higher color intensity representing more recently visited positions (a sketch of the Trail update is given below, after this list).
• A previous direction: an integer indicating the direction of the previous move: east, north, west, or south.
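A minimal sketch of how the Trail channel could be maintained: older positions fade out, and the current position is written at full intensity. The decay factor 0.9 and the world-coordinate bookkeeping (with a 31 × 31 egocentric crop taken afterwards) are our assumptions.

```python
import numpy as np

WORLD, VIEW = 64, 31

def update_trail(trail: np.ndarray, agent_xy: tuple[int, int],
                 decay: float = 0.9) -> np.ndarray:
    """Sketch of the Trail memory: recently visited cells are brighter,
    older ones fade out. The decay factor 0.9 is an assumed value."""
    trail = trail * decay                  # older positions fade out
    trail[agent_xy[1], agent_xy[0]] = 1.0  # current position at full intensity
    return trail

# World-sized trail; the 31 x 31 egocentric crop around the agent is what
# would be stacked with the vision channels (border padding omitted here).
trail = np.zeros((WORLD, WORLD), dtype=np.float32)
for pos in [(30, 30), (31, 30), (32, 30)]:
    trail = update_trail(trail, pos)
x, y = 32, 30
crop = trail[y - VIEW // 2: y + VIEW // 2 + 1, x - VIEW // 2: x + VIEW // 2 + 1]
```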

Decision-making
For decision-making we use a version of the actor-critic RL algorithm PPO (Schulman et al., 2017). Our neural network architecture is an elaboration of the so-called Nature architecture (Mnih et al., 2015), with extra inputs for the additional perceptual modalities and an LSTM memory. The architecture also bears similarities to LSTM-SPCNN (Stanić et al., 2022), but uses fewer parameters.

Metabolism
Agents can metabolize resources of two kinds: energy and water. The internal resource levels are perceived by the agents and lie in the range [0, 5] for both energy and water. When the level of energy or water is at the maximum value 5, consuming more has no effect. Unless the maximum value is reached, the agent gains +0.25 energy (water) when consuming an item of energy (water). The agent also has a basic metabolic cost of 0.01 per step for both energy and water. Moreover, there is an additional energy cost of 0.1 for hitting obstacles or walls (from "bleeding").
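The metabolism of the Grid agent can be summarized in a single update function. The constants below (maximum level 5, gain 0.25, basal cost 0.01, bleeding cost 0.1) are taken from the text; the order in which the gain and the costs are applied within a step is our assumption.

```python
def metabolic_step(levels: dict[str, float], consumed: set[str],
                   hit_obstacle: bool) -> dict[str, float]:
    """One metabolic update for the Grid agent: clip levels to [0, 5], add
    +0.25 per consumed item, subtract the basal cost of 0.01 per step, and
    subtract an extra 0.1 energy ("bleeding") when hitting an obstacle or wall."""
    MAX_LEVEL, GAIN, BASAL, BLEED = 5.0, 0.25, 0.01, 0.1
    new_levels = {}
    for resource, level in levels.items():
        if resource in consumed:
            level = min(level + GAIN, MAX_LEVEL)
        level -= BASAL
        if resource == "energy" and hit_obstacle:
            level -= BLEED
        new_levels[resource] = level
    return new_levels

levels = {"energy": 2.0, "water": 2.0}
levels = metabolic_step(levels, consumed={"energy"}, hit_obstacle=False)
# The agent dies when any level reaches 0 or when it reaches its max age (500 steps).
alive = all(v > 0 for v in levels.values())
```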

Death
The agent dies if it reaches its maximum age (500 steps) or if some resource level reaches 0. Intuitively, the agents can die from old age, starvation, or thirst.

Training
We defined a notion of complexity for grid worlds in terms of the number of resources and obstacles. All agents were trained with curriculum learning in 25 stages, using random worlds of increasing complexity. At each stage, the policy network was trained with the Stable-Baselines3 (Raffin et al., 2021) implementation of the actor-critic algorithm PPO, mentioned above. We used the default discount rate of 0.99 and learning rate of 0.0003. Each episode used a previously unseen random world and lasted as long as the agent survived, i.e., until it reached its maximum age or died from lack of resources.
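A sketch of the curriculum-training loop under stated assumptions: GridWorldEnv is a hypothetical Gymnasium environment whose constructor takes a complexity level, the per-stage step budget is invented, and, because the agent uses an LSTM, we show the recurrent PPO variant from sb3-contrib as one way to realize the setup, rather than the plain Stable-Baselines3 PPO named in the text.

```python
# Sketch of 25-stage curriculum learning with a recurrent PPO agent.
from sb3_contrib import RecurrentPPO
from grid_survival_env import GridWorldEnv  # hypothetical module and class

model = None
for stage in range(25):                      # 25 curriculum stages
    env = GridWorldEnv(complexity=stage)     # more resources/obstacles per stage
    if model is None:
        model = RecurrentPPO(
            "MultiInputLstmPolicy",          # dict observations + LSTM memory
            env,
            gamma=0.99,                      # default discount rate
            learning_rate=3e-4,              # default learning rate
            verbose=1,
        )
    else:
        model.set_env(env)                   # continue training on harder worlds
    model.learn(total_timesteps=200_000)     # per-stage budget (assumed)

model.save("grid_agent")
```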

Terrain agent
The interoception, reward system, and decision-making mechanism of the Terrain agent are similar to those of the Grid agent.

Perception
Vision. We use a 40 × 20 pixel image representing the perspective projection of objects fully or partially enclosed within a vision cone defined by longitudinal and latitudinal viewing angles of 40 and 20 degrees, respectively, with a maximum viewing distance of 40 m. The vision cone emanates from between the hare model's eyes, located 60 centimetres above its paws. Each image has four continuous channels for obstacles, terrain, and the two resources, respectively. Each pixel takes a value in [0, 1], representing the distance to the nearest volumetric pixel mapping to that particular 2D pixel of the retina. The corresponding pixels on the other channels are set to 1, which is regarded as infinity by the agent. Fig. 4 shows an example of such a "retina" image.
Touch. To model touch we take note of where the agent's collider (a geometrical shape around the agent) intersects with the collider(s) of one or more obstacles, up to a maximum of two points, expressed in the local coordinate system of the agent. If there are fewer intersections, the touch sensory vector is padded with one or two zero vectors. Further, to model the touch of the paws, the sensory vector is appended with a Boolean value that indicates whether the agent touches the ground, and lastly with the normalized Euclidean normal vector at the point where the vertical center line of the agent's collider intersects with the terrain, irrespective of whether the agent is on the ground or in the air, expressed in local coordinates.
Proprioception. Proprioception is the perception of the position and movement of the body. The agent has access to its own velocity vector, expressed in local coordinates relative to the agent. The agent is also aware of its vertical gaze angle, which lies in the interval −45 to 45 degrees in discrete steps of 7.5 degrees.

Actions
The player can choose between three actions:
• Apply a force vector at the center of gravity of the agent's collider, along the z-axis of the local orthogonal coordinate system described by the normalized planar projections of the x- and z-axes of the agent's local coordinate system onto the plane perpendicular to the terrain normal at the point where the terrain and the extended vertical center line of the agent's collider meet, plus the normalized aforementioned normal vector. Informally, the action can be expressed as "move forwards or backwards from where the agent's nose would point if it were to rigidly orient its vertical posture to coincide with the ground normal" (a sketch of this projection is given below, after the list). The magnitude of the force vector has to be in the continuous range 0 to 75. Note that the agent does not act upon this action while in the air. Also note that any rotational momentum that might arise from applying the vector is configured to be ignored by the physics engine.
• Turn clockwise or counterclockwise with an angular velocity in the continuous range 0 to 270 degrees/s.
• Tilt the gaze upwards or downwards by 7.5 degrees. Do nothing if the updated tilt would exceed 45 degrees in either direction.
Note that there is no action for consuming resources. Instead, resources are automatically consumed (ingested) when the agent is within one length unit of their location.
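The direction of the force vector in the first action can be computed by projecting the agent's forward axis onto the tangent plane of the terrain and re-normalizing. The NumPy sketch below illustrates this; the function and variable names are ours.

```python
import numpy as np

def slope_aligned_forward(agent_forward: np.ndarray, terrain_normal: np.ndarray) -> np.ndarray:
    """Unit vector along which the force is applied: the planar projection of
    the agent's forward axis onto the plane whose normal is the terrain normal
    at the point below the agent."""
    n = terrain_normal / np.linalg.norm(terrain_normal)
    projected = agent_forward - np.dot(agent_forward, n) * n
    return projected / np.linalg.norm(projected)

# Example: facing along +z on a slope tilted 20 degrees about the x-axis.
forward = np.array([0.0, 0.0, 1.0])
normal = np.array([0.0, np.cos(np.radians(20)), -np.sin(np.radians(20))])
direction = slope_aligned_forward(forward, normal)
# The applied force is then magnitude * direction, with magnitude in [0, 75].
```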

Memory
The agents are equipped with an LSTM memory and stacked observations from the previous three time steps.

Metabolism
Agents can consume and spend resources of two kinds (white and blue). One may think of the resources as energy and water. The resource levels are continuous and range over [0, 1] for each kind of resource. When a level is at the maximum value 1, consuming more has no effect. Consuming a resource item always fully replenishes the corresponding resource level. The agents have a basic metabolic rate of 0.005 per decision step for each resource. Also, for each resource, the cost of movement c is defined by an equation involving the velocity vector v, the angular velocity scalar ω, and the constants 75 and 270 (the maximum force magnitude and the maximum angular velocity, respectively). There is also an additional cost for colliding with obstacles (from injury), which exclusively affects the energy level.
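As a hedged illustration, the sketch below shows one plausible movement cost: a term that grows with the normalized speed and turning rate, added on top of the basic metabolic rate of 0.005 per decision step. The coefficient k and the functional form are our assumptions, not the paper's equation.

```python
import numpy as np

BASAL = 0.005               # basic metabolic rate per decision step (from the text)
F_MAX, W_MAX = 75.0, 270.0  # maximum force magnitude and angular velocity (action limits)

def movement_cost(velocity: np.ndarray, angular_velocity: float,
                  k: float = 0.005) -> float:
    """One plausible movement cost per resource: proportional to the normalized
    speed and turning rate. Both k and the functional form are assumptions."""
    return k * (np.linalg.norm(velocity) / F_MAX + abs(angular_velocity) / W_MAX)

def step_cost(velocity: np.ndarray, angular_velocity: float) -> float:
    """Total resource cost of one decision step (excluding collision injury)."""
    return BASAL + movement_cost(velocity, angular_velocity)

# Example: moving at speed 5 while turning at 90 degrees/s.
print(step_cost(np.array([5.0, 0.0, 0.0]), 90.0))
```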

Death
The agent dies if it reaches its maximum age (1000 steps) or if some resource level reaches 0. Intuitively, the agents can die from old age, starvation, injury, or thirst.

Training
We defined a notion of complexity for terrain worlds in terms of the number of resources, obstacles, and competitors. The Perlin noise scale (Perlin, 1985), which measures the general steepness and undulation of the landscape, is also part of the complexity measure. All agents were trained with curriculum learning in several stages, using random worlds of increasing complexity, where the final training stage had 500 trees, 50 energy sources, 50 water sources, 9 competing agents, and a Perlin noise scale of 4. In comparison, the first training stage had 100 trees and a Perlin noise scale of 0, i.e., a flat surface. At each stage, the policy network was trained with the PyTorch implementation of the actor-critic algorithm PPO (Schulman et al., 2017). We used a discount rate of 0.995 and a learning rate of 0.00003. Each training stage used a previously unseen random world.

Results
In this section we present the results of the experiments described in Section 3. We also show how variations of the Grid agent with respect to perception, memory, and reward influence its performance. Each variation of the Grid agent was tested in 12 runs, and for each run, we selected the model that achieved the highest life expectancy on the final environment during training. For each variation, the reported life expectancy is the mean value over the best-performing models of these runs.

Humans versus machines
Fig. 5 shows the performance of the test participants and the RL agents on the Grid game and the Terrain game. The data indicate that the Grid agent performs below human level and the Terrain agent above human level.

Perceptual variations
The learning curves of five perceptual variations of the Grid agent are shown in Fig. 6, and the performance of the best agents of each variation that emerged during training is shown in Table 1. It is evident from these data that performance remains unaffected by the absence of positioning (GPS), whereas the absence of touch, smell, or interoception significantly impairs performance.

Memory variations
The learning curves of four memory variations of the Grid agent are shown in Fig. 7, and the performance of the best agents of each variation that emerged during training is shown in Table 2. These data clearly demonstrate a significant performance difference between the variations with and without the LSTM memory.

Reward variations
In addition to the reward signal Homeostatic defined in Eq. (2), we explored two other reward signals:
• Classic: +1 awarded when consuming a resource of any kind and −1 when hitting an obstacle.
• Heartbeat: +1 awarded at each time step when alive.
The learning curves of the three variations of the Grid agent with these different reward signals are shown in Fig. 8, and the performance of the best agents is presented in Table 3. At the end of training, all three variations show comparable performance, with each variation falling within one standard deviation of the others.

Humans vs. machines
In general, it is not straightforward to compare the performance of humans and AI agents. For example, humans typically need to input their decisions mechanically via a keyboard or a joystick, while AI

Fig. 1. Screenshot from the Grid game. The player's position and orientation are represented by the red arrow, while the green and blue squares represent, respectively, energy and water resources. The black rectangles represent walls. The size of the entire world is 64 × 64 pixels. The player can only see a subset of the world consisting of 31 × 31 pixels. The green and blue arrows indicate the direction of the aggregated smell of energy and water, respectively, and the bars on the right display the intensity of the smell.

Fig. 2. Screenshot showing the first-person perspective view of a player-controlled hare model in the Terrain game. The health bars in the upper left corner represent the level of energy (red) and water (blue) pertaining to the player. Blue-lit flowers represent water sources, while red-lit flowers represent energy sources. A computer-controlled hare model (with its associated health bars above it) competing for the resources can be seen in the middle left part of the picture.

Fig. 3. Example of a perception image of size 31 × 31 with yellow obstacles, blue walls, and red and green resources. The agent is located at the center, in position (15, 15), facing right. There is also a memory trail in shades of orange, representing the agent's last few positions.

Fig. 4. An example of the visual perception of an agent in the Terrain world. Red tree stems, green ground, white energy sources, and blue water sources can be seen. Objects appear brighter the closer they are.

Fig. 5. Boxplots showing the performance on the Grid game (left) and the Terrain game (right), on previously unseen test worlds, of the human participants, the RL agents, and agents taking random actions. Each blue dot represents the average performance of a test participant playing several rounds of the game.

Fig. 6. Learning curves of five perceptual variations of the Grid agent. The learning curves of all agents start at time step 0 with life expectancy 0.1 (the average performance with random actions).

Fig. 7. Learning curves of four memory variations of the Grid agent. The three agents with an LSTM perform much better than the agent without an LSTM.

Table 1
Life expectancy with standard deviation of Grid agents trained with different perceptual variations and tested on several previously unseen worlds. The data indicate that agents with smell, interoception, and touch have a significant advantage in terms of life expectancy.

Table 2
Life expectancy with standard deviation of Grid agents trained with different memory variations and tested on several previously unseen worlds. The data indicate that agents with an LSTM memory perform significantly better than the agent without an LSTM. They also indicate that other types of memory are of little additional help in the presence of an LSTM.

Table 3
Life expectancy with standard deviation of Grid agents trained with different reward variations and tested on several previously unseen worlds.

Fig. 8. Learning curves of three reward variations of the Grid agent. The agent with the Heartbeat reward requires a longer training time, but after four million training steps, all agents reach a similar performance level.