A Survey on Large Language Model based Autonomous Agents

Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes and thus makes it hard for the agents to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at https://github.com/Paitesanshi/LLM-Agent-Survey.


Introduction
"An autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future." Franklin and Graesser (1997) Autonomous agents have long been recognized as a promising approach to achieving artificial general intelligence (AGI), which is expected to accomplish tasks through self-directed planning and actions.
In previous studies, the agents are assumed to act based on simple and heuristic policy functions, and to learn in isolated and restricted environments [113,96,134,60,11,127]. Such assumptions differ significantly from the human learning process, since the human mind is highly complex, and individuals can learn from a much wider variety of environments. Because of these gaps, the agents obtained from the previous studies are usually far from replicating human-level decision processes, especially in unconstrained, open-domain settings.
Figure 1: Illustration of the growth trend in the field of LLM-based autonomous agents. We present the cumulative number of papers published from January 2021 to August 2023. We assign different colors to represent various agent categories. For example, a game agent aims to simulate a game player, while a tool agent mainly focuses on tool use. For each time period, we provide a curated list of studies with diverse agent categories.
In recent years, large language models (LLMs) have achieved notable successes, demonstrating significant potential in attaining human-like intelligence [120,127,11,4,146,147]. This capability arises from leveraging comprehensive training datasets alongside a substantial number of model parameters. Building upon this capability, there has been a growing research area that employs LLMs as central controllers to construct autonomous agents with human-like decision-making capabilities [21,139,138,126,133,184,136]. Along this direction, researchers have developed numerous promising models (see Figure 1 for an overview of this field), where the key idea is to equip LLMs with crucial human capabilities like memory and planning to make them behave like humans and complete various tasks effectively. Previously, these models were proposed independently, with limited efforts made to summarize and compare them holistically. However, we believe that a systematic summary of this rapidly developing field is important both for understanding it comprehensively and for inspiring future research.
In this paper, we conduct a comprehensive survey of the field of LLM-based autonomous agents. Specifically, we organize our survey around three aspects: the construction, application, and evaluation of LLM-based autonomous agents. For the agent construction, we focus on two problems, that is, (1) how to design the agent architecture to better leverage LLMs, and (2) how to inspire and enhance the agent capability to complete different tasks. Intuitively, the first problem aims to build the hardware fundamentals for the agent, while the second problem focuses on providing the agent with software resources. For the first problem, we present a unified agent framework, which can encompass most of the previous studies. For the second problem, we provide a summary of the commonly used strategies for agents' capability acquisition. In addition to discussing agent construction, we also provide an overview of the applications of LLM-based autonomous agents in social science, natural science, and engineering. Finally, we delve into the strategies for evaluating LLM-based autonomous agents, focusing on both subjective and objective strategies.
In summary, this survey conducts a systematic review and establishes comprehensive taxonomies for existing studies in the field of LLM-based autonomous agents. We focus on three aspects: agent construction, application, and evaluation. Drawing from previous studies, we identify various challenges in this field and discuss potential future directions. We believe that this field is still in its early stages; hence, we maintain a repository to keep track of ongoing studies at https://github.com/Paitesanshi/LLM-Agent-Survey. We expect that our survey can provide newcomers to the field of LLM-based autonomous agents with comprehensive background knowledge, and also encourage further groundbreaking studies.
Autonomous agents typically perform tasks by assuming specific roles, such as coders, teachers and domain experts [124,39]. The profiling module aims to indicate the profiles of the agent roles, which are usually written into the prompt to influence the LLM behaviors. Agent profiles typically encompass basic information such as age, gender, and career [121], as well as psychology information, reflecting the personalities of the agents [149], and social information, detailing the relationships between agents [149]. The choice of information to profile the agent is largely determined by the specific application scenarios. For instance, if the application aims to study human cognitive processes, then the psychology information becomes pivotal. After identifying the types of profile information, the next important problem is to create specific profiles for the agents. Existing literature commonly employs the following three strategies.
Handcrafting Method: in this method, agent profiles are manually specified. For instance, if one would like to design agents with different personalities, one can use "you are an outgoing person" or "you are an introverted person" to profile the agent. The handcrafting method has been leveraged in a lot of previous work to indicate the agent profiles. For example, Generative Agent [176] describes the agent by information such as name, objectives, and relationships with other agents. MetaGPT [64], ChatDev [124], and Self-collaboration [33] predefine various roles and their corresponding responsibilities in software development, manually assigning distinct profiles to each agent to facilitate collaboration. PTLLM [131] aims to explore and quantify personality traits displayed in texts generated by LLMs. This method guides LLMs in generating diverse responses by manually defining various agent characters through the use of personality assessment tools such as IPIP-NEO [77] and BFI [76]. [31] studies the toxicity of the LLM output by manually prompting LLMs with different roles, such as politicians, journalists and businesspersons. In general, the handcrafting method is very flexible, since one can assign any profile information to the agents. However, it can also be labor-intensive, particularly when dealing with a large number of agents.
LLM-generation Method: in this method, agent profiles are automatically generated based on LLMs. Typically, it begins by indicating the profile generation rules, elucidating the composition and attributes of the agent profiles within the target population. Then, one can optionally specify several seed agent profiles to serve as few-shot examples. At last, LLMs are leveraged to generate all the agent profiles. For example, RecAgent [150] first creates seed profiles for a small number of agents by manually crafting their backgrounds like age, gender, personal traits, and movie preferences. Then, it leverages ChatGPT to generate more agent profiles based on the seed information. The LLM-generation method can save significant time when the number of agents is large, but it may lack precise control over the generated profiles.
Dataset Alignment Method: in this method, the agent profiles are obtained from real-world datasets. Typically, one can first organize the information about real humans in the datasets into natural language prompts, and then leverage them to profile the agents. For instance, in [5], the authors assign roles to GPT-3 based on the demographic backgrounds (such as race/ethnicity, gender, age, and state of residence) of participants in the American National Election Studies (ANES). They subsequently investigate whether GPT-3 can produce similar results to those of real humans. The dataset alignment method accurately captures the attributes of the real population, thereby making the agent behaviors more meaningful and reflective of real-world scenarios.
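As a rough illustration of the dataset alignment idea, the sketch below renders structured survey-style records as natural-language profile prompts. The `Respondent` fields, the sample records, and the prompt wording are all hypothetical and are not taken from ANES or from [5]; a handcrafted profile would simply be a manually written string of the same kind.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    """One record from a hypothetical survey dataset (not the real ANES schema)."""
    age: int
    gender: str
    race_ethnicity: str
    state: str


def profile_prompt(r: Respondent) -> str:
    """Turn a structured record into a natural-language profile for the LLM prompt."""
    return (
        f"You are a {r.age}-year-old {r.race_ethnicity} {r.gender} "
        f"living in {r.state}. Answer the following question as this person would."
    )


# Made-up records, used only to show the mapping from data to prompt.
records = [
    Respondent(34, "woman", "Hispanic", "Texas"),
    Respondent(61, "man", "white", "Ohio"),
]
prompts = [profile_prompt(r) for r in records]
print(prompts[0])
```

Each prompt would then be prepended to the task question before querying the LLM, so that one agent is created per record.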
Remark. While most of the previous work leverages the above profile generation strategies independently, we argue that combining them may yield additional benefits. For example, in order to predict social developments via agent simulation, one can leverage real-world datasets to profile a subset of the agents, thereby accurately reflecting the current social status. Subsequently, roles that do not exist in the real world but may emerge in the future can be manually assigned to the other agents, enabling the prediction of future social development. The profile module serves as the foundation for agent design, exerting significant influence on the agent memorization, planning, and action procedures.

Memory Module
The memory module plays a very important role in the agent architecture design. It stores information perceived from the environment and leverages the recorded memories to facilitate future actions. The memory module can help the agent to accumulate experiences, self-evolve, and behave in a more consistent, reasonable, and effective manner. This section provides a comprehensive overview of the memory module, focusing on its structures, formats, and operations.
Memory Structures: LLM-based autonomous agents usually incorporate principles and mechanisms derived from cognitive science research on human memory processes. Human memory follows a general progression from sensory memory that registers perceptual inputs, to short-term memory that maintains information transiently, to long-term memory that consolidates information over extended periods. When designing the agent memory structures, researchers take inspiration from these aspects of human memory. Specifically, short-term memory is analogous to the input information within the context window constrained by the transformer architecture. Long-term memory resembles the external vector storage that agents can rapidly query and retrieve from as needed. In the following, we introduce two commonly used memory structures based on the short- and long-term memories.
• Unified Memory. This structure only simulates the human short-term memory, which is usually realized by in-context learning, and the memory information is directly written into the prompts. For example, RLP [54] is a conversation agent, which maintains internal states for the speaker and listener. During each round of conversation, these states serve as LLM prompts, functioning as the agent's short-term memory. SayPlan [129] is an embodied agent specifically designed for task planning. In this agent, the scene graphs and environment feedback serve as the agent's short-term memory, guiding its actions. CALYPSO [183] is an agent designed for the game Dungeons & Dragons, which can assist Dungeon Masters in the creation and narration of stories. Its short-term memory is built upon scene descriptions, monster information, and previous summaries. DEPS [154] is also a game agent, but it is developed for Minecraft. The agent initially generates task plans and then utilizes them to prompt LLMs, which in turn produce actions to complete the task. These plans can be deemed the agent's short-term memory. In practice, implementing short-term memory is straightforward and can enhance an agent's ability to perceive recent or contextually sensitive behaviors and observations.
• Hybrid Memory. This structure explicitly models the human short-term and long-term memories. The short-term memory temporarily buffers recent perceptions, while long-term memory consolidates important information over time. For instance, Generative Agent [121] employs a hybrid memory structure to facilitate agent behaviors. The short-term memory contains the context information about the agent's current situation, while the long-term memory stores the agent's past behaviors and thoughts, which can be retrieved according to the current events. AgentSims [99] also implements a hybrid memory architecture. The information provided in the prompt can be considered as short-term memory. In order to enhance the storage capacity of memory, the authors propose a long-term memory system that utilizes a vector database, facilitating efficient storage and retrieval. Specifically, the agent's daily memories are encoded as embeddings and stored in the vector database. If the agent needs to recall its previous memories, the long-term memory system retrieves relevant information using embedding similarities. This process can improve the consistency of the agent's behavior. In GITM [184], the short-term memory stores the current trajectory, and the long-term memory saves reference plans summarized from successful prior trajectories. Long-term memory provides stable knowledge, while short-term memory allows flexible planning. Reflexion [139] utilizes a short-term sliding window to capture recent feedback and incorporates persistent long-term storage to retain condensed insights. This combination allows for the utilization of both detailed immediate experiences and high-level abstractions. SCM [92] selectively activates the most relevant long-term knowledge to combine with short-term memory, enabling reasoning over complex contextual dialogues. SimplyRetrieve [117] utilizes user queries as short-term memory and stores long-term memory using external knowledge bases. This design enhances the model accuracy while guaranteeing user privacy. MemorySandbox [72] implements long-term and short-term memory by utilizing a 2D canvas to store memory objects, which can then be accessed throughout various conversations. Users can create multiple conversations with different agents on the same canvas, facilitating the sharing of memory objects through a simple drag-and-drop interface. In practice, integrating both short-term and long-term memories can enhance an agent's ability for long-range reasoning and accumulation of valuable experiences, which are crucial for accomplishing tasks in complex environments.
Remark. Careful readers may notice that there could exist another type of memory structure, namely one based only on long-term memory. However, we find that this type of memory is rarely documented in the literature. Our speculation is that the agents are always situated in continuous and dynamic environments, with consecutive actions displaying a high correlation. Therefore, the capture of short-term memory is very important and usually cannot be disregarded.
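The hybrid structure described above can be sketched as a bounded buffer of recent observations plus an embedding-indexed long-term store. This is a minimal sketch under stated assumptions, not the design of any cited system: `embed` is a toy bag-of-words stand-in for a real embedding model, the long-term store is a plain Python list rather than a vector database, and the class and method names are hypothetical.

```python
import math
from collections import deque


def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words embedding; an assumption standing in for a real model."""
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class HybridMemory:
    def __init__(self, window: int = 3):
        # Short-term memory: the last `window` observations, written into the prompt.
        self.short_term: deque[str] = deque(maxlen=window)
        # Long-term memory: every observation, stored with its embedding.
        self.long_term: list[tuple[dict[str, float], str]] = []

    def observe(self, text: str) -> None:
        self.short_term.append(text)
        self.long_term.append((embed(text), text))

    def recall(self, query: str, k: int = 2) -> list[str]:
        """Return the k long-term memories most similar to the query."""
        q = embed(query)
        ranked = sorted(self.long_term, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]

    def build_context(self, query: str, k: int = 2) -> str:
        """Combine retrieved long-term memories with the short-term buffer."""
        recalled = [t for t in self.recall(query, k) if t not in self.short_term]
        return "\n".join(["Relevant memories:", *recalled, "Recent events:", *self.short_term])


mem = HybridMemory(window=2)
for event in [
    "Alice bought bread at the bakery",
    "Bob repaired the bicycle",
    "It started to rain",
    "Alice went home",
]:
    mem.observe(event)
print(mem.build_context("bakery bread", k=1))
```

In a real agent the string returned by `build_context` would be placed in the LLM prompt, and the list would be replaced by an approximate nearest-neighbor index once the number of memories grows.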
Memory Formats: In addition to the memory structure, another perspective to analyze the memory module is based on the formats of the memory storage medium, for example, natural language memory or embedding memory. Different memory formats possess distinct strengths and are suitable for various applications. In the following, we introduce several representative memory formats.
• Natural Languages. In this format, memory information such as the agent behaviors and observations is directly described using raw natural language. This format possesses several strengths. Firstly, the memory information can be expressed in a flexible and understandable manner. Moreover, it retains rich semantic information that can provide comprehensive signals to guide agent behaviors. In the previous work, Reflexion [139] stores experiential feedback in natural language within a sliding window. Voyager [148] employs natural language descriptions to represent skills within the Minecraft game, which are directly stored in memory.
• Embeddings. In this format, memory information is encoded into embedding vectors, which can enhance the memory retrieval and reading efficiency. For instance, MemoryBank [179] encodes each memory segment into an embedding vector, which creates an indexed corpus for retrieval. GITM [184] represents reference plans as embeddings to facilitate matching and reuse. Furthermore, ChatDev [124] encodes dialogue history into vectors for retrieval.
• Databases. In this format, memory information is stored in databases, allowing the agent to manipulate memories efficiently and comprehensively. For example, ChatDB [67] uses a database as a symbolic memory module. The agent can utilize SQL statements to precisely add, delete, and revise the memory information. In DB-GPT [182], the memory module is constructed based on a database. To operate the memory information more intuitively, the agents are fine-tuned to understand and execute SQL queries, enabling them to interact with databases using natural language directly.
• Structured Lists. In this format, memory information is organized into lists, and the semantics of memory can be conveyed in an efficient and concise manner. For instance, GITM [184] stores action lists for sub-goals in a hierarchical tree structure. The hierarchical structure explicitly captures the relationships between goals and corresponding plans. RET-LLM [114] initially converts natural language sentences into triplet phrases, and subsequently stores them in memory.
Remark. Here we only show several representative memory formats, but it is important to note that there are many others not covered here, such as the programming code used by [148]. Moreover, it should be emphasized that these formats are not mutually exclusive; many models incorporate multiple formats to concurrently harness their respective benefits. A notable example is the memory module of GITM [184], which utilizes a key-value list structure. In this structure, the keys are represented by embedding vectors, while the values consist of raw natural languages. The use of embedding vectors allows for efficient retrieval of memory records. By utilizing natural languages, the memory contents become highly comprehensive, enabling more informed agent actions.
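To make the database format concrete, the following sketch uses Python's standard `sqlite3` module as a symbolic memory that supports adding, revising, and deleting records through SQL. The table layout and the helper names (`add`, `revise`, `delete`, `lookup`) are assumptions made for illustration; they are not the schema or interface of ChatDB or DB-GPT, where the SQL statements would be produced by the LLM rather than written by hand.

```python
import sqlite3

# In-memory database acting as the agent's symbolic memory (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (id INTEGER PRIMARY KEY, subject TEXT, fact TEXT)")


def add(subject: str, fact: str) -> int:
    """Insert a new memory record and return its id."""
    cur = conn.execute("INSERT INTO memory (subject, fact) VALUES (?, ?)", (subject, fact))
    return cur.lastrowid


def revise(mem_id: int, fact: str) -> None:
    """Overwrite the content of an existing record."""
    conn.execute("UPDATE memory SET fact = ? WHERE id = ?", (fact, mem_id))


def delete(mem_id: int) -> None:
    """Remove a record."""
    conn.execute("DELETE FROM memory WHERE id = ?", (mem_id,))


def lookup(subject: str) -> list[str]:
    """Return all facts stored for a subject, oldest first."""
    rows = conn.execute("SELECT fact FROM memory WHERE subject = ? ORDER BY id", (subject,))
    return [row[0] for row in rows]


apples = add("inventory", "3 apples")
sword = add("inventory", "1 sword")
revise(apples, "2 apples")   # the agent ate an apple
delete(sword)                # the agent dropped the sword
facts = lookup("inventory")
print(facts)
```

Compared with free-text memory, each operation here touches exactly one record, which is the precision the database format is meant to provide.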
Above, we mainly discuss the internal designs of the memory module. In the following, we turn our focus to memory operations, which are used to interact with external environments.

Memory Operations:
The memory module plays a critical role in allowing the agent to acquire, accumulate, and utilize significant knowledge by interacting with the environment. The interaction between the agent and the environment is accomplished through three crucial memory operations: memory reading, memory writing, and memory reflection. In the following, we introduce these operations in more detail.
• Memory Reading. The objective of memory reading is to extract meaningful information from memory to enhance the agent's actions, for example, by using previously successful actions to achieve similar goals [184]. The key to memory reading lies in how to extract valuable information. Usually, there are three commonly used criteria for information extraction, that is, recency, relevance, and importance [121]. Memories that are more recent, relevant, and important are more likely to be extracted. Formally, we conclude the following equation from existing literature for memory information extraction:
$$m^* = \arg\max_{m \in M} \; \alpha\, s^{rec}(q, m) + \beta\, s^{rel}(q, m) + \gamma\, s^{imp}(m),$$
where q is the query, for example, the task that the agent should address or the context in which the agent is situated. M is the set of all memories. $s^{rec}(\cdot)$, $s^{rel}(\cdot)$ and $s^{imp}(\cdot)$ are the scoring functions for measuring the recency, relevance, and importance of the memory m. These scoring functions can be implemented using various methods; for example, $s^{rel}(q, m)$ can be realized based on LSH, ANNOY, HNSW, FAISS and so on. It should be noted that $s^{imp}$ only reflects the characteristics of the memory itself, and is thus unrelated to the query q. α, β and γ are balancing parameters. By assigning them different values, one can obtain various memory reading strategies. For example, by setting α = γ = 0, many studies [114,184,148,54] only consider the relevance score $s^{rel}$ for memory reading. By assigning α = β = γ = 1.0, [121] equally weights all three metrics to extract information from the memory.
• Memory Writing. The purpose of memory writing is to store information about the perceived environment in memory. Storing valuable information in memory provides a foundation for retrieving informative memories in the future, enabling the agent to act more efficiently and rationally. During the memory writing process, there are two potential problems that should be carefully addressed. On one hand, it is crucial to address how to store information that is similar to existing memories (i.e., memory duplication). On the other hand, it is important to consider how to remove information when the memory reaches its storage limit (i.e., memory overflow). In the following, we discuss these problems in more detail. (1) Memory Duplication. To incorporate similar information, people have developed various methods for integrating new and previous records. For instance, in [120], the successful action sequences related to the same sub-goal are stored in a list. Once the size of the list reaches N(=5), all the sequences in it are condensed into a unified plan solution using LLMs. The original sequences in the memory are replaced with the newly generated one. Augmented LLM [135] aggregates duplicate information via count accumulation, avoiding redundant storage. (2) Memory Overflow. In order to write information into the memory when it is full, people design different methods to delete existing information to continue the memorizing process. For example, in ChatDB [67], memories can be explicitly deleted based on user commands. RET-LLM [114] uses a fixed-size buffer for memory, overwriting the oldest entries in a first-in-first-out (FIFO) manner.
• Memory Reflection. Memory reflection emulates humans' ability to witness and evaluate their own cognitive, emotional, and behavioral processes. When adapted to agents, the objective is to provide agents with the capability to independently summarize and infer more abstract, complex and high-level information. More specifically, in Generative Agent [121], the agent has the capability to summarize its past experiences stored in memory into broader and more abstract insights. To begin with, the agent generates three key questions based on its recent memories. Then, these questions are used to query the memory to obtain relevant information. Building upon the acquired information, the agent generates five insights, which reflect the agent's high-level ideas. For example, the low-level memories "Klaus Mueller is writing a research paper", "Klaus Mueller is engaging with a librarian to further his research", and "Klaus Mueller is conversing with Ayesha Khan about his research" can induce the high-level insight "Klaus Mueller is dedicated to his research". In addition, the reflection process can occur hierarchically, meaning that the insights can be generated based on existing insights. In GITM [184], the actions that successfully accomplish the sub-goals are stored in a list. When the list contains more than five elements, the agent summarizes them into a common and abstract pattern and replaces all the elements. In ExpeL [177], two approaches are introduced for the agent to acquire reflection. Firstly, the agent compares successful or failed trajectories within the same task. Secondly, the agent learns from a collection of successful trajectories to gain experiences.
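The reading and writing operations above can be illustrated with a small sketch. It scores each memory by a weighted sum of recency, relevance, and importance and returns the highest-scoring one; on writing, it accumulates a count for exact duplicates and evicts the oldest record when the store is full. Everything concrete here is an assumption made for illustration: the exponential recency decay, the word-overlap relevance score (a stand-in for embedding search), the caller-supplied importance value, and the class and function names.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    step: int          # time step of the most recent write
    importance: float  # assumed to lie in [0, 1] and to be supplied by the caller
    count: int = 1     # number of times this exact text was written


def s_rec(now: int, m: Record, decay: float = 0.9) -> float:
    """Recency: decays exponentially with the age of the memory (assumed form)."""
    return decay ** (now - m.step)


def s_rel(q: str, m: Record) -> float:
    """Relevance: Jaccard word overlap, a toy stand-in for embedding similarity."""
    a, b = set(q.lower().split()), set(m.text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


class Memory:
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.records: "OrderedDict[str, Record]" = OrderedDict()  # oldest first

    def write(self, text: str, step: int, importance: float) -> None:
        if text in self.records:
            # Duplicate: accumulate a count instead of storing a second copy.
            rec = self.records[text]
            rec.count += 1
            rec.step = step
            return
        if len(self.records) >= self.capacity:
            # Overflow: evict the record that was inserted first (FIFO).
            self.records.popitem(last=False)
        self.records[text] = Record(text, step, importance)

    def read(self, q: str, now: int, alpha=1.0, beta=1.0, gamma=1.0):
        """Return the record maximizing alpha*s_rec + beta*s_rel + gamma*importance."""
        if not self.records:
            return None

        def score(m: Record) -> float:
            return alpha * s_rec(now, m) + beta * s_rel(q, m) + gamma * m.importance

        return max(self.records.values(), key=score)


mem = Memory(capacity=2)
mem.write("found iron ore in the cave", step=1, importance=0.2)
mem.write("found iron ore in the cave", step=2, importance=0.2)  # duplicate
duplicate_count = mem.records["found iron ore in the cave"].count
mem.write("crafted a stone pickaxe", step=3, importance=0.5)
mem.write("zombie attacked at night", step=4, importance=0.9)    # evicts the oldest

relevance_only = mem.read("how to craft a pickaxe", now=5, alpha=0.0, gamma=0.0)
equal_weights = mem.read("how to craft a pickaxe", now=5)
print(relevance_only.text, "|", equal_weights.text)
```

With relevance only, the query retrieves the pickaxe memory; with equal weights, the more recent and more important zombie memory wins, which shows how the balancing parameters change the reading strategy.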
A significant distinction between traditional LLMs and the agents is that the latter must possess the capability to learn and complete tasks in dynamic environments. If we consider the memory module as responsible for managing the agents' past behaviors, it becomes essential to have another significant module that can assist the agents in planning their future actions. In the following, we present an overview of how researchers design the planning module.

Planning Module
When faced with a complex task, humans tend to deconstruct it into simpler subtasks and solve them individually. The planning module aims to empower the agents with such human capability, which is expected to make the agent behave more reasonably, powerfully, and reliably. Specifically, we summarize existing studies based on whether the agent can receive feedback in the planning process, which are detailed as follows:
Planning without Feedback: In this method, the agents do not receive feedback that can influence their future behaviors after taking actions. In the following, we present several representative strategies.
• Single-path Reasoning. In this strategy, the final task is decomposed into several intermediate steps. These steps are connected in a cascading manner, with each step leading to only one subsequent step. LLMs follow these steps to achieve the final goal. Specifically, Chain of Thought (CoT) [155] proposes inputting reasoning steps for solving complex problems into the prompt. These steps serve as examples to inspire LLMs to plan and act in a step-by-step manner. In this method, the plans are created based on the inspiration from the examples in the prompts. Zero-shot-CoT [82] enables LLMs to generate task reasoning processes by prompting them with trigger sentences like "think step by step". Unlike CoT, this method does not incorporate reasoning steps as examples in the prompts.

Re-Prompting [128] involves checking whether each step meets the necessary prerequisites before generating a plan. If a step fails to meet the prerequisites, it introduces a prerequisite error message and prompts the LLM to regenerate the plan. ReWOO [164] introduces a paradigm of separating plans from external observations, where the agents first generate plans and obtain observations independently, and then combine them together to derive the final results. HuggingGPT [138] first decomposes the task into many sub-goals, and then solves each of them based on Huggingface. Unlike CoT and Zero-shot-CoT, which output all the reasoning steps in a single pass, ReWOO and HuggingGPT produce their results by querying LLMs multiple times.
Figure 3 (diagram not recovered): reasoning-step trees with nodes Step-1 to Step-3, labeled ToT, LMZSP, RAP.
• Multi-path Reasoning. In this strategy, the reasoning steps for generating the final plans are organized into a tree-like structure. Each intermediate step may have multiple subsequent steps. This approach is analogous to human thinking, as individuals may have multiple choices at each reasoning step. Specifically, Self-consistent CoT (CoT-SC) [151] assumes that each complex problem has multiple ways of thinking that lead to the final answer. Thus, it starts by employing CoT to generate various reasoning paths and corresponding answers. Subsequently, the answer with the highest frequency is chosen as the final output. Tree of Thoughts (ToT) [169] is designed to generate plans using a tree-like reasoning structure. In this approach, each node in the tree represents a "thought," which corresponds to an intermediate reasoning step. The selection of these intermediate steps is based on the evaluation of LLMs. The final plan is generated using either the breadth-first search (BFS) or depth-first search (DFS) strategy. Compared with CoT-SC, which generates all the planned steps together, ToT needs to query LLMs for each reasoning step. In RecMind [152], the authors designed a self-inspiring mechanism, where the discarded historical information in the planning process is also leveraged to derive new reasoning steps. In GoT [8], the authors expand the tree-like reasoning structure in ToT to graph structures, resulting in more powerful prompting strategies. In AoT [137], the authors design a novel method to enhance the reasoning processes of LLMs by incorporating algorithmic examples into the prompts. Notably, this method needs to query LLMs only once or a few times. In [70], the LLMs are leveraged as zero-shot planners. At each planning step, they first generate multiple possible next steps, and then determine the final one based on their distances to admissible actions. [58] further improves [70] by incorporating examples that are similar to the queries in the prompts. RAP [62] builds a world model to simulate the potential benefits of different plans based on Monte Carlo Tree Search (MCTS), and then the final plan is generated by aggregating multiple MCTS iterations. To enhance comprehension, we provide an illustration comparing the strategies of single-path and multi-path reasoning in Figure 3.
• External Planner. Despite the demonstrated power of LLMs in zero-shot planning, effectively generating plans for domain-specific problems remains highly challenging. To address this challenge, researchers turn to external planners. These tools are well-developed and employ efficient search algorithms to rapidly identify correct, or even optimal, plans. Specifically, LLM+P [100] first transforms the task descriptions into formal Planning Domain Definition Languages (PDDL), and then it uses an external planner to deal with the PDDL. Finally, the generated results are transformed back into natural language by LLMs. Similarly, LLM-DP [26] utilizes LLMs to convert the observations, the current world state, and the target objectives into PDDL. Subsequently, this transformed data is passed to an external planner, which efficiently determines the final action sequence. CO-LLM [176] demonstrates that LLMs are good at generating high-level plans, but struggle with low-level control. To address this limitation, a heuristically designed external low-level planner is employed to effectively execute actions based on high-level plans.
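Two of the multi-path strategies above, majority voting over sampled reasoning paths (as in CoT-SC) and breadth-first search over a tree of thoughts (as in ToT), can be sketched as follows. This is a simplified illustration and not the published algorithms: the LLM calls are replaced by deterministic stand-ins (`fake_sampler`, `propose`, `evaluate`), all function names are hypothetical, and the toy task of reaching 10 from 1 using "+3" or "*2" is invented for the example.

```python
from collections import Counter


def self_consistency(sample_answer, question, n=5):
    """Sample n reasoning paths and return the most frequent final answer."""
    answers = [sample_answer(question, i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]


def tree_search_bfs(root, propose, evaluate, is_goal, beam=2, depth=3):
    """Level-by-level search: expand kept states, score children, keep the best `beam`."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam]
        for state in frontier:
            if is_goal(state):
                return state
    return None


# Stand-in for sampling an LLM several times with chain-of-thought prompting.
def fake_sampler(question, i):
    return ["18", "18", "20", "18", "17"][i]


# Toy task: a state is (value, path); each "thought" applies +3 or *2.
def propose(state):
    value, path = state
    return [(value + 3, path + ("+3",)), (value * 2, path + ("*2",))]


def evaluate(state):
    # In ToT this score would come from an LLM; here it is distance to the target.
    return -abs(10 - state[0])


def is_goal(state):
    return state[0] == 10


voted = self_consistency(fake_sampler, "toy question")
plan = tree_search_bfs((1, ()), propose, evaluate, is_goal)
print(voted, plan)
```

The contrast drawn in the text is visible in the call pattern: `self_consistency` samples complete answers independently, while `tree_search_bfs` calls the proposer and the evaluator at every intermediate step.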
Planning with Feedback: In many real-world scenarios, the agents need to perform long-horizon planning to solve complex tasks. When facing these tasks, the above planning modules without feedback can be less effective for the following reasons. Firstly, generating a flawless plan directly from the beginning is extremely difficult, as it needs to consider various complex preconditions. As a result, simply following the initial plan often leads to failure. Moreover, the execution of the plan may be hindered by unpredictable transition dynamics, rendering the initial plan non-executable. At the same time, when examining how humans tackle complex tasks, we find that individuals may iteratively make and revise their plans based on external feedback. To simulate such human capability, researchers have designed many planning modules, where the agent can receive feedback after taking actions. The feedback can be obtained from the environments, humans, and models, which are detailed in the following.
• Environmental Feedback. This feedback is obtained from the objective world or virtual environment. For instance, it could be the game's task completion signals or the observations made after the agent takes an action. Specifically, ReAct [170] proposes constructing prompts using thought-act-observation triplets. The thought component aims to facilitate high-level reasoning and planning for guiding agent behaviors. The act represents a specific action taken by the agent. The observation corresponds to the outcome of the action, acquired through external feedback, such as search engine results. The next thought is influenced by the previous observations, which makes the generated plans more adaptive to the environment. Voyager [148] makes plans by incorporating three types of environment feedback: the intermediate progress of program execution, the execution errors, and the self-verification results. These signals can help the agent to make better plans for the next action. Similar to Voyager, GITM [184] also incorporates feedback into the reasoning and action-taking processes. This feedback encompasses the environment states as well as the success and failure information for each executed action. SayPlan [129] leverages environmental feedback derived from a scene graph simulator to validate and refine its plans. This simulator determines the outcomes and state transitions that follow agent actions, allowing SayPlan to iteratively revise its plans until a viable one is found. In DEPS [154], the authors argue that solely providing information about the completion of a task is often inadequate for correcting planning errors. Therefore, they propose informing the agent about the detailed reasons for task failure, allowing it to revise its plans more effectively. LLM-Planner [141] introduces a grounded re-planning algorithm that dynamically updates plans generated by LLMs when encountering object mismatches and unattainable plans during task completion. Inner Monologue [71] provides three types of feedback to the agent after it takes actions: (1) whether the task is successfully completed, (2) passive scene descriptions, and (3) active scene descriptions. The former two are generated from the environments, which makes the agent actions more practical and reasonable.
• Human Feedback. In addition to obtaining feedback from the environment, directly interacting with humans is also a very intuitive strategy to enhance the agent planning capability. Human feedback is a subjective signal. It can effectively align the agent with human values and preferences, and also help to alleviate the hallucination problem. In Inner Monologue [71], the agent aims to perform high-level natural language instructions in a 3D visual environment. It is given the capability to actively solicit feedback from humans regarding scene descriptions. Then, the agent incorporates the human feedback into its prompts, enabling more informed planning and reasoning. As the above cases show, different types of feedback can be combined to enhance the agent planning capability. For example, Inner Monologue [71] collects both environment and human feedback to facilitate the agent plans.
• Model Feedback. Apart from the aforementioned environmental and human feedback, which are external signals, researchers have also investigated the utilization of internal feedback from the agents themselves. This type of feedback is usually generated based on pre-trained models. Specifically, [107] proposes a self-refine mechanism. This mechanism consists of three crucial components: output, feedback, and refinement. First, the agent generates an output. Then, it utilizes LLMs to provide feedback on the output and offer guidance on how to refine it. Finally, the output is improved according to the feedback and refinement guidance. This output-feedback-refinement process iterates until some desired condition is reached. SelfCheck [112] allows agents to examine and evaluate their reasoning steps generated at various stages. They can then correct any errors by comparing the outcomes. InterAct [20] uses different language models (such as ChatGPT and InstructGPT) in auxiliary roles, such as checkers and sorters, to help the main language model avoid erroneous and inefficient actions. ChatCoT [22] utilizes model feedback to improve the quality of its reasoning process. The model feedback is generated by an evaluation module that monitors the agent reasoning steps. Reflexion [139] is developed to enhance the agent's planning capability through detailed verbal feedback. In this model, the agent first produces an action based on its memory, and then the evaluator generates feedback by taking the agent trajectory as input. In contrast to previous studies, where the feedback is given as a scalar value, this model leverages LLMs to provide more detailed verbal feedback, which can provide more comprehensive support for the agent plans.
Remark. In conclusion, the implementation of the planning module without feedback is relatively straightforward. However, it is primarily suitable for simple tasks that only require a small number of reasoning steps. Conversely, the strategy of planning with feedback needs more careful designs to handle the feedback. Nevertheless, it is considerably more powerful and capable of effectively addressing complex tasks that involve long-range reasoning.
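The output-feedback-refinement loop described above for the self-refine mechanism [107] can be sketched as a short control loop. This is a minimal sketch under stated assumptions: the three callables are hypothetical stand-ins for LLM calls, the `max_rounds` cap is an added safeguard not taken from the source, and the toy functions only exist to make the loop runnable.

```python
from typing import Callable, Optional


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Iterate output -> feedback -> refinement until the critic has no complaint."""
    output = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, output)
        if feedback is None:  # stopping condition: nothing left to fix
            break
        output = refine(task, output, feedback)
    return output


# Toy stand-ins for the three LLM roles (hypothetical, for illustration only).
def toy_generate(task: str) -> str:
    return task  # first draft: the words in their given order


def toy_critique(task: str, output: str) -> Optional[str]:
    words = output.split()
    return None if words == sorted(words) else "words are not in alphabetical order"


def toy_refine(task: str, output: str, feedback: str) -> str:
    return " ".join(sorted(output.split()))
```

The same skeleton covers environmental and human feedback as well: only the source of the `critique` signal changes (an environment observation or a human reply instead of a model-generated comment).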

Action Module
The action module is responsible for translating the agent's decisions into specific outcomes. This module is located at the most downstream position and directly interacts with the environment. It is influenced by the profile, memory, and planning modules. This section introduces the action module from four perspectives: (1) Action goal: what are the intended outcomes of the actions? (2) Action production: how are the actions generated? (3) Action space: what are the available actions? (4) Action impact: what are the consequences of the actions? Among these perspectives, the first two focus on the aspects preceding the action ("before-action" aspects), the third focuses on the action itself ("in-action" aspect), and the fourth emphasizes the impact of the actions ("after-action" aspect).
Action Goal: The agent can perform actions with various objectives. Here, we present several representative examples: (1) Task Completion. In this scenario, the agent's actions are aimed at accomplishing specific tasks, such as crafting an iron pickaxe in Minecraft [148] or completing a function in software development [124]. These actions usually have well-defined objectives, and each action contributes to the completion of the final task. Actions aimed at this type of goal are very common in the existing literature. (2) Communication. In this case, the actions are taken to communicate with other agents or real humans for sharing information or collaborating. For example, the agents in ChatDev [124] may communicate with each other to collectively accomplish software development tasks. In Inner Monologue [71], the agent actively engages in communication with humans and adjusts its action strategies based on human feedback. (3) Environment Exploration. In this example, the agent aims to explore unfamiliar environments to expand its perception and strike a balance between exploring and exploiting. For instance, the agent in Voyager [148] may explore unknown skills in its task completion process, and continually refine the skill execution code based on environment feedback through trial and error.
Action Production: Different from ordinary LLMs, where the model input and output are directly associated, the agent may take actions via different strategies and sources. In the following, we introduce two types of commonly used action production strategies. (1) Action via Memory Recollection. In this strategy, the action is generated by extracting information from the agent memory according to the current task. The task and the extracted memories are used as prompts to trigger the agent actions. For example, in Generative Agents [121], the agent maintains a memory stream, and before taking each action, it retrieves recent, relevant and important information from the memory stream to guide its actions. In GITM [184], in order to achieve a low-level sub-goal, the agent queries its memory to determine if there are any successful experiences related to the task. If similar tasks have been completed previously, the agent invokes the previously successful actions to handle the current task directly. In collaborative agents such as ChatDev [124] and MetaGPT [64], different agents may communicate with each other. In this process, the conversation history in a dialog is stored in the agent memories. Each utterance generated by the agent is influenced by its memory. (2) Action via Plan Following. In this strategy, the agent takes actions following its pre-generated plans. For instance, in DEPS [154], for a given task, the agent first makes action plans. If there are no signals indicating plan failure, the agent will strictly adhere to these plans. In GITM [184], the agent makes high-level plans by decomposing the task into many sub-goals. Based on these plans, the agent takes actions to solve each sub-goal sequentially to complete the final task.
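Action via memory recollection, as described above for a memory stream that is queried for recent, relevant and important entries, can be sketched as a scoring function over stored memories. This is a minimal sketch, not the implementation of [121]: the equal weighting of the three terms, the exponential `decay` factor, the precomputed `importance` value, and the word-overlap `relevance` function (a crude stand-in for embedding similarity) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    text: str
    created_at: float  # time step at which the memory was written
    importance: float  # in [0, 1]; assumed to be assigned when the memory is stored


def relevance(query: str, text: str) -> float:
    """Jaccard word overlap; a crude stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(memories: List[Memory], query: str, now: float,
             k: int = 2, decay: float = 0.9) -> List[Memory]:
    """Return the k memories with the highest recency + importance + relevance score."""
    def score(m: Memory) -> float:
        recency = decay ** (now - m.created_at)  # older memories fade exponentially
        return recency + m.importance + relevance(query, m.text)
    return sorted(memories, key=score, reverse=True)[:k]


def build_prompt(task: str, memories: List[Memory]) -> str:
    """The task plus the retrieved memories form the prompt that triggers the action."""
    lines = [f"- {m.text}" for m in memories]
    return "Relevant memories:\n" + "\n".join(lines) + f"\nTask: {task}"
```

With a memory stream holding an old, unimportant breakfast memory and two recent cafe-related ones, a query about meeting at the cafe retrieves the two cafe memories, and `build_prompt` places them ahead of the task text.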
Action Space: Action space refers to the set of possible actions that can be performed by the agent. In general, we can roughly divide these actions into two classes: (1) external tools and (2) internal knowledge of the LLMs. In the following, we introduce these actions in more detail.
• External Tools. While LLMs have been demonstrated to be effective in accomplishing a large number of tasks, they may not work well in domains that need comprehensive expert knowledge. In addition, LLMs may also encounter hallucination problems, which are hard for them to resolve by themselves. To alleviate the above problems, the agents are empowered with the capability to call external tools for executing actions. In the following, we present several representative tools which have been exploited in the literature.
(1) APIs. Leveraging external APIs to complement and expand the action space has been a popular paradigm in recent years. For example, HuggingGPT [138] leverages the models on HuggingFace to accomplish complex user tasks. [115,130] propose to automatically generate queries to extract relevant content from external web pages when responding to user requests. TPTU [130] interfaces with both Python interpreters and LaTeX compilers to execute sophisticated computations such as square roots, factorials and matrix operations. Another type of API is one that can be directly invoked by LLMs based on natural language or code inputs. For instance, Gorilla [123] is a fine-tuned LLM designed to generate accurate input arguments for API calls and mitigate the issue of hallucination during external API invocations. ToolFormer [133] is an LLM-based tool transformation system that can automatically convert a given tool into another one with different functionalities or formats based on natural language instructions. API-Bank [90] is an LLM-based API recommendation agent that can automatically search and generate appropriate API calls for various programming languages and domains. API-Bank also provides an interactive interface for users to easily modify and execute the generated API calls. ToolBench [126] is an LLM-based tool generation system that can automatically design and implement various practical tools based on natural language requirements. The tools generated by ToolBench include calculators, unit converters, calendars, maps, charts, etc. RestGPT [142] connects LLMs with RESTful APIs, which follow widely accepted standards for web services development, making the resulting program more compatible with real-world applications. TaskMatrix.AI [93] connects LLMs with millions of APIs to support task execution. At its core lies a multimodal conversational foundation model that interacts with users, understands their goals and context, and then produces executable code for particular tasks. All these agents utilize external APIs as their external tools, and provide interactive interfaces for users to easily modify and execute the generated or transformed tools.
(2) Databases & Knowledge Bases. Connecting to external databases or knowledge bases can help the agents to obtain domain-specific information for generating more realistic actions. For example, ChatDB [67] employs SQL statements to query databases, facilitating actions by the agents in a logical manner. MRKL [80] and OpenAGI [56] incorporate various expert systems such as knowledge bases and planners to access domain-specific information.
(3) External Models. Previous studies often utilize external models to expand the range of possible actions. In comparison to APIs, external models typically handle more complex tasks. Each external model may correspond to multiple APIs. For example, to enhance the text retrieval capability, MemoryBank [179] incorporates two language models: one is designed to encode the input text, while the other is responsible for matching the query statements. ViperGPT [144] first uses Codex, which is built on a language model, to generate Python code from text descriptions, and then executes the code to complete the given tasks. TPTU [130] incorporates various LLMs to accomplish a wide range of language generation tasks such as generating code, producing lyrics, and more. ChemCrow [10] is an LLM-based chemical agent designed to perform tasks in organic synthesis, drug discovery, and material design. It utilizes seventeen expert-designed models to assist its operations. MM-REACT [167] integrates various external models, such as X-decoder for image generation, VideoBERT for video summarization, and SpeechBERT for audio processing, enhancing its capability in diverse multimodal scenarios.
• Internal Knowledge. In addition to utilizing external tools, many agents rely solely on the internal knowledge of LLMs to guide their actions. We now present several crucial capabilities of LLMs that can support the agent to behave reasonably and effectively. (1) Planning Capability. Previous work has demonstrated that LLMs can be used as decent planners to decompose complex tasks into simpler ones [155]. Such capability of LLMs can even be triggered without incorporating examples in the prompts [82]. Based on the planning capability of LLMs, DEPS [154] develops a Minecraft agent, which can solve complex tasks via sub-goal decomposition. Similar agents like GITM [184] and Voyager [148] also heavily rely on the planning capability of LLMs to successfully complete different tasks. (2) Conversation Capability. LLMs can usually generate high-quality conversations. This capability enables the agent to behave more like humans. In previous work, many agents take actions based on the strong conversation capability of LLMs. For example, in ChatDev [124], different agents can discuss the software development process, and can even reflect on their own behaviors. In RLP [54], the agent can communicate with the listeners based on their potential feedback on the agent's utterance. (3) Common Sense Understanding Capability. Another important capability of LLMs is that they can comprehend human common sense well. Based on this capability, many agents can simulate human daily life and make human-like decisions. For example, in Generative Agents [121], the agent can accurately understand its current state and the surrounding environment, and summarize high-level ideas based on basic observations. Without the common sense understanding capability of LLMs, these behaviors cannot be reliably simulated. Similar conclusions may also apply to RecAgent [149] and S3 [55], where the agents aim to simulate user recommendation and social behaviors.
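The external-tool side of the action space described above can be sketched as a small tool registry: the agent emits a structured action naming a tool and an input, and a dispatcher executes it and returns the result as an observation. This is a minimal sketch under stated assumptions, not the interface of any cited system: the action format (`{"tool": ..., "input": ...}`), the tool names, and the demo table are hypothetical. The SQL tool uses Python's standard `sqlite3` module to illustrate the database-querying style of action, and the `sqrt` tool illustrates delegating computation to an interpreter.

```python
import math
import sqlite3
from typing import Callable, Dict

Tool = Callable[[str], str]


def make_sql_tool(conn: sqlite3.Connection) -> Tool:
    """Wrap a database connection as a tool that runs one SQL statement."""
    def run(query: str) -> str:
        return str(conn.execute(query).fetchall())
    return run


def sqrt_tool(arg: str) -> str:
    """Delegate a numeric computation instead of asking the LLM to guess it."""
    return str(math.sqrt(float(arg)))


def dispatch(action: dict, tools: Dict[str, Tool]) -> str:
    """Execute a structured action; errors are returned as text so the agent can react."""
    name = action.get("tool")
    if name not in tools:
        return f"error: unknown tool {name!r}"
    try:
        return tools[name](action["input"])
    except Exception as exc:  # surface the failure as an observation
        return f"error: {exc}"


def build_demo_tools() -> Dict[str, Tool]:
    """Create an in-memory inventory table (hypothetical data) and register the tools."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", [("wood", 3), ("iron", 1)])
    return {"sql": make_sql_tool(conn), "sqrt": sqrt_tool}
```

Returning errors as strings rather than raising them is a design choice for this sketch: it lets a failed tool call feed back into the agent's next prompt in the same way as a successful observation.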
Action Impact: Action impact refers to the consequences of the action. In fact, the action impact can encompass numerous instances, but for brevity, we only provide a few examples. (1) Changing Environments. Agents can directly alter environment states by actions, such as moving their positions, collecting items, constructing buildings, etc. For instance, in GITM [184] and Voyager [148], the environments are changed by the actions of the agents in their task completion process. For example, if the agent mines three blocks of wood, then they may disappear from the environment. (2) Altering Internal States. Actions taken by the agent can also change the agent itself, including updating memories, forming new plans, acquiring novel knowledge, and more. For example, in Generative Agents [121], memory streams are updated after performing actions within the system. SayCan [2] enables agents to take actions to update their understanding of the environment. (3) Triggering New Actions. In the task completion process, one agent action can be triggered by another one. For example, Voyager [148] constructs buildings once it has gathered all the necessary resources. DEPS [154] decomposes plans into sequential sub-goals, with each sub-goal potentially triggering the next one.

Agent Capability Acquisition
In the above sections, we mainly focus on how to design the agent architecture to better elicit the capabilities of LLMs, so that agents can accomplish tasks as humans do. The architecture functions as the "hardware" of the agent. However, relying solely on the hardware is insufficient for achieving effective task performance. This is because the agent may lack the necessary task-specific capabilities, skills and experiences, which can be regarded as "software" resources. In order to equip the agent with these resources, various strategies have been devised. Generally, we categorize these strategies into two classes based on whether they require fine-tuning of the LLMs. In the following, we introduce each of them in more detail.
Capability Acquisition with Fine-tuning: A straightforward method to enhance the agent capability for task completion is fine-tuning the agent based on task-dependent datasets. Generally, the datasets can be constructed based on human annotation, generated by LLMs, or collected from real-world applications. In the following, we introduce these methods in more detail.
• Fine-tuning with Human Annotated Datasets. To fine-tune the agent, utilizing human annotated datasets is a versatile approach that can be employed in various application scenarios. In this approach, researchers first design annotation tasks and then recruit workers to complete them. For example, in CoH [101], the authors aim to align LLMs with human values and preferences. Different from the other models, where the human feedback is leveraged in a simple and symbolic manner, this method converts the human feedback into detailed comparison information in the form of natural language. The LLMs are directly fine-tuned based on these natural language datasets. In RET-LLM [114], in order to better convert natural language into structured memory information, the authors fine-tune LLMs based on a human constructed dataset, where each sample is a "triplet-natural language" pair. In WebShop [168], the authors collect 1.18 million real-world products from amazon.com, and put them onto a simulated e-commerce website, which contains several carefully designed human shopping scenarios. Based on this website, the authors recruit 13 workers to collect a real-human behavior dataset. Finally, three methods based on heuristic rules, imitation learning and reinforcement learning are trained based on this dataset. Although the authors do not fine-tune LLM-based agents, we believe that the dataset proposed in this paper holds immense potential to enhance the capabilities of agents in the field of web shopping. In EduChat [27], the authors aim to enhance the educational functions of LLMs, such as open-domain question answering, essay assessment, Socratic teaching, and emotional support. They fine-tune LLMs based on human annotated datasets that cover various educational scenarios and tasks. These datasets are manually evaluated and curated by psychology experts and frontline teachers. SWIFTSAGE [97] is an agent influenced by the dual-process theory of human cognition [51], which is effective for solving complex interactive reasoning tasks. In this agent, the SWIFT module is a compact encoder-decoder language model, which is fine-tuned using human-annotated datasets.

Table 1: Summary of the construction strategies of representative agents (more agents can be seen on https://github.com/Paitesanshi/LLM-Agent-Survey). For the profile module, we use ①, ② and ③ to represent the handcrafting method, LLM-generation method, and dataset alignment method, respectively. For the memory module, we focus on the implementation strategies for memory operation and memory structure. For memory operation, we use ① and ② to indicate that the model only has read/write operations and has read/write/reflection operations, respectively. For memory structure, we use ① and ② to represent unified and hybrid memories, respectively. For the planning module, we use ① and ② to represent planning w/o feedback and w/ feedback, respectively. For the action module, we use ① and ② to represent that the model does not use tools and uses tools, respectively. For the agent capability acquisition (CA) strategy, we use ① and ② to represent the methods with and without fine-tuning, respectively. "-" indicates that the corresponding content is not explicitly discussed in the paper.

[Figure 4 (image not recovered). Panel titles: "The era of machine learning", "The era of large language model", "The era of agent". Labels: "Parameter Learning", "Prompt Engineering", "Mechanism Engineering".]

• Fine-tuning with LLM Generated Datasets. Building human annotated datasets requires recruiting people, which can be costly, especially when one needs to annotate a large number of samples. Considering that LLMs can achieve human-like capabilities in a wide range of tasks, a natural idea is using LLMs to accomplish the annotation task. While the datasets produced by this method may not be as good as the human annotated ones, the method is much cheaper and can be leveraged to generate more samples. For example, in ToolBench [126], to enhance the tool-using capability of open-source LLMs, the authors collect 16,464 real-world APIs spanning 49 categories from the RapidAPI Hub. They use these APIs to prompt ChatGPT to generate diverse instructions, covering both single-tool and multi-tool scenarios. Based on the obtained dataset, the authors fine-tune LLaMA [146], and obtain significant performance improvement in terms of tool use. In [102], to empower the agent with social capability, the authors design a sandbox, and deploy multiple agents to interact with each other. Given a social question, the central agent first generates initial responses. Then, it shares the responses with its nearby agents to collect their feedback. Based on the feedback as well as its detailed explanations, the central agent revises its initial responses to make them more consistent with social norms. In this process, the authors collect a large amount of agent social interaction data, which is then leveraged to fine-tune the LLMs.
• Fine-tuning with Real-world Datasets. In addition to building datasets based on human or LLM annotation, directly using real-world datasets to fine-tune the agent is also a common strategy. For example, in MIND2WEB [30], the authors collect a large amount of real-world data to enhance the agent capability in the web domain. In contrast to prior studies, the dataset presented in this paper encompasses diverse tasks, real-world scenarios, and comprehensive user interaction patterns. Specifically, the authors collect over 2,000 open-ended tasks from 137 real-world websites spanning 31 domains. Using this dataset, the authors fine-tune LLMs to enhance their performance on web-related tasks, including movie discovery and ticket booking, among others. In SQL-PALM [143], researchers fine-tune PaLM-2 based on a cross-domain large-scale text-to-SQL dataset called Spider. The obtained model can achieve significant performance improvement on text-to-SQL tasks.
Capability Acquisition without Fine-tuning: In the era of traditional machine learning, the model capability is mainly acquired by learning from datasets, where the knowledge is encoded into the model parameters. In the era of LLMs, the model capability can be acquired either by training/fine-tuning the model parameters or by designing delicate prompts (i.e., prompt engineering). In prompt engineering, one needs to write valuable information into the prompts to enhance the model capability or unleash existing LLM capabilities. In the era of agents, the model capability can be acquired based on three strategies: (1) model fine-tuning, (2) prompt engineering and (3) designing proper agent evolution mechanisms (which we call mechanism engineering). Mechanism engineering is a broad concept that involves developing specialized modules, introducing novel working rules, and other strategies to enhance agent capabilities. To clarify these transitions in the strategy of model capability acquisition, we illustrate them in Figure 4. In the above section, we have detailed the strategy of fine-tuning. In the following, we introduce prompt engineering and mechanism engineering for agent capability acquisition.
• Prompt Engineering. Owing to the strong language comprehension capabilities of LLMs, people can directly interact with them using natural language. This introduces a novel strategy for enhancing agent capabilities: one can describe the desired capability in natural language and then use the description as a prompt to influence LLM actions. For example, in CoT [155], in order to empower the agent with the capability for complex task reasoning, the authors present the intermediate reasoning steps as few-shot examples in the prompt. Similar techniques are also used in CoT-SC [151] and ToT [169]. In SocialAGI [54], in order to enhance the agent self-awareness capability in conversation, the authors prompt LLMs with the agent beliefs about the mental states of the listeners and itself, which makes the generated utterance more engaging and adaptive. In addition, the authors also incorporate the target mental states of the listeners, which enables the agents to make more strategic plans. Retroformer [171] presents a retrospective model that enables the agent to generate reflections on its past failures. The reflections are integrated into the prompt of LLMs to guide the agent's future actions. Additionally, this model utilizes reinforcement learning to iteratively improve the retrospective model, thereby refining the LLM prompt.
• Mechanism Engineering. Different from model fine-tuning and prompt engineering, mechanism engineering is a unique strategy to enhance the agent capability. In the following, we present several representative methods for mechanism engineering.
(1) Trial-and-error. In this method, the agent first performs an action, and subsequently, a pre-defined critic is invoked to judge the action. If the action is deemed unsatisfactory, then the agent reacts by incorporating the critic's feedback. In RAH [140], the agent serves as a user assistant in recommender systems. One of the agent's crucial roles is to simulate human behavior and generate responses on behalf of the user. To fulfill this objective, the agent first generates a predicted response and then compares it with the real human feedback. If the predicted response and the real human feedback differ, the critic generates failure information, which is subsequently incorporated into the agent's next action. In DEPS [154], the agent first designs a plan to accomplish a given task. In the plan execution process, if an action fails, the explainer generates specific details explaining the cause of the failure. This information is then incorporated by the agent to redesign the plan. In RoCo [108], the agent first proposes a sub-task plan and a path of 3D waypoints for each robot in a multi-robot collaboration task. The plan and waypoints are then validated by a set of environment checks, such as collision detection and inverse kinematics. If any of the checks fail, the feedback is appended to each agent's prompt and another round of dialog begins. The agents use LLMs to discuss and improve their plan and waypoints until they pass all validations. In PREFER [173], the agent first evaluates its performance on a subset of data. If it fails to solve certain examples, LLMs are leveraged to generate feedback information reflecting on the reasons for the failure. Based on this feedback, the agent improves itself by iteratively refining its actions.
(2) Crowd-sourcing. In [35], the authors design a debating mechanism that leverages the wisdom of crowds to enhance the capabilities of the agent. To begin with, different agents provide separate responses to a given question. If their responses are not consistent, they will be prompted to incorporate the solutions from other agents and provide an updated response. This iterative process continues until reaching a final consensus answer. In this method, the capability of each agent is enhanced by understanding and incorporating the other agents' opinions.
(3) Experience Accumulation. In GITM [184], the agent does not know how to solve a task in the beginning. Then, it explores, and once it has successfully accomplished a task, the actions used in this task are stored into the agent memory. In the future, if the agent encounters a similar task, then the relevant memories are extracted to complete the current task. In this process, the improved agent capability comes from the specially designed memory accumulation and utilization mechanisms. In Voyager [148], the authors equip the agent with a skill library, and each skill in the library is represented by executable code. In the agent-environment interaction process, the code for each skill will be iteratively refined according to the environment feedback and the agent self-verification results. After a period of execution, the agent can complete different tasks efficiently by accessing the skill library. In MemPrompt [106], the users are requested to provide feedback in natural language regarding the problem-solving intentions of the agent, and this feedback is stored in memory. When the agent encounters similar tasks, it attempts to retrieve related memories to generate more suitable responses.
(4) Self-driven Evolution. In LMA3 [24], the agent can autonomously set goals for itself, and gradually improve its capability by exploring the environment and receiving feedback from a reward function. Following this mechanism, the agent can acquire knowledge and develop capabilities according to its own preferences. In SALLM-MS [116], by integrating advanced large language models like GPT-4 into a multi-agent system, agents can adapt and perform complex tasks, showcasing advanced communication capabilities, thereby realizing self-driven evolution in their interactions with the environment. In CLMTWA [132], by using a large language model as a teacher and a weaker language model as a student, the teacher can generate and communicate natural language explanations to improve the student's reasoning skills via theory of mind. The teacher can also personalize its explanations for the student and intervene only when necessary, based on the expected utility of intervention. In NLSOM [185], different agents communicate and collaborate through natural language to solve tasks that a single agent cannot solve. This can be seen as a form of self-driven learning, utilizing the exchange of information and knowledge between multiple agents. However, unlike other models such as LMA3, SALLM-MS, and CLMTWA, NLSOM allows for dynamic adjustment of agent goals, roles, tasks, and relationships based on the task requirements and the feedback from other agents or the environment.
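Among the mechanisms listed above, the debating procedure described under crowd-sourcing lends itself to a compact sketch: agents answer independently, and while their answers disagree, each agent is shown the other agents' answers and asked to respond again. This is a minimal sketch under stated assumptions, not the implementation of [35]: the `Agent` signature, the `max_rounds` cap, and the majority-vote fallback when no consensus is reached are additions for illustration, and `make_toy_agent` is a hypothetical stand-in for an LLM-backed agent.

```python
from collections import Counter
from typing import Callable, List

# An agent maps (question, other agents' latest answers) to its own answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], max_rounds: int = 3) -> str:
    """Run rounds of answer exchange until all agents agree or rounds run out."""
    answers = [agent(question, []) for agent in agents]  # independent first answers
    for _ in range(max_rounds):
        if len(set(answers)) == 1:  # consensus reached
            break
        # Each agent sees every answer except its own and may update.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Assumed fallback: majority vote if consensus was not reached in time.
    return Counter(answers).most_common(1)[0][0]


def make_toy_agent(initial: str) -> Agent:
    """Toy agent: keeps its own answer unless a strict majority of peers disagree."""
    def agent(question: str, others: List[str]) -> str:
        if not others:
            return initial
        top, count = Counter(others).most_common(1)[0]
        return top if count > len(others) / 2 else initial
    return agent
```

With two toy agents that start at "12" and one that starts at "15", the dissenting agent sees a unanimous peer opinion in the first exchange round and switches, so the debate ends in consensus on "12".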
Remark. Upon comparing the aforementioned strategies for agent capability acquisition, we can find that the fine-tuning method improves the agent capability by adjusting model parameters, which can incorporate a large amount of task-specific knowledge, but is only suitable for open-source LLMs. The method without fine-tuning usually enhances the agent capability based on delicate prompting strategies or mechanism engineering. These methods can be used for both open- and closed-source LLMs. However, due to the limitation of the input context window of LLMs, they cannot incorporate too much task information. In addition, the design spaces of the prompts and mechanisms are extremely large, which makes it difficult to find optimal solutions.
In the above sections, we have detailed the construction of LLM-based agents, focusing on two aspects: architecture design and capability acquisition. We present the correspondence between existing work and the above taxonomy in Table 1. It should be noted that, for the sake of completeness, we have also incorporated several studies that do not explicitly mention LLM-based agents but are highly related to this area.

LLM-based Autonomous Agent Application
Owing to their strong language comprehension, complex task reasoning, and common sense understanding capabilities, LLM-based autonomous agents have shown significant potential to influence multiple domains. This section provides a succinct summary of previous studies, categorizing them according to their applications in three distinct areas: social science, natural science, and engineering (see the left part of Figure 5 for a global overview).

Social Science
Social science is one of the branches of science, devoted to the study of societies and the relationships among individuals within those societies ‡. LLM-based autonomous agents can promote this domain by leveraging their impressive human-like understanding, thinking and task solving capabilities. In the following, we discuss several key areas that can be affected by LLM-based autonomous agents.
Psychology: In the domain of psychology, LLM-based agents can be leveraged for conducting simulation experiments, providing mental health support, and so on [1,3,105,187]. For example, in [1], the authors assign LLMs different profiles and let them complete psychology experiments. From the results, the authors find that LLMs can produce outcomes consistent with those of real-human studies. In addition, larger models can usually provide more faithful simulation results than smaller ones. An interesting discovery is that, in many experiments, models like ChatGPT and GPT-4 can provide estimates that are too perfect (called "hyper-accuracy distortion"), which may influence downstream applications. In [105], the authors systematically analyze the effectiveness of LLM-based conversation agents for mental well-being support. They collect 120 posts from Reddit, and find that such agents can help users cope with anxieties, social isolation and depression on demand. At the same time, they also find that the agents may sometimes produce harmful content.
Political Science and Economy: LLM-based agents can also be utilized to study political science and economy [5,187,65]. In [5], LLM-based agents are utilized for ideology detection and predicting voting patterns. In [187], the authors focus on understanding the discourse structure and persuasive elements of political speech with the assistance of LLM-based agents. In [65], LLM-based agents are provided with specific traits such as talents, preferences, and personalities to explore human economic behaviors in simulated scenarios.
Social Simulation: Previously, conducting experiments with human societies was often expensive, unethical, or even infeasible. With the rapid progress of LLMs, many researchers have begun to build virtual environments with LLM-based agents to simulate social phenomena, such as the propagation of harmful information [122,91,86,121,99,83,55,156]. For example, Social Simulacra [122] simulates an online social community and explores the potential of utilizing agent-based simulations to help decision-makers improve community regulations. [91,86] investigate the potential impacts of different behavioral characteristics of LLM-based agents in social networks. Generative Agents [121] and AgentSims [99] construct multiple agents in a virtual town to simulate human daily life. SocialAI School [83] employs LLM-based agents to simulate and investigate the fundamental social cognitive skills during the course of child development. S³ [55] builds a social network simulator, focusing on the propagation of information, emotion and attitude. CGMI [75] is a framework for multi-agent simulation. It maintains the personality of the agents through a tree structure and constructs a cognitive model. The authors use CGMI to simulate a classroom scenario.
Jurisprudence: LLM-based agents can serve as aids in legal decision-making processes, facilitating more informed judgements [25,61]. Blind Judgement [61] employs several language models to simulate the decision-making processes of multiple judges. It gathers diverse opinions and consolidates the outcomes through a voting mechanism. ChatLaw [25] is a prominent Chinese legal model based on LLMs. It supports both database and keyword search strategies to alleviate the hallucination problem. In addition, this model employs a self-attention mechanism to enhance the LLM's capability by mitigating the impact of reference inaccuracies.
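The voting step described above for Blind Judgement can be illustrated with a minimal sketch. The function name `consolidate_votes`, the rule of leaving ties unresolved, and the hard-coded judge outputs are assumptions made for illustration; the text above does not specify these details, and in practice each vote would be produced by a separate language model.

```python
from collections import Counter
from typing import Optional, Sequence


def consolidate_votes(votes: Sequence[str]) -> Optional[str]:
    """Return the majority decision, or None when the top choices tie."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    # Assumption: a tie between the two most frequent decisions stays unresolved.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical decisions from three simulated judges.
judge_votes = ["affirm", "reverse", "affirm"]
decision = consolidate_votes(judge_votes)
print(decision)
```

Returning `None` on a tie keeps the sketch from silently picking a side; a real system would need an explicit tie-breaking policy.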
Research Assistant: In addition to specific domains, LLM-based agents can also be used as general social science research assistants [6,187]. In [187], LLM-based agents are used to assist researchers in various tasks, such as generating article abstracts, extracting keywords, and creating scripts. In [6], LLM-based agents serve as writing assistants that can identify novel research questions for social scientists.

Natural Science
Natural science is one of the branches of science concerned with the description, understanding and prediction of natural phenomena, based on empirical evidence from observation and experimentation §. With the rapid progress of LLMs, the application of LLM-based agents in natural sciences is becoming increasingly popular. In the following, we present several representative areas where LLM-based agents can play important roles.
Documentation and Data Management: Natural scientific research often involves the collection, organization, and synthesis of substantial amounts of literature, which requires a significant dedication of time and human resources. LLM-based agents have shown strong capabilities in language understanding and in employing tools such as the internet and databases for text processing. These capabilities empower the agent to excel in tasks related to documentation and data management [9,79,10]. In [9], the agent can efficiently query and utilize internet information to complete tasks such as question answering and experiment planning. ChatMOF [79] utilizes LLMs to extract important information from text descriptions written by humans. It then formulates a plan to apply relevant tools for predicting the properties and structures of metal-organic frameworks. ChemCrow [10] utilizes chemistry-related databases to both validate the precision of compound representations and identify potentially dangerous substances. This functionality enhances the reliability and comprehensiveness of scientific inquiries by ensuring the accuracy of the data involved.
Experiment Assistant: LLM-based agents have the ability to independently conduct experiments, making them valuable tools for supporting scientists in their research projects [9,10]. For instance, [9] introduces an innovative agent system that utilizes LLMs for automating the design, planning, and execution of scientific experiments. This system, when provided with the experimental objectives as input, accesses the Internet and retrieves relevant documents to gather the necessary information. It subsequently utilizes Python code to conduct essential calculations and carry out the following experiments. ChemCrow [10] incorporates 17 carefully developed tools that are specifically designed to assist researchers in their chemical research. Once the input objectives are received, ChemCrow provides valuable recommendations for experimental procedures, while also emphasizing any potential safety risks associated with the proposed experiments.
Natural Science Education: LLM-based agents can communicate with humans fluently, and are therefore often utilized to develop agent-based educational tools [9,145,34,19]. For example, [9] develops agent-based education systems to facilitate students' learning of experimental design, methodologies, and analysis. The objective of these systems is to enhance students' critical thinking and problem-solving skills, while also fostering a deeper comprehension of scientific principles. Math Agents [145] can assist researchers in exploring, discovering, solving and proving mathematical problems. Additionally, it can communicate with humans and aid them in understanding and using mathematics. [34] utilizes the capabilities of Codex [19] to automatically solve and explain university-level mathematical problems, which can be used as an education tool to teach students and researchers. CodeHelp [95] is an education agent for programming. It offers many useful features, such as setting course-specific keywords, monitoring student queries, and providing feedback to the system. EduChat [27] is an LLM-based agent designed specifically for the education domain. It provides personalized, equitable, and empathetic educational support to teachers, students, and parents through dialogue. Furthermore, by utilizing a diverse range of system prompts, it effectively addresses the issue of hallucination and seamlessly adapts to actual educational scenarios. FreeText [109] is an agent that utilizes LLMs to automatically assess students' responses to open-ended questions and offer feedback.

Engineering
LLM-based autonomous agents have shown great potential in assisting and enhancing engineering research and applications. In this section, we review and summarize the applications of LLM-based agents in several major engineering domains.
Civil Engineering: In civil engineering, LLM-based agents can be used to design and optimize complex structures such as buildings, bridges, dams, and roads. For example, [110] proposes an interactive framework where human architects and agents collaborate to construct structures in a 3D simulation environment. The interactive agent can understand natural language instructions, place blocks, detect confusion, seek clarification, and incorporate human feedback, showing the potential for human-AI collaboration in engineering design.
Computer Science & Software Engineering: In the field of computer science and software engineering, LLM-based agents offer potential for automating coding, testing, debugging, and documentation generation [126,124,64,33,37,48,45]. ChatDev [124] proposes an end-to-end framework, where multiple agent roles communicate and collaborate through natural language conversations to complete the software development life cycle. This framework demonstrates efficient and cost-effective generation of executable software systems. ToolBench [126] can be used for tasks such as code auto-completion and code recommendation. MetaGPT [64] abstracts multiple roles, such as product managers, architects, project managers, and engineers, to supervise the code generation process and enhance the quality of the final output code. This enables low-cost software development. [33]
Table 2: Representative applications of LLM-based autonomous agents.
Industrial Automation: In the field of industrial automation, LLM-based agents can be used to achieve intelligent planning and control of production processes. [161] proposes a novel framework that integrates large language models (LLMs) with digital twin systems to accommodate flexible production needs. The framework leverages prompt engineering techniques to create LLM agents that can adapt to specific tasks based on the information provided by digital twins. These agents can coordinate a series of atomic functionalities and skills to complete production tasks at different levels within the automation pyramid. This research demonstrates the potential of integrating LLMs into industrial automation systems, providing innovative solutions for more agile, flexible and adaptive production processes. IELLM [119] presents a comprehensive case study on LLMs' effectiveness in addressing challenges in the oil and gas industry. It focuses on various applications, including rock physics modeling, acoustic reflectometry, and coiled tubing control.
Robotics & Embodied Artificial Intelligence: Recent works have developed more efficient reinforcement learning agents for robotics and embodied artificial intelligence [28,181,118,160,148,184,66,159,174,32,2]. The focus is on enhancing autonomous agents' abilities for planning, reasoning, and collaboration in embodied environments. Specifically, [28] proposes a unified agent system for embodied reasoning and task planning. In this system, the authors design high-level commands to enable improved planning and propose low-level controllers to translate commands into actions. Additionally, one can leverage dialogues to gather information [181] to accelerate the optimization process. [118,160] employ autonomous agents for embodied decision-making and exploration. To overcome physical constraints, the agents can generate executable plans and accomplish long-term tasks by leveraging multiple skills. In terms of control policies, SayCan [2] focuses on investigating a wide range of manipulation and navigation skills utilizing a mobile manipulator robot. Taking inspiration from typical tasks encountered in a kitchen environment, it presents a comprehensive set of 551 skills that cover seven skill families and 17 objects. These skills encompass various actions such as picking, placing, pouring, grasping, and manipulating objects, among others. TidyBot [157] is an embodied agent designed to personalize household cleanup tasks. It can learn users' preferences on object placement and manipulation methods through textual examples.
To promote the application of LLM-based autonomous agents, researchers have also introduced many open-source libraries, based on which developers can quickly implement and evaluate agents according to their customized requirements [49,47,42,44,39,40,46,16,36,43,38,125,52,45,41,50,158]. For example, LangChain [16] is an open-source framework for building applications powered by language models, which can support tasks such as coding, testing, debugging, and documentation generation. By integrating language models with data sources and facilitating interaction with the environment, LangChain enables efficient and cost-effective development of such applications. Based on LangChain, XLang [40] comes with a comprehensive set of tools and a complete user interface, and supports three different agent scenarios, namely data processing, plugin usage, and web agent. AutoGPT [49] is a fully automated agent. It sets one or more goals, breaks them down into corresponding tasks, and cycles through the tasks until the goal is achieved. WorkGPT [36] is an agent framework similar to AutoGPT and LangChain. Given an instruction and a set of APIs, it engages in back-and-forth conversations with the AI until the instruction is completed. GPT-Engineer [37], SmolModels [48] and DemoGPT [45] are open-source projects that focus on automating code generation through prompts to complete development tasks. AGiXT [44] is a dynamic AI automation platform designed to orchestrate efficient AI command management and task execution across many providers. AgentVerse [39] is a versatile framework that helps researchers create customized LLM-based agent simulations efficiently. GPT Researcher [38] is an experimental application that leverages large language models to efficiently develop research questions, trigger web crawls to gather information, summarize sources, and aggregate summaries. BMTools [125] is an open-source repository that extends LLMs with tools and provides a platform for community-driven tool building and sharing. It supports various types of tools, enables simultaneous task execution using multiple tools, and offers a simple interface for loading plugins via URLs, fostering easy development and contribution to the BMTools ecosystem.
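The goal-driven loop attributed to AutoGPT above (set a goal, break it into tasks, cycle through the tasks until done) can be sketched in a few lines. This is not AutoGPT's actual API: `run_agent`, `decompose`, `execute`, and the retry-by-requeue policy are hypothetical names and choices made for illustration, and the toy functions stand in for what would be LLM calls.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def run_agent(
    goal: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], bool],
    max_steps: int = 20,
) -> List[str]:
    """Cycle through the tasks derived from a goal until none remain."""
    pending: Deque[str] = deque(decompose(goal))
    completed: List[str] = []
    steps = 0
    while pending and steps < max_steps:
        task = pending.popleft()
        steps += 1
        if execute(task):
            completed.append(task)
        else:
            # Assumption: failed tasks are re-queued rather than dropped.
            pending.append(task)
    return completed


# Toy stand-ins for the LLM calls (hypothetical, not a real library API).
def toy_decompose(goal: str) -> List[str]:
    return [f"{goal}: step {i}" for i in range(1, 4)]


attempts: Dict[str, int] = {}


def toy_execute(task: str) -> bool:
    # Every task fails once and succeeds on its second attempt.
    attempts[task] = attempts.get(task, 0) + 1
    return attempts[task] >= 2


done = run_agent("write report", toy_decompose, toy_execute)
print(done)
```

The `max_steps` bound is a deliberate design choice in this sketch: without it, a task that never succeeds would keep the loop running indefinitely.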
Remark. The utilization of LLM-based agents in supporting the above applications may also entail certain risks and challenges. On one hand, LLMs themselves may be susceptible to hallucinations and other issues, occasionally providing erroneous answers that lead to incorrect conclusions, experimental failures, or even risks to human safety in hazardous experiments. Therefore, during experimentation, users must possess the necessary expertise and knowledge to exercise appropriate caution. On the other hand, LLM-based agents could potentially be exploited for malicious purposes, such as the development of chemical weapons, necessitating the implementation of security measures, such as human alignment, to ensure responsible and ethical use.
In summary, in the above sections, we introduce the typical applications of LLM-based autonomous agents in three important domains. For clarity, we summarize the correspondence between the previous work and their applications in Table 2.

LLM-based Autonomous Agent Evaluation
Similar to LLMs themselves, evaluating the effectiveness of LLM-based autonomous agents is a challenging task. This section introduces two commonly used evaluation strategies, that is, subjective and objective evaluation (see the right part of Figure 5 for an overview).

Subjective Evaluation
Subjective evaluation measures the agent capabilities based on human judgements [85,122,121,5,176]. It is suitable for scenarios where there are no evaluation datasets or where it is very hard to design quantitative metrics, for example, evaluating the agent's intelligence or user-friendliness. In the following, we present two commonly used strategies for subjective evaluation.
Human Annotation: In this method, human evaluators directly score or rank the results produced by different agents [187,5,176]. For example, in [121], the authors employ many annotators, and ask them to provide feedback on five key questions that are directly associated with the agent capability. In [84], the authors evaluate the model effectiveness by asking humans to score the model on harmlessness, honesty, helpfulness, engagement and unbiasedness, and then compare the results from different models. In [122], the authors ask the annotators to answer whether their designed model can effectively contribute to improving the rule design for online communities.
Turing Test: In this method, human evaluators are required to distinguish between outcomes generated by the agents and real humans. If, in a given task, the evaluators cannot separate the agent and human results, it demonstrates that the agent can achieve human-like performance on this task. In [5], the authors conduct experiments on free-form partisan text, and the human evaluators are asked to guess whether the responses are from a human or an LLM-based agent. In [121], the human evaluators are required to identify whether the behaviors are generated by the agents or by real humans. In [68], the authors conduct a study in which they gather human annotations on the emotional states of both LLM software and human subjects in different situations. They utilize these annotations as a baseline to assess the emotional robustness of the LLM software.
Remark. LLM-based agents are usually designed to serve humans. Thus, subjective agent evaluation plays a critical role, since it reflects human criteria. However, this strategy also faces issues such as high costs, inefficiency, and population bias. To solve these problems, many researchers have explored leveraging LLMs as proxies to conduct subjective evaluation. For example, in ChemCrow [10], researchers assess the experimental results using GPT. They consider both the completion of tasks and the accuracy of the underlying processes. ChatEval [13] employs multiple agents to assess the outcomes produced by candidate models in a debating manner. We believe that with the progress of LLMs, such evaluation methods can become more credible and be applied more widely.
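To make the Turing-test criterion above concrete, the following minimal sketch computes how often evaluators identify the source correctly and whether that rate differs from chance. The exact binomial test against a 50% guessing rate is an assumption introduced here for illustration, not a protocol taken from the cited studies; the function names and the toy labels are likewise hypothetical.

```python
from math import comb
from typing import Sequence


def identification_accuracy(guesses: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of items whose source (agent or human) was guessed correctly."""
    if len(guesses) != len(truths) or not truths:
        raise ValueError("guesses and truths must be non-empty and equally long")
    correct = sum(g == t for g, t in zip(guesses, truths))
    return correct / len(truths)


def binomial_two_sided_p(correct: int, total: int) -> float:
    """Exact two-sided p-value for H0: evaluators guess at chance (p = 0.5)."""
    observed = comb(total, correct)
    # Under p = 0.5 every outcome k has probability C(total, k) / 2**total,
    # so we sum all outcomes that are no more likely than the observed one.
    tail = sum(comb(total, k) for k in range(total + 1) if comb(total, k) <= observed)
    return tail / 2 ** total


# Toy data: true sources and one evaluator's guesses (hypothetical).
truths = ["agent", "human", "agent", "human"]
guesses = ["agent", "agent", "human", "human"]
accuracy = identification_accuracy(guesses, truths)
print(accuracy, binomial_two_sided_p(2, 4))
```

An accuracy close to 0.5 with a large p-value is consistent with evaluators being unable to separate agent and human outputs, although with few samples the test has little power.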

Objective Evaluation
Objective evaluation refers to assessing the capabilities of LLM-based autonomous agents using quantitative metrics that can be computed, compared and tracked over time. In contrast to subjective evaluation, objective metrics aim to provide concrete, measurable insights into the agent performance. For conducting objective evaluation, there are three important aspects, that is, the evaluation metrics, protocols and benchmarks. In the following, we introduce these aspects in more detail.

Metrics: In order to objectively evaluate the effectiveness of the agents, designing proper metrics is important, since they influence the accuracy and comprehensiveness of the evaluation. Ideal evaluation metrics should precisely reflect the quality of the agents, and align with human perceptions when using them in real-world scenarios. In existing work, we can identify the following representative evaluation metrics. (1) Task success metrics: These metrics measure how well an agent can complete tasks and achieve goals. Common metrics include success rate [176,170,139,100], [...] [17,148], dialogue similarities [122,1], and mimicry of human responses [1,5]. Higher similarity suggests better human simulation performance. (3) Efficiency metrics: In contrast to the aforementioned metrics used to evaluate the agent effectiveness, these metrics assess the agent efficiency. Typical metrics include planning length [100], development cost [124], inference speed [184,148], and number of clarification dialogues [110].
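As a small illustration of how two of the metrics named above could be computed, the sketch below derives a success rate and an average planning length from a list of evaluation episodes. The `Episode` record, its field names, and the choice to average planning length over successful episodes only are assumptions made for this example, not definitions from the cited works.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool     # whether the agent completed the task
    plan_length: int  # number of steps in the agent's plan


def success_rate(episodes: Sequence[Episode]) -> float:
    """Task success metric: fraction of episodes completed successfully."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    return sum(e.success for e in episodes) / len(episodes)


def mean_plan_length(episodes: Sequence[Episode]) -> float:
    """Efficiency metric: average plan length over successful episodes."""
    lengths = [e.plan_length for e in episodes if e.success]
    if not lengths:
        raise ValueError("no successful episodes")
    return sum(lengths) / len(lengths)


# Hypothetical evaluation log.
log = [Episode(True, 4), Episode(False, 9), Episode(True, 6), Episode(True, 5)]
print(success_rate(log), mean_plan_length(log))
```

Restricting the efficiency metric to successful episodes avoids rewarding an agent for short plans that fail; other conventions are equally defensible.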
Protocols: In addition to the evaluation metrics, another important aspect for objective evaluation is how to leverage these metrics. In the previous work, we can identify the following commonly used evaluation protocols: (1) Real-world simulation: In this method, the agents are evaluated within immersive environments like games and interactive simulators. The agents are required to perform tasks autonomously, and then metrics like task success rate and human similarity are leveraged to evaluate the capability of the agents based on their trajectories and completed objectives [17,176,184,170,148,110,154,94,168,175]. This method is expected to evaluate the agents' practical capabilities in real-world scenarios. (2) [...]
Benchmarks: Given the metrics and protocols, a crucial remaining aspect is the selection of an appropriate benchmark for conducting the evaluation. In the past, people have used various benchmarks in their experiments. For example, many researchers use simulation environments like ALFWorld [170], IGLU [110], and Minecraft [184,148,154] as benchmarks to evaluate the agent capabilities. Tachikuma [94] is a benchmark that leverages TRPG game logs to evaluate LLMs' ability to understand and infer complex interactions with multiple characters and novel objects. AgentBench [103] provides a comprehensive framework for evaluating LLMs as autonomous agents across diverse environments. It represents the first systematic assessment of LLMs as agents on real-world challenges across diverse domains. SocKET [23] is a comprehensive benchmark for evaluating the social capabilities of LLMs across 58 tasks covering five categories of social information such as humor and sarcasm, emotions and feelings, credibility, etc. AgentSims [99] is a versatile framework for evaluating LLM-based agents, where one can flexibly design the agent planning, memory and action strategies, and measure the effectiveness of different agent modules in interactive environments. ToolBench [125] is an open-source project that aims to support the development of powerful LLMs with general tool-use capability. It provides an open platform for training, serving, and evaluating LLMs based on tool learning. WebShop [168] develops a benchmark for evaluating LLM-based agents in terms of their capabilities for product search and retrieval. The benchmark is constructed using a collection of 1.18 million real-world items. Mobile-Env [175] is an extendable interactive platform which can be used to evaluate the multi-step interaction capabilities of LLM-based agents. WebArena [180] offers a comprehensive website environment that spans multiple domains. Its purpose is to evaluate agents in an end-to-end fashion and determine the accuracy of their completed tasks. GentBench [163] is a benchmark designed to evaluate the agent capabilities, including their reasoning, safety, and efficiency, when utilizing tools to complete complex tasks. RocoBench [108] is a benchmark with six tasks evaluating multi-agent collaboration across diverse scenarios, emphasizing communication and coordination strategies to assess adaptability and generalization in cooperative robotics. EmotionBench [68] evaluates the emotion appraisal ability of LLMs, i.e., how their feelings change when presented with specific situations. It collects over 400 situations that elicit eight negative emotions and measures the emotional states of LLMs and human subjects using self-report scales. PEB [29] is a benchmark tailored for assessing LLM-based agents in penetration testing scenarios, comprising 13 diverse targets from leading platforms. It offers a structured evaluation across varying difficulty levels, reflecting real-world challenges for agents. ClemBench [12] contains five Dialogue Games to assess LLMs' ability as a player. E2E [7] is an end-to-end benchmark for testing the accuracy and usefulness of chatbots.
Remark. Objective evaluation allows for the quantitative assessment of LLM-based agent capabilities using diverse metrics. While current techniques cannot perfectly measure all types of agent capabilities, objective evaluation provides essential insights that complement subjective assessment. The ongoing progress in objective evaluation benchmarks and methodology will further advance the development and understanding of LLM-based autonomous agents.
In the above sections, we introduce both subjective and objective strategies for evaluating LLM-based autonomous agents. The evaluation of the agents plays a significant role in this domain. However, both subjective and objective evaluation have their own strengths and weaknesses. In practice, they should probably be combined to comprehensively evaluate the agents. We summarize the correspondence between the previous work and these evaluation strategies in Table 3.

Related Surveys
With the vigorous development of large language models, numerous comprehensive surveys have emerged, providing detailed insights into various aspects. [178] extensively introduces the background, main findings, and mainstream technologies of LLMs, encompassing a vast array of existing works. On the other hand, [166] primarily focuses on the applications of LLMs in various downstream tasks and the challenges associated with their deployment. Aligning LLMs with human intelligence is an active area of research to address concerns such as biases and hallucinations. [153] compiles existing techniques for human alignment, including data collection and model training methodologies.
Reasoning is a crucial aspect of intelligence, influencing decision-making, problem-solving, and other cognitive abilities. [69] presents the current state of research on LLMs' reasoning abilities, exploring approaches to improve and evaluate their reasoning skills. [111] proposes that language models can be enhanced with reasoning capabilities and the ability to utilize tools, termed Augmented Language Models (ALMs), and conducts a comprehensive review of the latest advancements in ALMs. As the utilization of large-scale models becomes more prevalent, evaluating their performance is increasingly critical. [15] sheds light on evaluating LLMs, addressing what to evaluate, where to evaluate, and how to assess their performance in downstream tasks and societal impact. [14] also discusses the capabilities and limitations of LLMs in various downstream tasks. The aforementioned research encompasses various aspects of large models, including training, application, and evaluation. However, prior to this paper, no work has specifically focused on the rapidly emerging and highly promising field of LLM-based agents. In this study, we have compiled 100 relevant works on LLM-based agents, covering their construction, applications, and evaluation processes.

Challenges
While previous work on LLM-based autonomous agents has achieved many remarkable successes, this field is still at its initial stage, and there are several significant challenges that need to be addressed in its development. In the following, we present several representative challenges.

Role-playing Capability
Different from traditional LLMs, autonomous agents usually have to play specific roles (e.g., program coder, researcher and chemist) to accomplish different tasks. Thus, the capability of the agent for role-playing is very important. Although LLMs can effectively simulate many common roles such as movie reviewers, there are still various roles and aspects that they struggle to capture accurately. To begin with, LLMs are usually trained on web corpora; thus, for roles that are seldom discussed on the web or that are newly emerging, LLMs may not simulate them well. In addition, previous research [54] has shown that existing LLMs may not model human cognitive psychology characteristics well, leading to a lack of self-awareness in conversation scenarios. Potential solutions to these problems include fine-tuning LLMs or carefully designing the agent prompts/architectures [87]. For example, one can first collect real-human data for uncommon roles or psychology characteristics, and then leverage it to fine-tune LLMs. However, how to ensure that the fine-tuned model still performs well for the common roles may pose further challenges. Beyond fine-tuning, one can also design tailored agent prompts/architectures to enhance the capability of LLMs on role-playing. However, finding the optimal prompts/architectures is not easy, since their design spaces are too large.

Generalized Human Alignment
Human alignment has been discussed a lot for traditional LLMs. In the field of LLM-based autonomous agents, especially when the agents are leveraged for simulation, we believe this concept should be discussed in more depth. In order to better serve human beings, traditional LLMs are usually fine-tuned to be aligned with correct human values; for example, the agent should not plan to make a bomb for avenging society. However, when the agents are leveraged for real-world simulation, an ideal simulator should be able to honestly depict diverse human traits, including the ones with incorrect values. Actually, simulating the negative aspects of humans can be even more important, since an important goal of simulation is to discover and solve problems, and without negative aspects there is no problem to be solved. For example, to simulate real-world society, we may have to allow the agent to plan for making a bomb, and observe how it will act to implement the plan as well as the influence of its behaviors. Based on these observations, people can take better actions to stop similar behaviors in real-world society. Inspired by the above case, an important problem for agent-based simulation may be how to conduct generalized human alignment, that is, for different purposes and applications, the agent should be able to align with diverse human values. However, existing powerful LLMs including ChatGPT and GPT-4 are mostly aligned with unified human values. Thus, an interesting direction is how to "realign" these models by designing proper prompting strategies.

Prompt Robustness
To ensure rational behavior in agents, designers often incorporate additional modules, such as memory and planning modules, into LLMs. However, the inclusion of these modules necessitates the development of more prompts in order to facilitate consistent operation and effective communication. Previous research [186,57] has highlighted the lack of robustness in prompts for LLMs, as even minor alterations can yield substantially different outcomes. This issue becomes more pronounced when constructing autonomous agents, as they encompass not a single prompt but a prompt framework that considers all modules, wherein the prompt for one module has the potential to influence others. Moreover, the prompt frameworks can vary significantly across different LLMs. Developing a unified and robust prompt framework that can be applied to various LLMs is an important yet unresolved issue. There are two potential solutions to the aforementioned problems: (1) manually crafting the essential prompt elements through trial and error, or (2) automatically generating prompts using GPT.
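One simple way to observe the robustness problem described above is to feed a model several paraphrases of the same prompt and measure how often the outputs agree. The sketch below does this under stated assumptions: `consistency_rate` is a hypothetical helper, agreement is measured by exact match after lowercasing, and `toy_model` is a deliberately brittle stand-in for a real LLM call.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(model: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Fraction of prompt variants whose output matches the most common output."""
    if not prompts:
        raise ValueError("at least one prompt variant is required")
    outputs = [model(p).strip().lower() for p in prompts]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def toy_model(prompt: str) -> str:
    # Brittle stand-in: the answer flips when the prompt contains "please".
    return "no" if "please" in prompt.lower() else "yes"


variants = [
    "Is 7 prime?",
    "Is 7 a prime number?",
    "Please tell me: is 7 prime?",
]
rate = consistency_rate(toy_model, variants)
print(rate)
```

A rate well below 1.0 signals that semantically equivalent prompts lead to different behavior; for free-form outputs, exact matching would need to be replaced by a task-specific equivalence check.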

Hallucination
Hallucination poses a fundamental challenge for LLMs, wherein the model confidently outputs false information. This issue is also prevalent in autonomous agents. For instance, in [74], it was observed that when confronted with simplistic instructions during code generation tasks, the agent may exhibit hallucinatory behavior. Hallucination can lead to serious consequences such as incorrect or misleading code, security risks, and ethical issues [74]. To address this problem, one possible approach is to incorporate human correction feedback within the loop of human-agent interaction [64]. Further discussion of the hallucination problem can be found in [178].

Knowledge Boundary
An important application of LLM-based autonomous agents is to simulate different real-world human behaviors [121]. The study of human simulation has a long history, and the recent surge in interest can be attributed to the remarkable advancements made by LLMs, which have demonstrated significant capabilities in simulating human behavior. However, it is important to recognize that the power of LLMs may not always be advantageous. Specifically, an ideal simulation should accurately replicate human knowledge. In this regard, LLMs can be too powerful, as they are trained on an extensive corpus of web knowledge that far exceeds what ordinary individuals know. This excess knowledge can significantly reduce the fidelity of simulations. For instance, when attempting to simulate user selection behaviors for various movies, it is crucial to ensure that LLMs behave as if they have no prior knowledge of these movies. However, LLMs may already have acquired information about them. Without appropriate strategies, LLMs may make decisions based on their extensive knowledge, even though real-world users would not have access to the contents of these movies beforehand. Based on the above example, we may conclude that, for building a believable agent simulation environment, an important problem is how to constrain the LLM's use of knowledge that the simulated user would not have.
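As an illustration of the movie example, one conceivable strategy (an assumption for illustration, not one proposed in this survey) is to hide item identities from the LLM and expose only the attributes a real user would see before choosing. The sketch below builds such a prompt; the function names, the field names `title`, `genre`, and `year`, and the placeholder titles are all hypothetical.

```python
from typing import Dict, List, Tuple


def mask_items(items: List[Dict[str, str]]) -> Tuple[List[str], Dict[str, str]]:
    """Replace each title with a neutral alias; keep only user-visible fields."""
    lines, alias_to_title = [], {}
    for index, item in enumerate(items, start=1):
        alias = f"Movie {index}"
        alias_to_title[alias] = item["title"]
        lines.append(f"{alias}: genre={item['genre']}, year={item['year']}")
    return lines, alias_to_title


def build_prompt(items: List[Dict[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Build a selection prompt that never mentions the real titles."""
    lines, alias_to_title = mask_items(items)
    header = (
        "You are a user who has not seen any of these movies. "
        "Choose one using only the information listed.\n"
    )
    return header + "\n".join(lines), alias_to_title


if __name__ == "__main__":
    catalog = [
        {"title": "Placeholder Title A", "genre": "drama", "year": "2001"},
        {"title": "Placeholder Title B", "genre": "comedy", "year": "2015"},
    ]
    prompt, mapping = build_prompt(catalog)
    print(prompt)
    print(mapping)
```

Masking of this kind does not guarantee that the model cannot infer an item from the remaining attributes, so it addresses the problem only partially.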

Efficiency
Because of their autoregressive architecture, LLMs typically have slow inference speeds. However, the agent may need to query the LLM multiple times for each action, for example to extract information from the memory module or to make plans before acting. Consequently, the efficiency of agent actions is greatly affected by the speed of LLM inference. Deploying multiple agents with the same API key can further increase the time cost significantly.
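A back-of-the-envelope model shows how these costs compound. The sketch below assumes that the LLM calls within one action run sequentially and that a shared API key allows only a fixed number of agents to be served at once; both are simplifying assumptions, and the parameter `max_parallel_agents` and all numbers are illustrative rather than measured.

```python
import math


def action_latency(calls_per_action: int, seconds_per_call: float) -> float:
    """Latency of one agent action when its LLM calls run one after another."""
    return calls_per_action * seconds_per_call


def round_latency(num_agents: int, calls_per_action: int,
                  seconds_per_call: float, max_parallel_agents: int) -> float:
    """Time for every agent to act once, with agents served in parallel batches.

    max_parallel_agents is a hypothetical stand-in for a per-key concurrency limit.
    """
    batches = math.ceil(num_agents / max_parallel_agents)
    return batches * action_latency(calls_per_action, seconds_per_call)


if __name__ == "__main__":
    # One agent, three calls per action (e.g. memory, planning, acting), 2 s each.
    print(action_latency(3, 2.0))
    # Ten agents sharing a key that serves four agents at a time.
    print(round_latency(10, 3, 2.0, 4))
```

Under this model, the time per simulation round grows with both the number of calls per action and the number of agents sharing a key.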

Conclusion
In this survey, we systematically summarize existing research in the field of LLM-based autonomous agents. We present and review these studies from three aspects: the construction, application, and evaluation of the agents. For each of these aspects, we provide a detailed taxonomy to draw connections among the existing research, summarizing the major techniques and their development histories. In addition to reviewing the previous work, we also identify several challenges in this field, which are expected to guide potential future directions.

Figure 2: A unified framework for the architecture design of LLM-based autonomous agent.

Figure 3: Comparison between the strategies of single-path and multi-path reasoning. LMZSP represents the model proposed in [70].

Figure 4: Illustration of transitions in strategies for acquiring model capabilities.

Figure 5: The applications (left) and evaluation strategies (right) of LLM-based agents.

Table 3: Summary of the evaluation strategies of LLM-based autonomous agents (more agents can be seen at https://github.com/Paitesanshi/LLM-Agent-Survey). For subjective evaluation, we use ① and ② to represent human annotation and the Turing test, respectively. For objective evaluation, we use ①, ②, ③, and ④ to represent environment simulation, social evaluation, multi-task evaluation, and software testing, respectively. "✓" indicates that the evaluations are based on benchmarks.

Human similarity metrics: These metrics quantify the degree to which agent behaviors resemble those of humans. Typical examples include trajectory/location accuracy

(2) Social evaluation: This method utilizes metrics to assess social intelligence based on the agent interactions in simulated societies. Various approaches have been adopted, such as collaborative tasks to evaluate teamwork skills, debates to analyze argumentative reasoning, and human studies to measure social aptitude [122,1,23,99,104]. These approaches analyze qualities such as coherence, theory of mind, and social IQ to assess agents' capabilities in areas including cooperation, communication, empathy, and mimicking human social behavior. By subjecting agents to complex interactive settings, social evaluation provides valuable insights into agents' higher-level social cognition. (3) Multi-task evaluation: In this method, people use a set of diverse tasks from different domains to evaluate the agent, which can effectively measure the agent's generalization capability in open-domain environments [5,23,125,103,104,168,175]. (4) Software testing: In this method, researchers evaluate the agents by letting them conduct software testing tasks, such as generating test cases, reproducing bugs, debugging code, and interacting with developers and external tools [73,78,53,104]. Then, one can use metrics like test coverage and bug detection rate to measure the effectiveness of LLM-based agents.
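The two software testing metrics named above can be computed as simple ratios. The sketch below uses common definitions that are assumed here rather than specified by the survey: line coverage as the share of executable lines exercised by the agent's tests, and bug detection rate as the share of known bugs the agent found. The function names and the bug identifiers are hypothetical.

```python
from typing import Set


def line_coverage(covered_lines: Set[int], executable_lines: Set[int]) -> float:
    """Share of executable lines exercised by the agent-generated tests."""
    if not executable_lines:
        return 0.0
    return len(covered_lines & executable_lines) / len(executable_lines)


def bug_detection_rate(detected_bugs: Set[str], known_bugs: Set[str]) -> float:
    """Share of known (seeded or previously reported) bugs the agent detected."""
    if not known_bugs:
        return 0.0
    return len(detected_bugs & known_bugs) / len(known_bugs)


if __name__ == "__main__":
    # Line 9 is ignored because it is not an executable line of the program.
    print(line_coverage({1, 2, 3, 9}, {1, 2, 3, 4}))
    # BUG-7 is ignored because it is not in the set of known bugs.
    print(bug_detection_rate({"BUG-1", "BUG-7"}, {"BUG-1", "BUG-2"}))
```

Intersecting with the reference set keeps both ratios within [0, 1] even if the agent reports lines or bugs outside it.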