A Whole Brain Probabilistic Generative Model: Toward Realizing Cognitive Architectures for Developmental Robots

Building a humanlike integrative artificial cognitive system, that is, an artificial general intelligence (AGI), is the holy grail of the artificial intelligence (AI) field. Furthermore, a computational model that enables an artificial system to achieve cognitive development will be an excellent reference for brain and cognitive science. This paper describes an approach to develop a cognitive architecture by integrating elemental cognitive modules to enable the training of the modules as a whole. This approach is based on two ideas: (1) brain-inspired AI, which learns from the human brain architecture to build human-level intelligence, and (2) a probabilistic generative model (PGM)-based cognitive system, which is built for developmental robots by integrating PGMs. The development framework is called a whole brain PGM (WB-PGM), which differs fundamentally from existing cognitive architectures in that it can learn continuously through a system based on sensory-motor information. In this study, we describe the rationale of WB-PGM, the current status of PGM-based elemental cognitive modules, their relationship with the human brain, the approach to the integration of the cognitive modules, and future challenges. Our findings can serve as a reference for brain studies. As PGMs describe explicit informational relationships between variables, this description provides interpretable guidance from computational sciences to brain science. Given such information, researchers in neuroscience can provide feedback to researchers in AI and robotics on what the current models lack with reference to the brain. Further, it can facilitate collaboration among researchers in neuro-cognitive sciences as well as AI and robotics.


Introduction
Infants acquire a wide range of cognitive capabilities through daily physical and social interactions with their environment. Through this developmental process, they acquire basic physical skills (e.g., reaching and grasping), perceptual skills (e.g., object recognition and phoneme recognition), and social skills (e.g., linguistic communication and intention estimation) (Taniguchi et al., 2018b). This open-ended learning process involving many types of modalities, tasks, and interactions is often referred to as lifelong learning (Oudeyer et al., 2007;Parisi et al., 2019). The central question in next-generation artificial intelligence (AI) and developmental robotics is how to build an integrative cognitive system capable of lifelong learning and human-like behavior in various environments such as homes, offices, and outdoors. In this paper, inspired by the whole brain architecture (WBA) approach, using a whole brain probabilistic generative model (WB-PGM), we introduce the idea of building an integrative cognitive system that can alternatively be referred to as artificial general intelligence (AGI) (Yamakawa, 2021).
A cognitive architecture is a hypothesis about the mechanisms of human intelligence underlying our behaviors (Rosenbloom, 2011). The study of cognitive architecture involves developing a presumably standard model of the human mind (Laird et al., 2017). It integrates a wide range of cognitive capabilities: representation and memory, problem-solving and planning, learning, reflection, interaction, and the social aspects of cognition. Notably, interaction includes perception, motor control, and the use of language. Social aspects of cognition include intention sharing, emotional expression, collaborative control, and language use for communication and collaboration. In the literature on cognitive and developmental robotics, a robot is required to integrate a wide range of sensor information and perform a variety of cognitive tasks to explore environments, grasp and handle objects, and interact with people (Vernon et al., 2007, 2016; Doncieux et al., 2020; Tanevska et al., 2019). In such contexts, a cognitive architecture is required to integrate various elemental cognitive modules.
To build a cognitive architecture that addresses the functions of the entire brain, it is desirable to describe the computation of the entire brain with as few types of theoretical and computational elements (primitive structures, circuits, computational nodes, and so on) as possible (ideally, one type). The probabilistic generative model (PGM) is a strong candidate for a computational model for this purpose.
A PGM is a probabilistic description of how causes generate sensations, that is, observed data. In other words, a PGM is a statistical model of the joint probability distribution on observable data (Bishop, 2006). PGMs learn to predict these observations. This view is closely related to predictive coding and the free energy principle (FEP). FEP is a powerful idea for explaining the human brain. It is a normative framework for Bayesian inference and learning of the human brain and is based on a PGM (Hohwy, 2013; Friston, 2019). According to the FEP, perception and action can be modeled as self-evidencing (as described in Section 4.1). Many types of elemental cognitive modules have been developed based on PGMs, including hidden Markov models (HMMs), latent Dirichlet allocation (LDA), variational autoencoders (VAEs), and partially observable Markov decision processes (POMDPs) (Rabiner and Juang, 1986; Blei et al., 2003; Kingma and Welling, 2014; Thrun et al., 2005), as described in Section 3. Most PGMs can be represented using probabilistic graphical models. For a human-like developmental cognitive system, unsupervised (or self-supervised) learning is required. PGMs are known to be suitable for unsupervised learning. Supervised learning using human-annotated training data, on which most current AI systems, such as image recognition and machine translation systems (i.e., single-purpose cognitive modules), depend, is not a suitable approach to achieve lifelong learning (LeCun et al., 2015; Krizhevsky et al., 2012; Luong et al., 2015). Using human-annotated training data is impractical for lifelong learning conducted by autonomous agents. In addition to being capable of unsupervised learning, PGMs representing elemental cognitive functions can be integrated to learn together (as described in Section 4). Owing to these features, we construct a cognitive architecture based on PGMs.
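As a minimal illustration of the description above, a PGM with observed data x and latent causes z can be written as a joint distribution, and perception as posterior inference over z. The symbols x, z, and θ are notation introduced here for illustration and are not taken from the original text.

```latex
p_\theta(x, z) = p_\theta(x \mid z)\, p(z),
\qquad
p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{\int p_\theta(x \mid z')\, p(z')\, dz'}
```

Unsupervised learning then amounts to adjusting θ so that the marginal p_θ(x) assigns high probability to the observations, without any human-provided labels for z.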
Developing a cognitive architecture by integrating elemental cognitive modules provides us with a large degree of freedom with combinatorial complexity. We can reduce the design space of the integrative cognitive system using the human or animal brain architectures as a reference model. In the rapidly advancing field of neuroscience, researchers are beginning to have some comprehensive knowledge of the human and animal brain architectures and their anatomy, as well as the various neural activities that take place in them. However, such knowledge is not organized in a way that is suitable for effectively constraining the design space of cognitive architectures. Therefore, using the WBA approach, we define a data format called the brain reference architecture (BRA), which is a reference model of the brain, and describe knowledge in the field of neuroscience in a standardized way suitable for that application (Yamakawa, 2021). Adjacent research areas include biologically inspired cognitive architectures (Samsonovich et al., 2016; Goertzel et al., 2010) and cognitive computational neuroscience (Kriegeskorte and Douglas, 2018), which is an interdisciplinary field of cognitive science and computational neuroscience. However, compared to the BRA, these fields have not made progress in accumulating design data in a standardized manner. In neuroinformatics (Amari et al., 2002; Pradeep et al., 2013; Crasto, 2007), which develops data and knowledge bases for neuroscience, progress has been made in experimental data on anatomical structures (Kuan et al., 2015) and physiological phenomena (Poldrack and Gorgolewski, 2017). However, no progress has been made in accumulating data such that it can be used to design cognitive and behavioral functions, as in the BRA-driven development.
However, even when we seek to build an AGI based on human-brain architectures, that is, using the WBA approach and having a reference model for the architecture, two challenges must be overcome. First, there are numerous alternative machine learning approaches to choose from to develop and integrate elemental models into a cognitive architecture. If each elemental cognitive module is developed using an arbitrarily chosen machine learning approach, it becomes difficult to integrate them into a coherent framework. Second, developmental cognitive architectures should be able to make all cognitive modules learn together based on the real-world sensory-motor information obtained by a robot. This is crucial for achieving lifelong learning. However, most elemental cognitive modules, that is, AI functions, have been developed independently under different design principles. To overcome this problem, a PGM-based approach is proposed in this paper to create cognitive architectures for developmental robots. We propose that developing elemental cognitive modules and integrating them based on the theory of PGMs, specifically, the SERKET framework, can solve the abovementioned problems (Taniguchi et al., 2020c) (as described in Section 4.2).
Inspired by the human brain architecture, we propose an approach wherein a cognitive architecture is built by integrating PGM-based cognitive modules to fully mimic the human cognitive system. The framework for the integrative model is called WB-PGM. This perspective paper describes the approach and the rationale, the current status, and the future challenges of WB-PGM. The remainder of this paper is organized as follows. Section 2 describes the rationale and design principle for our approach including its background. Section 3 reviews the relevant literature on elemental cognitive modules that can constitute the WB-PGM. Section 4 describes the PGM-based cognitive system integrating multimodal information (i.e., the world model) and the SERKET framework, which enables the efficient integration of PGM-based elemental cognitive modules. Section 5 describes a possible example of a WB-PGM and the future path of the development, interdisciplinary communication, and collaboration, specifically the fusion of AI and brain science. Section 6 summarizes the conclusions.

Rationale and Design Principle for WB-PGM
To develop a cognitive architecture for realizing embodied AGI, we require a theoretical and practical development framework that enables researchers to efficiently design integrative artificial cognitive systems capable of lifelong learning, similar to the human brain. Such systems must integrate multimodal information and developmentally learn numerous cognitive skills. The proposed WB-PGM is a development framework that can contribute to overcoming these problems. This section describes the perspective of the WB-PGM and the rationale behind it.

WB-PGM
To develop a multimodal cognitive system, beyond the development of machine learning algorithms and signal processing modules that emphasize performance of a specific task, it is necessary to consider the following points: 1) machine learning algorithms (theory) considering the correspondence with the structure of the brain, 2) realization of the whole structure rather than only a part, 3) interaction between the body, including the sensors and the real environment, 4) developmental learning considering timeline, and 5) consideration of computational and energy efficiency. Here, we describe the ideas, current trials, and challenges of the WB-PGM for developing a holistic approach.
Several related studies have investigated function-based modelling of the entire brain. For example, Eliasmith et al. (2012) proposed a neural architecture, "Spaun," which models the entire brain function. This study aimed to elucidate the mechanism by bridging the gap between the neuron response and function as a whole. Sagar et al. (2016) developed a system, "BabyX," that simulates the entire brain functions. A sensor-motor system using a functional model of the brain could generate realistic facial expressions. These studies aimed to realize a holistic model, similar to our study. However, simulating the complexity of the brain functions as a whole has not been realized. This is because these systems are difficult to extend. In addition, they have focused on limited tasks.
Further, an appropriate cognitive architecture for developmental robotics should enable a robot to perform a wide range of tasks in the real environment with its body through interactions and long-term learning capabilities.
The idea behind the WB-PGM is to combine the recent developments in generative models from the field of machine learning with the latest knowledge of the structure and functional level of the entire brain. Thus, it aims to reproduce the flexible cognitive functions of humans, which cannot be achieved by the current single-purpose-oriented AI, which is often built on "discriminative" models. Moreover, neuroscience, which tends to study partial regions and specific functions of the brain (with a so-called worm's-eye view), may not efficiently grasp the whole structure at once. Therefore, adopting the bird's-eye approach that can be operated/verified as a whole using our proposed WB-PGM model may help realize AGI.
However, mapping the structure of the entire brain onto a machine learning model is not an easy task. This is because the neuroscience knowledge required to build it is vast and complex. Further, it is extremely difficult for a particular individual to design software for the entire brain. To solve this problem, we propose to use BRA-driven development that has been cultivated in the WBA literature (see Section 2.3). One of the most significant challenges facing the WB-PGM approach is matching the WB-PGM and BRA. Currently, efforts to match them are underway (Taniguchi et al., 2021a) (see Section 2.4.3 as an example of this effort). Fig. 1 illustrates an overview of the WB-PGM.

Figure 1: Constructing WB-PGM based on the functional structure of the brain via BRA. Please note that this figure is intended to provide an overview of the WB-PGM development process. Model refinement using evaluation process (see Section 2.5) is also considered.

WB-PGM
Miyazawa et al. have developed a representative prototype of an integrated cognitive model (Miyazawa et al., 2019a,b) based on the structure proposed by Doya (Doya, 1999) instead of using the BRA. Hence, the prototype is based on the hypothesis that the cerebellum, basal ganglia, and cerebral cortex are specialized in supervised, reinforcement, and unsupervised learning paradigms, respectively (Doya, 1999). Our first focus is on the unsupervised learning paradigm in the cerebral cortex, which is realized by PGMs that map observations to latent variables. This paradigm is considered as a basic module, while the reinforcement learning (RL) module corresponding to the basal ganglia and the motor control part of the robot that can be considered as supervised learning corresponding to the cerebellum are connected (Miyazawa et al., 2019b). Furthermore, in the prototype, HMM is connected as the temporal learning mechanism to enable the robots to perform planning for longer periods of time using dynamic programming such as the Viterbi algorithm (Miyazawa et al., 2019a). Various prototypes of WB-PGM, inspired by the above basic hypothesis of the brain, were implemented in an actual robot to enable basic learning (Araki et al., 2013;Nishihara et al., 2017;Miyazawa et al., 2019b).
For example, Nishihara et al. (2017) showed that a robot can actually learn object concepts and associations among concepts and words through interactions with humans using many real objects. Miyazawa et al. (2019b) extended this model further and showed that the robot can learn to use objects through its own experiences. Moreover, Taniguchi et al. (2020a) showed that the robot can acquire the concept and name of a place through interactions with its human counterpart.
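The dynamic-programming step mentioned above can be sketched with a generic Viterbi decoder for a discrete HMM. This is a textbook version written for illustration, not the implementation used in the cited prototypes; the function name, the argument layout, and the toy probabilities are all assumptions.

```python
import math


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path of a discrete HMM.

    obs           : list of observation indices
    start_p[s]    : p(state_0 = s)
    trans_p[r][s] : p(state_t = s | state_{t-1} = r)
    emit_p[s][o]  : p(obs_t = o | state_t = s)
    """
    def log(p):
        # Work in log space to avoid numerical underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(start_p)
    score = [log(start_p[s]) + log(emit_p[s][obs[0]]) for s in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor of state s at this time step.
            best = max(range(n_states), key=lambda r: prev[r] + log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + log(trans_p[best][s]) + log(emit_p[s][o]))
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model used only to exercise the function.
path = viterbi(
    obs=[0, 0, 1, 1],
    start_p=[0.5, 0.5],
    trans_p=[[0.7, 0.3], [0.3, 0.7]],
    emit_p=[[0.9, 0.1], [0.1, 0.9]],
)
```

In a planning setting, the same recursion would be run over future time steps toward a desired state instead of over past observations; that use is not shown here.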
To deal more closely with the anatomical structure of the brain, we need to proceed with coordination between the PGM-based cognitive architecture and BRA (see Sections 2.3 and 2.4). This process also leads to the challenge of scaling up the combination of PGMs to achieve a large-scale WB-PGM (see Sections 3 and 4).

PGMs for Cognitive Systems
The idea that human brain functions can be described using various machine learning modules is unprecedentedly popular. In particular, PGM-based machine learning modules learn to predict observations. Improving this prediction capability is a general criterion for mathematical models of cognitive systems. PGMs have been used to explain brain functions in many contexts from a Bayesian perspective.
For example, it is believed that the hippocampus performs simultaneous localization and mapping (SLAM) (Ball et al., 2013;Tolman, 1948;O'keefe and Nadel, 1978). From a theoretical viewpoint, to perform SLAM, it is often assumed that actions, states, and observations follow a POMDP, and a map is defined as a global parameter of an observation model. Meanwhile, the Bayesian inference on the POMDP, a type of PGM, is regarded as a function of localization and mapping.
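The localization half of this view can be sketched as a discrete Bayes filter over a POMDP. The sketch below assumes that the map (here, the observation model) is already known, so it does not cover mapping; the function name, the table layout, and the three-cell corridor are hypothetical.

```python
def bayes_filter_step(belief, action, observation, transition, obs_model):
    """One predict/update step of a discrete Bayes filter.

    belief[s]                : current probability of hidden state s
    transition[a][s_prev][s] : p(s | s_prev, a)
    obs_model[s][o]          : p(o | s); plays the role of the known map
    """
    n = len(belief)
    # Prediction: propagate the belief through the motion model.
    predicted = [
        sum(transition[action][s_prev][s] * belief[s_prev] for s_prev in range(n))
        for s in range(n)
    ]
    # Correction: weight by the observation likelihood and renormalize.
    unnormalized = [obs_model[s][observation] * predicted[s] for s in range(n)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


# Hypothetical circular corridor with three cells; action 0 moves one cell right.
MOVE_RIGHT = [[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]]
# Observation 0 = "wall", 1 = "door"; only cell 0 has a door.
OBS_MODEL = [[0.1, 0.9], [0.9, 0.1], [0.9, 0.1]]

# Start from a uniform belief, move once, then observe a door.
belief = bayes_filter_step([1 / 3] * 3, 0, 1, MOVE_RIGHT, OBS_MODEL)
```

Treating the entries of `obs_model` as unknown parameters and inferring them jointly with the state is what turns this localization step into SLAM.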
The hypothesis that the basal ganglia are responsible for reinforcement learning is also widely accepted (Barto, 1995;Montague et al., 1996;Doya, 2007). Conventionally, RL is formulated as a problem in which an agent optimizes its policy function to maximize the expected cumulative future rewards in an environment modeled as an MDP, a type of PGM (Sutton and Barto, 2018). However, the concept of control as probabilistic inference clarifies that RL can be converted into an inference problem on an extended PGM for MDP (Levine, 2018) (see Section 3.3). This also inspired us to model brain functions using PGMs.
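The conversion referred to above is commonly written with a binary optimality variable O_t attached to each time step, following the formulation in Levine (2018). The notation below (s_t, a_t, r, τ) is introduced here for illustration, and rewards are assumed to be non-positive so that the exponential is a valid probability.

```latex
p(O_t = 1 \mid s_t, a_t) = \exp\big(r(s_t, a_t)\big),
\qquad
p(\tau \mid O_{1:T} = 1) \propto p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \exp\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)
```

Under this extended PGM, finding a good policy corresponds to inferring the actions that are probable given that every O_t is observed to be 1.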
In a series of studies on symbol emergence in robotics, many unsupervised learning systems for robots have been developed based on PGMs to help observe multimodal sensory signals, model environments, acquire languages, and adopt behaviors (Taniguchi et al., 2016b). Multimodal latent Dirichlet allocation (MLDA) is a basic example of such a process (Nakamura et al., 2012). The MLDA was able to integrate visual, auditory, haptic, and linguistic sensor information and form object categories without supervision. Many variants and extensions of the MLDA, including the nonparametric Bayesian extension, have been proposed (Nakamura et al., 2011b,a). A series of studies have suggested that a PGM-based approach has the potential for building an integrative cognitive system for developmental robots.
Advancements in deep learning have extended the capability of PGMs by inducing deep PGMs (DPGMs) (Kingma and Welling, 2014;Goodfellow et al., 2014;Suzuki et al., 2016). DPGMs use deep neural networks as part of the generative process. They can extract features, that is, they are capable of representation learning. In the 2010s, most PGMs proposed in conventional studies to develop integrative cognitive systems in symbol emergence in robotics were based on classical models of probabilistic distributions such as categorical, Gaussian, Wishart, and Dirichlet distributions (Taniguchi et al., 2016b). However, these conventional integrative PGMs were not capable of feature extraction. To tackle this problem, DPGMs can enable us to develop more flexible cognitive systems, maximizing the representation learning capability of deep neural networks.
Based on this evidence, we argue that a PGM is a reasonable approach to describe the whole brain cognitive system.

Overview of BRA-driven development for WB-PGM
As already mentioned in Section 2.1, to build a versatile cognitive system similar to a human, the entire system needs to be designed on a large scale. However, since that design space is vast, it is advantageous to constrain the design space by mimicking the architecture of the brain. As a way to embody this, we use BRA-driven development extended for PGM. BRA-driven development is a method of constructing software that reproduces human-like cognitive functions by referring to the neural circuits of the entire brain (Yamakawa, 2021). BRA data, which play a central role in BRA-driven development, form a reference model consisting of a brain information flow (BIF), which extracts mesoscopic-level anatomical knowledge related to human cognitive behavior, and a hypothetical component diagram (HCD), which shows the structure of functional components organized consistently with respect to the BIF, as shown in Fig. 2.
The BRA-driven development process consists of the construction and evaluation processes, mediated by BRA data. In the conventional construction process, BIF is first designed by the structure-constrained interface decomposition (SCID) method described below, HCD is created under the constraints of BIF, and then, brain-type software is implemented using the HCD as specification information. In the current WB-PGM development, a process called generation-inference process allocation (GIPA) is added to convert HCD into PGM before implementation (see Section 2.4). In contrast, the evaluation process consists of "adequacy evaluation," which confirms that the BRA is consistent with existing brain science findings, and "fidelity evaluation," which evaluates whether the brain-type software is implemented in a manner consistent with the BRA.
As shown in Fig. 2, BRA-driven development separates the design of the BRA from the implementation of the software based on it, thus enabling multiple brain scientists and multiple software developers to collaborate on large-scale development.
BRA data (Sasaki et al., 2020) essentially consist of BIF and HCD. More precisely, findings from neuroscience describe not only BIF, which is anatomical structural information, but also neural activity and the processes that constitute it, which are omitted in this description.

BIF: A BIF is an information flow diagram that describes the anatomical structure of the entire brain at the mesoscopic level, without assuming any specific task. It is constructed by analyzing findings from neuroscience, such as data and connectome (Negishi et al., 2019). It consists of nodes, called circuits, and directed links, called connections (see Fig. 3 (A)). Consequently, BIFs can provide the basis for an architecture that combines numerous computational mechanisms. To prevent too fine granularity, the lower bound of granularity is defined as a population of (sub-)types of neurons that are considered almost homogeneous and belong to the same brain organ. This population of neurons is called a uniform circuit (see Fig. 3 (A)).

Figure 2: WB-PGM development as extended BRA-driven development. A development method that extends BRA-driven development, which utilizes BRA data such as BIF and HCD, by adding GIPA, thus creating PGMs. The upward direction is the construction process, while the downward direction is the evaluation process.
HCD: An HCD is a diagram that breaks down the task or function performed by a particular region of interest (ROI) in the brain into its functional components, where the structure of the functional decomposition must be consistent with the mesoscopic level anatomical structure (see Fig. 3 (B)).
A component diagram is a major type of diagram in the unified modeling language for modeling the structure of object-oriented software (Bell, 2004). It illustrates the static aspects of the operating principles of software through a network of components that perform computational functions and the semantics of the dependencies among those components in any complex system. As demonstrated subsequently, the PGM developed in this study is based on the component diagram.

Construction of WB-PGM
The construction process of the WB-PGM consists of the design process of BRA by the SCID method and the conversion process from HCD to PGM by GIPA, as described below.

SCID method for designing BRA
The SCID method is a protocol for designing HCD, which is brain-inspired software specification information, based on knowledge from the neuroscience field (Yamakawa, 2021;Fukawa et al., 2020;Yamakawa, 2020). The method constructs a BIF for a specific ROI based on anatomical knowledge and then designs an HCD that can realize the top-level function (TLF) of the ROI consistent with the BIF.
The SCID method consists of three steps, of which Steps 1 and 2 are often executed back and forth. Conventional computational neuroscience models neural activity in the brain by interpreting it based on changes in the external world, but the range of neural activities that can be interpreted in this way is limited. In contrast, the SCID method can organize the structure of functions to achieve its goal, consistent with anatomical knowledge of the brain, at the mesoscopic level. Since this anatomical knowledge is now being accumulated over a relatively wide region of the brain, the SCID method can potentially be used to create HCD over a wide region of the brain.

GIPA for mapping HCD to PGM
This section provides a foothold for building a probabilistic graphical model representation of a PGM that is consistent with the BRA data. As mentioned in Sections 2.3 and 2.4.1, the HCD designed using the SCID method represents a dependency structure between components corresponding to each specific brain region. Thus, to construct a probabilistic graphical model corresponding to a brain neural circuit, the dependency interfaces in the HCD should be classified such that they correspond either to the generative or inference processes. This task is called GIPA (Taniguchi et al., 2021a). Here, the links in both the generative and inference processes can be considered to be specialized dependencies. In other words, GIPA should be performed for every interface (Taniguchi et al., 2021a).
Problems: In preparation, the dynamic recurrent property should be discussed with respect to PGMs and brain structures. First, there are many loops in the brain's anatomical circuits, whereas a PGM needs to be a directed acyclic graph. In most cases, it is difficult to assign acyclic PGMs to brain circuits.
The PGM is a model that provides a consistent link structure in the data generation process using directed links that represent the signal transfers between random variables. When inferring latent variables, an inference model is used to calculate the posterior probability distribution of the latent variables conditioned by the observed values. In ordinary PGMs, signal propagation in the inference process causes signals to propagate in the opposite direction to the links used in the generation process. In contrast, in brain neural circuits, signal propagation between the regions by electrical spikes that propagate terminally on axons is essentially unidirectional. Therefore, to realize a PGM in its normal form, the condition, "whenever there is a connection between two regions, it is a mutual connection," must be satisfied in the brain. However, in most regions of actual brain neural circuits, satisfying this condition is difficult.
Solution strategy: To avoid this problem, we adopt an amortized inference to define the link structure of the inference process independently of the link structure of the generative process (Gershman and Goodman, 2014). The amortized inference is a type of variational inference, an approach that introduces functions for efficient approximate inferencing of latent variables. A typical example of this is the VAE, which is an auto-encoding variational Bayesian model (Kingma and Welling, 2014). Modeling using amortized inference is illustrated in Fig. 3 (C). The model used for the amortized inference can be designed with a high degree of freedom, as long as it is consistent with the link structure of the generative process. Therefore, in this type of probabilistic graphical model, it is easy to relate the link structure to the actual structure of the brain neural circuits.
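For reference, amortized variational inference as used in the VAE can be summarized by the evidence lower bound below. The symbols (generative parameters θ, inference parameters φ, observation x, latent variable z) are standard notation introduced here, not notation from the original text.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
- D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The generative process p_θ(x | z) p(z) and the inference model q_φ(z | x) are separate functions with separate parameters, which is what allows their link structures to be specified independently.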
The major interarea connections of the neocortex can be allocated to either the generation or the inference process. In the neocortex, there is a feedforward pathway that transmits signals from lower to higher areas while processing signals received by sensors, and a feedback pathway that transmits signals in the opposite direction (Markov et al., 2013, 2014) (see Table 1). In computational neuroscience theories, such as the Bayesian brain (Friston, 2012) and predictive coding (Rao and Ballard, 1999), inference and generation are assumed to be processed by the feedforward and feedback pathways, respectively. The HCD for the neocortical interarea connections was designed based on these findings (Yamakawa, 2020). During GIPA in the neural circuits adjacent to the cortex, care must be taken to avoid inconsistencies in GIPA at the interface with the cortex.

Table 1 (entries given as feedforward pathway / feedback pathway):
Direction: from lower to higher areas / from the internal state to the outside world (from higher to lower areas)
Laminar on cortical microcircuits (Markov et al., 2013, 2014): Layer 3, and Layer 4 / Layer 2
Meaning of signals (Yamakawa, 2020): Observation / Prediction

Figure 3: (B) HCD associated with the above BIF. The diagram is constructed using the SCID method to assign functions to the components that correspond to the BIF and perform SLAM functions. r: Cluster information regarding positions; H: Pattern separation/completion, information integration; X and X′: Self-posture; g: Prediction at the future time regarding movement/speed amount or posture; R_POR: Allocentric visual information; u: Rotational speed movement. (C) A probabilistic graphical model created from the above HCD. Probabilistic graphical models are converted from an HCD by GIPA, that is, assigning all interfaces on that HCD to either generative or inference processes. The flat arrow with ∆t indicates the generation of the variable in the next time step.

Construction example: PGM for hippocampal formation
An example of a PGM obtained by applying GIPA to the hippocampal formation (HPF) (Taniguchi et al., 2021a) is illustrated in this section. As shown in Fig. 3, the BIF of the hippocampus and entorhinal cortex (Fig. 3 (A)) is constructed first. Then, the HCD (Fig. 3 (B)) is designed using the SCID method. The PGM of the HPF (Fig. 3 (C)) is constructed by performing GIPA. The dotted arrows represent the inference process, while the solid arrows represent the generative process.
The connection from POR to MEC II superficial on the BIF can be regarded as a feed-forward pathway. According to Table 1, an inference process can be allocated to the connection from variable R_POR to variable X in the probabilistic graphical model representation. Similarly, the connection from the deep MEC to the POR on the BIF can be regarded as a feedback pathway. According to Table 1, a generation process can be allocated to the connection from variable g to variable R_POR in the probabilistic graphical model representation. Nevertheless, there exist limitations to performing GIPA in terms of the inside of the hippocampus when considering only the connectivity with the neocortex. Therefore, the engineering formulation of the SLAM modeled as PGM is used as a reference.
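The allocation rule applied to these two connections can be sketched as a small lookup. The data layout, the names `Process` and `gipa`, and the pathway labels are hypothetical and serve only to illustrate the rule that feedforward connections are assigned to inference and feedback connections to generation; connections inside the hippocampus, which require the SLAM reference discussed above, are not handled.

```python
from enum import Enum


class Process(Enum):
    INFERENCE = "inference"
    GENERATION = "generation"


# Rule for neocortical interarea connections stated in the text.
PATHWAY_TO_PROCESS = {
    "feedforward": Process.INFERENCE,
    "feedback": Process.GENERATION,
}


def gipa(interfaces):
    """Allocate every (source, target, pathway) interface to a process.

    Raises ValueError for an interface whose pathway type is unknown, because
    every interface must receive an allocation.
    """
    allocation = {}
    for source, target, pathway in interfaces:
        if pathway not in PATHWAY_TO_PROCESS:
            raise ValueError(f"cannot allocate {source} -> {target}: {pathway!r}")
        allocation[(source, target)] = PATHWAY_TO_PROCESS[pathway]
    return allocation


# The two connections discussed above, written as variable-level links.
allocation = gipa([
    ("R_POR", "X", "feedforward"),  # POR -> MEC II superficial
    ("g", "R_POR", "feedback"),     # deep MEC -> POR
])
```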
The graphical model representation of the PGM for HPF in Fig. 3 (C) is constructed to be consistent with SLAM's PGM using the GIPA procedure based on the above discussion. For engineering SLAM, we consider a PGM that estimates the future self-posture X(t + 1) directly from the current self-posture X(t). In contrast, the self-posture X at the next time step in HPF is generated via variables such as H, r, X′, and g. Since the probabilistic graphical model representation of the HPF in Fig. 3 (C) is degenerate over time, the generation process appears, at first glance, to contain a cycle. To make it clear that cycles that advance in time (as in POMDPs and state-space models such as the Kalman filter) are acceptable for PGMs, the notation "next time generation process" is introduced. Generation with a one-time-step progression is represented by a double line orthogonal to the generation arrow, plus the symbol ∆t. Note that the position in the loop of a PGM at which the time progression is allocated is arbitrary.
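One way to make this condition concrete is to check that the generative links form a directed acyclic graph once the links marked with ∆t are set aside. The sketch below uses Python's standard `graphlib` module; the link list is a hypothetical toy loop over some of the variable names above and is not the actual link structure of Fig. 3 (C).

```python
from graphlib import CycleError, TopologicalSorter


def is_valid_generative_structure(links):
    """Check that generative links are acyclic within a single time step.

    links: iterable of (parent, child, next_time) tuples, where next_time is
    True for a link drawn with the delta-t marker. Such links connect
    different time steps, so they are ignored when testing for cycles.
    """
    sorter = TopologicalSorter()
    for parent, child, next_time in links:
        sorter.add(parent)
        if next_time:
            sorter.add(child)          # keep the node, drop the dependency
        else:
            sorter.add(child, parent)  # child depends on parent
    try:
        sorter.prepare()               # raises CycleError if a cycle exists
    except CycleError:
        return False
    return True


# Hypothetical loop: acceptable only if one link advances time.
with_time_step = [("H", "r", False), ("r", "X", False),
                  ("X", "g", False), ("g", "H", True)]
without_time_step = [(p, c, False) for p, c, _ in with_time_step]

ok = is_valid_generative_structure(with_time_step)
not_ok = is_valid_generative_structure(without_time_step)
```

Moving the ∆t marker to a different link of the same loop leaves the check satisfied, which mirrors the arbitrariness noted above.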

Evaluation of WB-PGM
To ensure that the developed software mirrors the brain efficiently, the evaluation method proposed for BRA-driven development (Yamakawa, 2021) can be used. It comprises the two evaluations described below.
The first method entails evaluating the software adequacy by estimating the consistency between existing neuroscientific findings and BRA. The second method is the fidelity evaluation, wherein the reproducibility of the BRA in the brain-inspired software is evaluated.

Evaluating adequacy
Adequacy evaluation can be divided into that for BIF and that for HCD.

1) Adequacy evaluation of BIF
The consistency of the anatomical structures and neural activity described in the BIF with those described in neuroscientific papers and data is evaluated.
Two main inspection criteria are used to verify that the description of BIF is sufficient. The first criterion is ensuring that the description element of the structure or phenomenon that is provided in the data submitted for registration is not already registered in the BRA database (novelty). The other criterion is that the element must be directly or indirectly supported by current neuroscientific findings (authenticity). As a rule, the authenticity of facts is guaranteed by their inclusion in one or more peer-reviewed articles.

2) Adequacy evaluation of HCD
The functionality of the HCD and its consistency with the BIF are evaluated to determine whether the process generated by the behavior of the structured components in the HCD can achieve the goals of the ROI.
The consistency evaluation determines whether the HCD corresponds to the description of the BIF according to two aspects: (1) the dependency structure of the HCD corresponds to the anatomical structure contained in the ROI of the BIF, and (2) the behavior of the components within the HCD is consistent with the physiological findings described in the BIF.

Evaluating fidelity
The biological plausibility of brain-inspired software is evaluated by comparing it with BIF and HCD in the BRA data. The estimated degree of consistency between the software and BRA is referred to as the fidelity.
To date, four methods have been explored for the evaluation of fidelity.
• Structural similarity: An evaluation of how strongly the static structure of the software matches the BIF in the BRA.
• Functional similarity: An evaluation of how strongly the behavior of a particular component that is implemented during the execution of a specific task matches the behavior (e.g., behavior timing) that is designed in the HCD in the BRA.
• Activity reproducibility: An evaluation of how effectively the behavior of a certain variable in the internal components of the software implemented according to the BRA reproduces the characteristics of neural activity (e.g., activity timing and patterns) in the corresponding brain region during the execution of a specific task.
• Performance: An evaluation of the performance and ability of the software as a whole (integrative testing).
Among these evaluation methods, structural similarity and performance are easy to use for the evaluation of the whole software. In contrast, functional similarity and activity reproducibility are useful for unit tests of each component as well as for integrative development. Furthermore, it is possible to consider an evaluation method wherein dysfunctional states are induced by intentionally destroying or ablating parts of the software, and the resulting behavior is compared with brain functioning under conditions such as mental illness or brain injury.
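As a rough illustration of how a structural-similarity score could be computed, the sketch below compares the directed connections of a piece of software with those registered in a BIF. The Jaccard overlap, the function name, and the toy region names are assumptions made for this example only; they are not the evaluation procedure defined for BRA-driven development.

```python
def structural_similarity(software_edges, bif_edges):
    """Jaccard overlap between two sets of directed edges (source, target).

    Hypothetical metric: 1.0 means identical connectivity, 0.0 means no
    shared connection.
    """
    software, bif = set(software_edges), set(bif_edges)
    if not software and not bif:
        return 1.0  # two empty graphs are trivially identical
    return len(software & bif) / len(software | bif)


# Toy data with placeholder region names (not real anatomy).
bif_edges = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R4", "R5")]
software_edges = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R4", "R2")]

score = structural_similarity(software_edges, bif_edges)
```

Here three of the five distinct connections are shared, so the score is 3/5. A practical measure would likely also need to weight connections and handle differences in granularity between the software and the BIF.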

WB-PGM for Lifelong-learning Robots (AGI)
At some point in the future, BRA may be able to cover the entire brain, the PGM-based design that is being developed based on BRA may also cover the entire brain, and the construction of WB-PGM may help in realizing lifelong-learning robots, that is, AGI. A PGM-based approach is suitable for developing an integrative cognitive architecture for lifelong-learning robots because it enables integrative cognitive systems to perform unsupervised learning that does not require human-annotated data for training. Since PGM-based cognitive systems learn internal models, representations, and behaviors by inferring latent variables of the system, they can maximize the marginal likelihood of sensory-motor observations, with the learning process requiring only sensory-motor information (i.e., performing unsupervised learning). This process is also called predictive coding. In particular, the WB-PGM is expected to be able to adapt the entire cognitive system for lifelong learning by exploiting findings from the fields of neuroscience and cognitive science in its development.
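The marginal-likelihood objective mentioned above can be written in a standard form. The notation is an assumption of this sketch: o denotes the sensory-motor observations, z the latent variables, p_θ the generative model, and q_φ an approximate posterior used for inference. Because the marginal likelihood is usually intractable, a variational lower bound is commonly maximized instead:

```latex
\log p_\theta(o)
  = \log \int p_\theta(o, z)\, dz
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid o)}\!\left[ \log p_\theta(o, z) - \log q_\phi(z \mid o) \right]
```

The bound is tight when q_φ(z | o) equals the true posterior p_θ(z | o), which is why learning and the inference of latent variables can be treated as a single optimization.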
Many scientific fields attempt to reach a better understanding of the human mind, including cognitive science, neuroscience, and robotics, with each field adopting a unique approach. In particular, cognitive architectures are often regarded as the standard model of a human-like mind (Laird et al., 2017).
The study of cognitive architectures has a long history, and the proposed architectures include Soar, ACT-R, Sigma, and their variants (Laird, 2012;Rosenbloom et al., 2016;Anderson, 2009). Traditional models were based on symbolic AI, whereas more recent approaches such as neural networks and PGMs have also been considered for modeling cognitive architectures. However, most studies on cognitive architectures adopted a top-down theoretical approach and were rarely implemented and tested on embodied artifacts in real-world physical and social scenarios, such as those addressed in cognitive and social robotics. That is, they were not based on real-world multimodal information, unlike modern AI (e.g., deep learning-based pattern recognition and synthesis as well as concept formation and representation learning) (LeCun et al., 2015;Fadlil et al., 2013).
Therefore, the proposed WB-PGM should be validated by evaluating it on real-world tasks. As human intelligence has evolved throughout the history of environmental adaptation, our cognitive systems were developed to survive in a real-world environment, involving physical and social interactions. Consequently, a human-like cognitive system, that is, AGI, should be evaluated in a real-world environment where the human cognitive system evolved. In other words, if some tasks cannot be achieved in a real-world environment by a developed AI system, then the corresponding aspects of human intelligence are missing. Therefore, a cognitive system should learn various skills by organizing multimodal sensory-motor information observed by the system itself. For this purpose, the cognitive system ought to have a body to actively and autonomously explore the physical and social environment. That is, it should be tested using robots that have bodies to act in a real environment. This is part of the performance test for the fidelity evaluation mentioned in Section 2.5.
From the viewpoint of brain science, where researchers seek to understand how human minds and brains work in greater detail, a standard model, such as the WB-PGM, can provide top-down guidance to interpret actual experimental results, which are always obtained from a partial cognitive process and under limited conditions. Furthermore, such a model can suggest new experiments for efficiently uncovering the mystery of the human brain and cognition. Therefore, the cognitive architecture in this study is located at the intersection of neuroscience, AI, and robotics. There is strong evidence that the proposed WB-PGM can also extend neuroscience studies.

PGM-based Cognitive modules
The WB-PGM is being developed by integrating cognitive modules into a single PGM. A wide range of elemental cognitive modules has been developed with regard to their respective cognitive capabilities. This section provides a survey of the PGM-based cognitive modules to further integrate them to realize the WB-PGM.
As mentioned in the Introduction, if the entire brain-like cognitive structure can be constructed as a unified model called PGM, its development will be more efficient. However, the question remains as to whether each part in the brain can be treated as a PGM. Here, we outline that the computational elements of each area of the brain involved in higher-order cognitive functions can be represented generally as a module of the PGM. Historically, information processing, primarily in the visual cortex, has been modelled as a PGM. However, all neocortical areas are composed of canonical microcircuits with homogeneity to some extent (Douglas et al., 1989;Bastos et al., 2012;Beul and Hilgetag, 2014), and the network between neocortical areas constitutes a counter stream of feedforward and feedback systems (Markov et al., 2013). As this mechanism allows us to regard the flow of observational and predictive signals as opposing (Yamakawa, 2020), it acts as the basis for the PGM mechanism for the whole brain. Recently, it has also been pointed out that the PGM mechanism in the neocortex has the duality of cognition and control (Doya, 2021) (See also Section 4.1). Additionally, it is reasonable to consider the hippocampus, which is connected to the neocortex via the entorhinal cortex, as a PGM because it performs computations similar to SLAM, which is essentially described as a PGM (Taniguchi et al., 2021a). Further, the basal ganglia and amygdala can estimate the desirability of a certain state from the system's viewpoint, and the neocortex can use this information to perform optimal control as a PGM (see Section 3.2). Moreover, the cerebellum can also be thought of as a mechanism that accelerates the PGM by partially extracting the computations performed by the neocortex and basal ganglia and quickly performing alternative computations (see Section 3.3).

Visual perception and representation learning
The mammalian brain has a minimum of two processing modules for each sensory modality, one for action and the other for recognition or consciousness (Ungerleider, 1982;Goodale and Milner, 1992;Sakagami and Pan, 2007). For example, the visual information in the retina is sent to the primary visual cortex (V1), which includes two cortical pathways. The dorsal pathway from V1 to the parietal cortex determines the spatial layout of objects and computes their disposition for actions. The ventral pathway from V1 to the inferotemporal cortex mediates object recognition and contributes to the formation of our cognitive world. Anatomical data further reveal that the parietal cortex projects primarily to the premotor and dorsolateral prefrontal cortices (DLPFC) (Pandya and Seltzer, 1982). Specifically, the reciprocal connection between the parietal cortex and DLPFC contributes to spatial attention and spatial working memory (Constantinidis and Klingberg, 2016). In contrast, the inferotemporal cortex has many efferents to the prefrontal cortex, especially the ventrolateral prefrontal cortex (VLPFC) (Ungerleider et al., 1989). In particular, the projection from the inferotemporal cortex to the VLPFC appears to be critical for concept formation (e.g., categorization) and generation of new information (deductive inference) (Pan et al., 2008;Tanaka et al., 2015). Recently, Bengio et al. (2013) have demonstrated that their machine learning algorithm could successfully simulate representation learning in the ventral pathway.
In representation learning, a good representation entails generalizability to arbitrary tasks, with various hypotheses proposed as properties that such a representation should satisfy (Bengio et al., 2013;Goodfellow et al., 2016). Inspired by the idea of human concept formation, one of the most important proposed hypotheses is disentanglement (Higgins et al., 2016), which holds that each element of a representation should be semantically meaningful. For example, if we have a picture of a cat, we observe that the picture consists of various meaningful elements, such as the cat type, its orientation, and the position of the light source.
Further, scene interpretation corresponds to processing in the ventral pathway of the brain. It refers to the study of recognizing images of multiple objects using VAEs in an unsupervised manner so as to decompose them into disentangled representations corresponding to each object. In these studies, the models are designed to assume several latent variables corresponding to the objects and generate decomposed images from them (Fig. 4). Eslami et al. (2016) proposed the attend-infer-repeat (AIR) approach, which decomposes the latent variables of each object into what the object is and its location in the image and infers the latent variables of the objects in the image. The AIR approach can recognize and reconstruct each object end-to-end on images of multiple handwritten digits. Kosiorek et al. (2018) extended the AIR approach temporally by introducing latent variables that correspond to objects that have been present since the previous step and those that only just appear in the current step. Furthermore, to properly decompose images with multiple complex objects, an approach was introduced, whereby a mask is employed for each object as a latent variable, which is also inferred in the recognition process (Greff et al., 2019;Burgess et al., 2019;Engelcke et al., 2019).

Value and Reinforcement
The reward and value system is fundamental to the survival of a biological system in a given environment. In the RL theory, a value function is defined as the expected sum of future rewards. Neuroscience studies have revealed several brain areas that play major roles in RL, including the amygdala and basal ganglia. These areas receive strong projections of dopaminergic neurons, which have been demonstrated to encode reward prediction errors (Schultz et al., 1997). The projection from the cortex to the basal ganglia has distinct forms of plasticity, depending on the presynaptic input and postsynaptic spike output, followed by the dopamine input, that is, the dopamine-dependent plasticity (Reynolds et al., 2001;Yagishita et al., 2014;Iino et al., 2020). Owing to these observations, RL models of the cortico-basal ganglia circuit have been proposed (Barto, 1995;Montague et al., 1996;Doya, 2007). A specific hypothesis, which has been supported by neural recording experiments (Samejima et al., 2005;Pasquereau et al., 2007;Lau and Glimcher, 2008;Ito and Doya, 2015), is that the neurons in the basal ganglia learn state and action value functions (Doya, 2000). The circuit of the basal ganglia is composed of multiple pathways starting from the striosome and matrix compartments in the striatum (Yoshizawa et al., 2018) and the direct and indirect pathways downstream (Hikida et al., 2010). Incidentally, recent RL algorithms for robust and efficient performance use multiple types of value functions (Haarnoja et al., 2018a;Wang et al., 2020), possibly hinting at the need for multiple pathways in the basal ganglia.
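To make the notions of a value function and a reward prediction error concrete, the following minimal sketch runs tabular TD(0) learning on a toy three-state chain. The chain, the learning rate, and the discount factor are assumptions chosen for illustration; the sketch is not a model of the basal ganglia circuit. The quantity `delta` plays the role of the reward prediction error that dopaminergic neurons have been reported to encode.

```python
def td0_chain(episodes=200, alpha=0.5, gamma=0.9):
    """Tabular TD(0) on a deterministic chain 0 -> 1 -> 2 (terminal)."""
    # transitions[state] = (next_state, reward); state 2 has no entry (terminal)
    transitions = {0: (1, 0.0), 1: (2, 1.0)}
    value = [0.0, 0.0, 0.0]  # value estimates; the terminal value stays 0
    last_errors = []
    for _ in range(episodes):
        state = 0
        last_errors = []
        while state in transitions:
            next_state, reward = transitions[state]
            # Reward prediction error (TD error)
            delta = reward + gamma * value[next_state] - value[state]
            value[state] += alpha * delta
            last_errors.append(delta)
            state = next_state
    return value, last_errors


value, last_errors = td0_chain()
```

In this toy chain the estimates approach the discounted sums of future rewards, V(1) = 1 and V(0) = 0.9, and the prediction errors shrink toward zero as learning proceeds.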
In addition, effective RL critically depends on the representation of states and actions. The cerebral cortex provides multimodal, hierarchical representations of states and actions through unsupervised representation learning and inference of hidden variables (Doya, 1999). Although backpropagation in a deep Q-network solves the problem of value-oriented representation learning, it is known to be highly data-demanding (Lake et al., 2017). Furthermore, although representation learning in the cortex appears to be unsupervised, experimental observations suggest that learning is modulated by reward or value signals (Bao et al., 2001;Seitz et al., 2009). A recent study that applied variational recurrent neural networks to RL demonstrated that task-critical latent variables can be learned (Han et al., 2020b).
Moreover, to achieve fast learning and fine control, it is important to select the right level of abstraction. The amygdala and the cortico-basal ganglia circuit appear to form a hierarchical RL system. The evolutionarily old amygdala is crucial for immediate actions for primary reward and punishment. The amygdala is composed of a cortex-like lateral part and a basal ganglia-like central part (Cassell et al., 1999), which may be seen as a prototype cortico-basal ganglia circuit. The cortico-basal ganglia circuit is composed of multiple parallel loops: the limbic loop through the ventral striatum, the prefrontal loop through the dorsomedial striatum, and the motor loop through the dorsolateral striatum (Voorn et al., 2004;Ito and Doya, 2015;Balleine et al., 2015). These parallel loops appear to form a hierarchical RL system, spanning different levels of abstraction (Haber et al., 2000;Voorn et al., 2004;Ito and Doya, 2015;Balleine et al., 2015). Determining the right set of action options and their combinations is an active area of research (Bacon et al.;Han et al., 2020a).

Action planning and control
Although model-free RL provides a generic solution to control problems, learning requires many trials, and evidence suggests that humans deploy model-based strategies using action-dependent state transition models or forward models (Wolpert et al., 1998). The classic framework for model-based optimal control is dynamic programming based on the Bellman equation (Bellman, 1952). The similarity between the equations for optimal state inference and optimal control is known as the Kalman duality (Kalman, 1960), which indicates the similarity between the computation of the log posterior in dynamic Bayesian inference and the state value function in optimal control (Todorov, 2008;Levine, 2018;Doya, 2021). The framework is recognized as "planning as inference" (Botvinick and Toussaint, 2012) or "control as inference" (CaI) (Levine, 2018).
Levine (2018) introduced a binary optimality variable that indicates whether the state and action at each time in the MDP are optimal and formulated the reward function as the probability for the optimality variable to take one (Fig. 5). In the CaI framework, we can derive the entropy regularized expected reward objective by performing variational inference for the optimality variable at all times, from which we can derive the soft actor-critic (SAC) (Haarnoja et al., 2018b). In addition, we can derive an iterative planning method based on the inference of the plan, that is, a series of actions, instead of policy optimization, as in the SAC. Okada and Taniguchi (2020b) demonstrated that the difference between various planning methods can be generalized as the choice of the posterior distribution in the inference of optimality variables. This idea was extended to POMDP settings as well (Okada et al., 2020).
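The formulation described above can be summarized as follows. The notation is an assumption of this sketch: s_t and a_t are the state and action, O_t is the binary optimality variable, r is the reward (taken to be non-positive so that its exponential is a valid probability), π is the policy, and H denotes entropy. The second line is the entropy-regularized expected reward objective that results from variational inference when the system dynamics are held fixed:

```latex
\begin{align}
p(O_t = 1 \mid s_t, a_t) &= \exp\big( r(s_t, a_t) \big), \qquad r(s_t, a_t) \le 0, \\
J(\pi) &= \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \pi}
  \Big[ r(s_t, a_t) + \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \Big].
\end{align}
```

Maximizing J(π) rewards high-return behavior while keeping the policy stochastic, which is the objective optimized by maximum-entropy methods such as the SAC.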
In the cerebral cortex, while the posterior half is mostly involved in sensory inference, the anterior half is mostly involved in control and planning. Thus, the CaI framework can provide an answer to the basic question of why common circuit architectures can be used for both inference and control (Doya, 2021). An important difference between the sensory cortex and the motor or frontal cortex, in addition to the thickness of different layers, is that the latter receives inputs from the cerebellum and the basal ganglia through the thalamus. Further, while probabilistic dynamic models and value computation are realized in the cortical circuit, these subcortical circuits may provide useful shortcuts. The cerebellum has been proposed to provide deterministic forward models learned by supervised learning (Wolpert et al., 1998;Doya, 1999), which can supplement probabilistic models in the cortex. Additionally, the basal ganglia can provide learned value functions (Doya, 1999;Daw et al., 2005) to complement online computations of value functions. The network linking the cortex, basal ganglia, and cerebellum is involved in motor learning (Tanaka et al., 2018) and action planning (Fermin et al., 2016b), and the exact roles of these structures are a topic of active research.

Spatial cognition and mapping
In neuroscience, it has long been assumed that the HPF, consisting of the hippocampus and entorhinal cortex, is responsible for functions such as episodic memory, spatial cognition, and response inhibition. Interestingly, memories are transferred to the neocortex through the phenomenon of memory replay and consolidation during sleep.
Thus, HPF has various functional aspects. Among them, it is useful to learn from the brain by focusing on spatial cognitive functions, which play an important role in the navigation of mobile robots, for two reasons. First, the SLAM technology (Thrun et al., 2005;Uchiyama et al., 2017), which combines the functions of self-position estimation and map formation, has been formulated as a PGM; this function can therefore be naturally incorporated as part of the WB-PGM. Second, as described below, there has been extensive neuroscientific research on spatial cognition related to HPF, mainly in rodents.
Further, the hippocampus has long been considered responsible for cognitive maps that support spatial cognition (Tolman, 1948;O'keefe and Nadel, 1978). There have also been epoch-making discoveries of space-encoding cells such as place, border (Solstad et al., 2008;Savelli et al., 2008), head-direction (Taube et al., 1990a), and grid cells. Place cells are neurons in the hippocampus that are active when the animal is at specific locations (O'keefe and Nadel, 1978), while grid cells are neurons in the MEC that are periodically active as rodents and other animals move through space (Hafting et al., 2005).
Essentially, spatial cognitive abilities involve the transformation of self-centered (egocentric) information obtained directly from sensors into a representation of world-centered (allocentric) information. In particular, grid cells contribute to the representation of this world-centered coordinate system. Since the beginning of the 21st century, research on the relationship between the HPF and the posterior parietal lobe, which represents self-centered and world-centered information, has been conducted (Whitlock et al., 2008;Wilber et al., 2014, 2017).
Next, we discuss the evolution of the SLAM technology from a practical perspective. The SLAM-related mathematical theory and implementation have made rapid progress in the last decade. SLAM models can be represented by PGMs based on the POMDP. PGM-based SLAM models are estimated based on Bayes filters such as landmark-based (Montemerlo et al., 2002) and grid-based (Grisetti et al., 2007) FastSLAM.
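The Bayes filter underlying such models alternates a prediction step (motion model) and an update step (observation model). The sketch below is a deliberately simplified illustration: it is a one-dimensional histogram filter for localization on a known cyclic map, not FastSLAM, which additionally estimates the map and uses particles. The map, the motion probability `p_move`, and the sensor accuracy `p_hit` are assumptions made for this example.

```python
def predict(belief, p_move=0.8):
    """Motion model: move one cell to the right with probability p_move,
    otherwise stay in place (cyclic world)."""
    n = len(belief)
    return [p_move * belief[(i - 1) % n] + (1 - p_move) * belief[i]
            for i in range(n)]


def update(belief, world, z, p_hit=0.9):
    """Observation model: the sensor reports the true cell label with
    probability p_hit. Returns the normalized posterior belief."""
    likelihood = [p_hit if cell == z else 1 - p_hit for cell in world]
    unnormalized = [l * b for l, b in zip(likelihood, belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


world = [1, 0, 0, 0, 0]                   # 1 marks a landmark (toy map)
belief = [1.0 / len(world)] * len(world)  # uniform prior over position
belief = update(belief, world, z=1)       # landmark observed
belief = predict(belief)                  # robot attempts to move right
belief = update(belief, world, z=0)       # no landmark observed
```

After this sequence the belief concentrates on the cell immediately to the right of the landmark, which is the egocentric-to-allocentric inference in its simplest form.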
Furthermore, the semantic mapping approach, which includes the meaning of places and objects, has been actively developed as the next direction of SLAM (Kostavelis and Gasteratos, 2015) due to its effectiveness in performing human-robot interaction tasks. Specifically, it is important to appropriately generalize and form place categories while dealing with observation uncertainties. To address these issues, PGMs for spatial concept formation have been constructed (Taniguchi et al., 2016a;Hagiwara et al., 2018;Katsumata et al., 2020). Taniguchi et al. (2017) proposed the spatial concept formation using SLAM (SpCoSLAM), that is, place categorization and mapping through unsupervised online learning from multimodal observation. SpCoSLAM is an integrated PGM composed of SLAM, a Gaussian mixture model, a multimodal Dirichlet process mixture model (MDPM), and speech recognition, as shown in Fig. 6. Katsumata et al. (2020) successfully transferred global spatial knowledge related to multiple environments to a new environment by integrating the spatial concept model with generative adversarial networks (GANs). This approach to spatial concept formation was also adopted for tasks in the World Robot Summit (El Hafi et al., 2020;Taniguchi et al., 2021b). We consider the above models as candidates for a cognitive module with functions similar to the HPF. Details are discussed in Taniguchi et al. (2021a).
Figure 6: Graphical model of SpCoSLAM, which has self-position, environmental map, position distribution, multimodal place categories, word sequences, and language models as latent variables. The blue, red, green, and orange areas represent SLAM, the position distribution, the multimodal place categorization, and speech recognition and word segmentation, respectively. Using a Rao-Blackwellized particle filter procedure (Doucet et al., 2000), SpCoSLAM can infer these model parameters and latent variables.

Social interactions and inference
Social behavior is an individual's behavior that arises from the interaction with others under certain circumstances. Specifically, humans have a special characteristic in their prosocial behavior, that is, the tendency to benefit others by sacrificing their own profit. It is assumed that human prosocial behavior is calculated in the cortical model-based process (Knoch et al., 2006). However, Rand et al. (2012) demonstrated that human subjects tend to exhibit more prosocial behavior intuitively while being more selfish in deliberation. Furthermore, Fermin et al. (2016a) indicated that the subcortical areas, particularly the amygdala, play an important role in prosocial behavior, whereas cortical areas, specifically the prefrontal cortex, are critical for pro-self behavior. Yamagishi et al. (2017) suggested that the prosocial bias, which was related to the dependency on model-free or model-based processes, varied across individuals. Thus, it can be conjectured that certain social habits are acquired under stereotyped stimulus-response circumstances through RL of model-free systems, whereas others are mediated by goal-directed calculation of model-based systems. However, many neural mechanisms underlying social behavior remain unclear.
Except for some aspects that will be subsequently explored, social communication is a broad concept that has not been sufficiently investigated from the angle of PGMs.
Nevertheless, as social communication involves estimating and understanding the intention of others, intent estimation can be modeled as an inference of the latent variables of others to predict their behavior. In other words, because estimating a person's intention allows us to predict their behavior more accurately, intention estimation can be modeled as a prediction problem to some extent. In the early days following the postulation of this idea, Wolpert et al. (2003) argued that intention estimation can be modeled using multiple forward-inverse models. However, the discussion can be reinterpreted with PGMs in a more sophisticated manner. Intention is regarded as a latent variable within a system, that is, the inference of the latent variable can be regarded as intention estimation.
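The idea that intention estimation is the inference of a latent variable can be sketched with a small discrete example. Everything in the sketch is hypothetical: the two intentions, the action names, the probabilities, and the assumption that observed actions are conditionally independent given the intention are chosen only to illustrate Bayes' rule and the resulting behavior prediction.

```python
def infer_intention(prior, action_model, observed_actions):
    """Posterior over a latent intention given observed actions.

    Assumes actions are conditionally independent given the intention.
    """
    posterior = dict(prior)
    for action in observed_actions:
        posterior = {i: p * action_model[i][action]
                     for i, p in posterior.items()}
        total = sum(posterior.values())
        posterior = {i: p / total for i, p in posterior.items()}
    return posterior


def predict_next_action(posterior, action_model):
    """Predictive distribution over the next action, marginalizing the
    latent intention."""
    actions = next(iter(action_model.values())).keys()
    return {a: sum(posterior[i] * action_model[i][a] for i in posterior)
            for a in actions}


# Hypothetical toy model: p(action | intention)
prior = {"get_coffee": 0.5, "leave_room": 0.5}
action_model = {
    "get_coffee": {"walk_to_kitchen": 0.7, "walk_to_door": 0.3},
    "leave_room": {"walk_to_kitchen": 0.2, "walk_to_door": 0.8},
}

posterior = infer_intention(prior, action_model, ["walk_to_kitchen"])
prediction = predict_next_action(posterior, action_model)
```

Observing a single step toward the kitchen raises the posterior probability of the "get_coffee" intention to 7/9, and the prediction of the next action shifts accordingly.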
However, if a person attempts to model each person's behavior by assuming latent variables, the complexity of the model increases. For example, when we play football, we apparently do not predict the players' behavior one by one. Thus, PGMs that enable a cognitive system to predict the behaviors of a group of people and conduct a cooperative task are required.

Speech recognition and language
Language is a representative fruit of the evolution of human species. It has enabled us to transfer knowledge and form communities and societies through communicating complex information by combining linguistic symbols using our vocalization system. An important aspect of the human language is that the symbolic system is not directly encoded in the biological gene (genome); rather, it is encoded in a cultural gene (meme). Knowledge is transferred using language. Further, language has non-biological and non-physiological effects on the development and function of the brain (Deacon, 1998).
In particular, speech recognition and generation are fundamental parts of our spoken language. The motor theory of speech perception, which is well-known and widely debated in cognitive and brain sciences, argues that we use generative models of speech signals per utterance (Liberman et al., 1967;Galantucci et al., 2006;Laurent et al., 2017). In other words, it claims that the objects of speech perception are the speakers' vocal tract gestures.
According to this brain theory, a PGM-based approach is suitable for speech recognition and language.
Before deep learning-based approaches became dominant, speech recognition and synthesis were conventionally studied using PGMs (Zhang et al., 2017). The HMM, a type of PGM for time-series data, has been widely used for speech recognition systems (Rabiner and Juang, 1986;Lee and Kawahara, 2009). Generally, a speech recognition system is composed of an acoustic model and a language model. The acoustic model represents the acoustic features of speech signals for phonemes. It is a generative model of acoustic features. In contrast, the language model represents the generative process of word sequences. A word corresponds to a sequence of phonemes. This two-layer hierarchy for speech generation is typically applicable to spoken language and is termed "double articulation." In particular, the hierarchical Dirichlet process-hidden language model (HDP-HLM) is a total generative model that involves language and acoustic models within a unified PGM (Fig. 7). Taniguchi et al. (2016c) proposed HDP-HLM and derived a blocked Gibbs sampler for the PGM. It was demonstrated that the machine learning-based method can perform simultaneous phoneme and word discovery from only speech signals.
The syntactic nature of language has also been studied from the viewpoint of generative models for a long time. Probabilistic models for generative grammar, such as probabilistic context-free grammar and combinatory categorial grammar, assume that there exists a latent tree structure behind a sentence, that is, a word sequence (Liang et al., 2007;Bisk and Hockenmaier, 2013). Thereby, inferring the latent structure corresponds to parsing in syntactic analysis.
Recently, a neural network-based approach to natural language processing has become dominant, with many GAN- and VAE-based methods for speech signal processing developed (van Niekerk et al., 2020;Kameoka et al., 2018). Such DPGMs involving neural networks can exploit the advantages of deep learning for speech signal processing. Although HMM-like PGMs became less popular in the late 2010s, this does not mean that a PGM-based approach is not valid, as the mathematical framework of PGMs includes DPGMs as well. Specifically, to leverage non-annotated data, unsupervised learning methods based on PGMs and self-supervised learning methods are promising. Self-supervised learning methods for language, for example, BERT and GPT-3, exhibit remarkable performance (Devlin et al., 2018;Brown et al., 2020). In addition, when considering the integrative cognitive model involving not only speech recognition but also spoken language acquisition (e.g., phoneme and word discovery), the PGM-based approach is still promising.
Figure 7: Graphical model of HDP-HLM (Taniguchi et al., 2016c), which has language, word, and acoustic models as latent variables. Using a blocked Gibbs sampling procedure, HDP-HLM can infer these models and latent sequences of words (i.e., latent words) and phonemes (i.e., latent letters).
Nevertheless, developing a cognitive architecture to achieve such linguistic communication remains a difficult challenge in brain and cognitive science, AI, and robotics (Taniguchi et al., 2019). For example, a service robot is required to understand human utterances whose meaning is not always explicitly determined. If a speaker uses an implicature, such as "it is too hot," he or she may actually mean "please turn on the air conditioner" or "please wait a while until the coffee becomes less hot." In pragmatics, the meaning of an utterance is considered to be context-dependent. The context could involve a speaker, place, situation, and habit. Such information is not encoded by the utterance itself. Therefore, the assumption that an utterance and its meaning can be regarded as the input and output of a function, respectively, is not plausible, as contextual information is provided by the multimodal sensor information and a history of interactions. This suggests that language understanding inevitably involves latent variables, and PGMs can model the cognitive system that facilitates linguistic communication. Nevertheless, the social aspects of language are not limited to what we have described above. Although language is a social phenomenon, it is clearly realized by human cognitive systems.
In summary, the cognitive architecture that enables social interaction (see Section 3.5) and language processing (see Section 3.6), which are described in this section, requires a complex coordination of cognition and behavior. Such higher abilities are beneficial to the brain for learning. Given this situation, if all computational elements can be unified into PGMs, the efficiency of design and implementation is expected to improve.

Integration of cognitive modules
The WB-PGM is developed by integrating a wide range of elemental cognitive modules. This section describes the PGM-based approach to integrating cognitive modules. First, we revisit machine learning-based methods for world models that involve representation learning and the integration of multimodal information, including action and sensation, using PGMs. The FEP, which provides a unified view of biological perception and behavior based on this PGM-based world model, is also introduced. Finally, we introduce the SERKET framework, which allows us to develop an integrated PGM-based cognitive system, that is, WB-PGM, by combining elemental PGM-based cognitive modules.

World Models and FEP
Humans can construct a mental model of the world by recognizing and learning from information obtained from the external world in a self-supervised manner. This model, which represents our hypothesis about the world, can be used as a simulator to predict the unknown or the future based on current observations. Furthermore, by incorporating our own behavior into this model, we can predict the future in the long term. A framework that realizes these human functions in machine learning is called a world model (Ha and Schmidhuber, 2018; Hafner et al., 2019b,a, 2020). The key to realizing a world model is to compress the vast and multimodal information of the external world in a spatio-temporal manner to obtain its latent representations. If such representations can be appropriately acquired, future predictions can be made easily by transitioning between representations along the time direction. In addition, by assuming the structure of the world as an inductive bias for the model before learning, we can acquire a world model with higher generalizability more efficiently. PGMs are good tools for designing a world model that meets these requirements. Specifically, in the case of a time-evolving environment, POMDP models are often assumed because input stimuli are considered to be partial observations of the environment. The generative process of observations designed by PGMs corresponds to predictions, with the inference of representations from those observations regarded as perceptions from the external world. In the interarea signal transmission of the neocortex, the pathway responsible for prediction is called feedback. It is involved in the generative process in the PGM. In contrast, the perceptual pathway is called feed-forward and is involved in the amortized inference process in PGMs (see Table 1).
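As a minimal sketch of the roles described above, the following toy example treats a world model as a linear-Gaussian state-space model: a latent state transitions under an action, observations are generated from the latent state (prediction), and a latent state is recovered from an observation (perception). The class name `ToyWorldModel`, its methods, and the linear-Gaussian form are illustrative assumptions, not the models of the cited works, which learn nonlinear transitions and inference networks.

```python
import numpy as np

class ToyWorldModel:
    """Hypothetical linear-Gaussian POMDP-style world model (illustration only)."""

    def __init__(self, latent_dim=2, action_dim=1, obs_dim=4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.A = 0.9 * np.eye(latent_dim)                         # latent transition
        self.B = self.rng.normal(size=(latent_dim, action_dim))   # effect of actions
        self.C = self.rng.normal(size=(obs_dim, latent_dim))      # generative map z -> o
        self.noise = 0.01

    def transition(self, z, a):
        """Sample z_t from p(z_t | z_{t-1}, a_{t-1})."""
        return self.A @ z + self.B @ a + self.noise * self.rng.normal(size=z.shape)

    def generate(self, z):
        """Mean of p(o_t | z_t): the top-down prediction of an observation."""
        return self.C @ z

    def infer(self, o):
        """Crude stand-in for amortized inference q(z_t | o_t): least squares."""
        return np.linalg.pinv(self.C) @ o

    def imagine(self, z0, actions):
        """Roll the latent state forward and predict observations without sensing."""
        z, preds = z0, []
        for a in actions:
            z = self.transition(z, a)
            preds.append(self.generate(z))
        return np.stack(preds)

model = ToyWorldModel()
o0 = model.generate(np.array([1.0, -1.0]))           # a noise-free first observation
z0 = model.infer(o0)                                 # perception: observation -> latent
future = model.imagine(z0, [np.array([0.5])] * 5)    # prediction: latent -> observations
print(future.shape)
```

The point of the sketch is only the division of labor: `generate` and `transition` belong to the generative process, whereas `infer` plays the part of the inference process.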
The world model has been studied as a model of the environment in model-based RL. However, early attempts to "make the world differentiable" using recurrent neural networks (RNNs) (Schmidhuber, 1990) were not applied to large-scale environments because they were not equipped with representation learning techniques. Ha and Schmidhuber (2018) introduced an RNN-based model combined with VAEs to enable spatial abstraction, and trained it on trajectories, that is, time series of images and actions, of complex game environments. They showed that agents trained by RL only within the world model behaved appropriately in the real game environments. Therefore, with good representation learning, we can obtain a model of the world that has sufficient predictive power, and training within it resembles mental imagery training in humans.
Recently, PlaNet (Hafner et al., 2019b) and Dreamer (Hafner et al., 2019a, 2020), which are model-based RL methods with more sophisticated deep generative models based on POMDPs, have shown high sample efficiency and performance in long-term control tasks based on a series of images in the environment. Moreover, the uncertainty in the latent state representation of these models conveys the model's beliefs about the world; thus, the amount of information gleaned by the model from the environment is directly correlated with its level of certainty. Gregor et al. (2019) introduced a decoder based on this world model on a two-dimensional map and demonstrated that although the generated map is blurry in the early stages of exploration in an unknown environment, it becomes more accurate with experience.
Further, the FEP is a promising theory that provides a unified view of biological perception and behavior based on a PGM-based world model. Von Helmholtz (1867) hypothesized that perception is inference in an internal model. Subsequently, the Bayesian brain hypothesis and predictive coding were proposed, in which the brain works to minimize prediction errors or "surprises." Friston (2005) and Friston et al. (2006) extended these models with insights from statistical machine learning and thermodynamics, arguing that decision-making, as well as perception, is unified within a framework of variational inference or free energy minimization.
In addition, Friston et al. (2010, 2015) proposed active inference, in which the organism selects a sequence of actions to minimize the expected free energy, that is, the expected value of the free energy with respect to an observation, considering the observation as a latent variable, i.e., an unknown to be obtained in the future. This framework is similar to that of CaI (Levine, 2018) in that it considers decision-making as inference. However, they differ in the way they consider "preferences" for states and actions (Millidge et al., 2020). While CaI introduces preference as a new random variable (i.e., optimality) into PGMs, active inference considers that preference is encoded as the bias of PGMs. Under this assumption, the terms corresponding to the extrinsic and intrinsic values are derived from the expected free energy. The extrinsic value refers to the value of exploitation, which encourages agents to be goal-oriented, whereas the intrinsic value refers to the value of exploration, which encourages agents to act to obtain novel observations. Therefore, based on only the expected free energy of active inference, we can derive the term corresponding to the trade-off between exploitation and exploration in RL.
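To make the two terms concrete, the sketch below evaluates one commonly used one-step form of the expected free energy for a small discrete model, G = -(intrinsic value) - (extrinsic value), where the intrinsic value is the expected information gain about the hidden state and the extrinsic value is the expected log-preference of the predicted observations. The function name `expected_free_energy`, the two-state example, and the preference vector are assumptions made for illustration; this is not the formulation of any specific cited paper.

```python
import numpy as np

def expected_free_energy(q_s, A, log_pref):
    """One-step expected free energy for a discrete model (illustrative form).

    q_s      : predicted state distribution q(s | policy), shape (S,)
    A        : likelihood p(o | s), shape (O, S), each column sums to 1
    log_pref : log of the preferred observation distribution, shape (O,)
    """
    eps = 1e-12
    joint = A * q_s                      # q(o, s) = p(o | s) q(s)
    q_o = joint.sum(axis=1)              # predicted observation distribution q(o)
    # Intrinsic (epistemic) value: mutual information between s and o,
    # i.e., the expected KL divergence from q(s) to the posterior q(s | o).
    intrinsic = np.sum(joint * (np.log(joint + eps)
                                - np.log(q_o[:, None] + eps)
                                - np.log(q_s[None, :] + eps)))
    # Extrinsic value: expected log-preference of the predicted observations.
    extrinsic = np.sum(q_o * log_pref)
    return -intrinsic - extrinsic, intrinsic, extrinsic

log_pref = np.log(np.array([0.9, 0.1]))      # hypothetical preference over observations
q_s = np.array([0.5, 0.5])
informative = np.eye(2)                      # observations reveal the hidden state
ambiguous = np.full((2, 2), 0.5)             # observations carry no information

G_inf, int_inf, ext_inf = expected_free_energy(q_s, informative, log_pref)
G_amb, int_amb, ext_amb = expected_free_energy(q_s, ambiguous, log_pref)
print(G_inf < G_amb)
```

In this example both policies predict the same observations, so the extrinsic values coincide and the policy with informative observations is preferred purely through the intrinsic (exploration) term.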

SERKET and Multimodal Integration
Similar to the human brain, a robot's cognitive system must integrate many types of sensory-motor information, find relationships between them, and utilize the organized internal representations. That is, the cognitive system must be very complex. MMLDA was proposed by combining many PGMs of MLDA (Fadlil et al., 2013). Specifically, MLDA was combined with the nested Pitman-Yor language model (NPYLM), thus obtaining MLDA+NPYLM, to achieve unsupervised word segmentation using multimodal object category information formed by the MLDA (Nakamura et al., 2014). SpCoSLAM integrates PGMs for multimodal categorization, SLAM, and an automatic speech recognition system, which involves the word discovery capability. Thus, it achieves multimodal categorization and lexical acquisition about places. By maximizing the trajectory probability based on the CaI framework, path planning based on semantic information can be conducted using PGMs, that is, SpCoSLAM. This method is called SpCoNavi. These results clearly demonstrate that the PGM-based approach has the flexibility of integrating a broad range of cognitive modules and the capability to make them learn together.
The benefit of cognitive architectures integrating multimodal sensory-motor information is evident. First, when a unimodal signal does not have sufficient information and suffers from noise or uncertainty, the system can find latent structures (e.g., object categories and word units) using multimodal sensory-motor data (Taniguchi et al., 2018a; Nakamura et al., 2012, 2014). Second, many cognitive functions, including localization using images and utterances, language understanding, and action planning, can be realized as cross-modal inference within such a multimodal cognitive architecture (Fadlil et al., 2013; Taniguchi et al., 2017, 2020b). Finally, owing to the second feature, we can construct multi-purpose and system-oriented cognitive architectures rather than a single-purpose or goal-oriented function. Nevertheless, when attempting to develop a large-scale cognitive system involving a wide range of cognitive functions of the human brain, developing efficient large-scale computational models based on PGMs becomes a critical problem. The use of a probabilistic programming language (PPL) is a possible solution. Many types of PPLs that enable the efficient development of PGMs have been proposed (Goodman et al., 2012; Tran et al., 2017; Bingham et al., 2019; Sato and Kameya, 1997). However, when developing a WB-PGM using only a PPL, it is necessary to re-implement each cognitive module using the PPL, which may cause a reusability problem from the viewpoint of software development. Thus, we need a development framework that allows us to use elemental PGMs developed in a heterogeneous and distributed manner.
SERKET is a framework for integrating PGM-based cognitive modules and developing cognitive architectures involving multiple cognitive functions. Although building integrative cognitive systems using PGMs is a promising approach, designing an inference procedure anew for each developed PGM is highly demanding. Further, modern integrative cognitive models for multimodal concept formation and language acquisition by robots (e.g., MMLDA and SpCoSLAM) involve many nodes, and variables and inference procedures have been proposed separately for each model. However, we found that most of such integrative PGMs can be regarded as composites of several elemental cognitive modules, that is, PGMs, whose inference procedures have already been developed. SERKET provides a protocol with which the inference procedure of an integrative PGM is divided into inference procedures for each elemental PGM and communication among them. In addition, Neuro-SERKET is a natural extension of SERKET (Taniguchi et al., 2020c): whereas SERKET does not support neural network-based PGMs, that is, DPGMs, Neuro-SERKET supports them. Further, Neuro-SERKET allows elemental cognitive modules to learn in a heterogeneous manner, that is, each module can use a different learning method. For example, two modules trained using Gibbs sampling and variational inference can be integrated. Practically, this flexibility helps people develop an integrative cognitive system using pre-existing and distributed cognitive modules. In the deep learning-based approach, differentiability is required throughout the system. In this sense, each cognitive module must be homogeneous from the viewpoint of optimization. In contrast, the SERKET framework allows us to use heterogeneous cognitive modules, that is, the learning and inference processes of each module can be encapsulated. This characteristic can improve the reusability of the elemental cognitive modules in a practical sense. Fig. 8 shows an overview of the SERKET framework.
Generally, PGMs have three types of connections: (a) head-to-tail, (b) tail-to-tail, and (c) head-to-head. A complex PGM can be decomposed into two modules at the shared node (i.e., z in Fig. 8). In the inference phase, the internal variables in each elemental PGM can be updated independently, and the shared node can be updated by exchanging probabilistic information (i.e., posterior distributions conditioned by observations). The right side of Fig. 8 shows a possible decomposition of SpCoSLAM (Section 3.4). The total PGM of SpCoSLAM can be decomposed into SLAM, GMM, MDPM, and an automatic speech recognition system. Communication (message passing) between elemental cognitive modules enables the whole cognitive model composed of many elemental modules, that is, the proposed WB-PGM composed through the SERKET framework, to be trained throughout the integrative system. For more details, please refer to the original papers (Taniguchi et al., 2020c).
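The decomposition at a shared node can be illustrated with a deliberately simplified sketch of a tail-to-tail connection, in which a categorical node z generates one observation in each of two modules. Each module keeps its own likelihood model private and only sends a message about z, and the shared node is updated from the prior and the messages. The class names, the `message` method, and the numbers are hypothetical; this is not the actual SERKET protocol or API, which is defined in the original papers.

```python
import numpy as np

class GaussianModule:
    """Hypothetical elemental module: p(x1 | z) is a unit-variance Gaussian."""
    def __init__(self, means):
        self.means = np.asarray(means, dtype=float)   # one mean per value of z

    def message(self, x):
        # Likelihood of the observation under each value of the shared node z.
        return np.exp(-0.5 * (x - self.means) ** 2) / np.sqrt(2 * np.pi)

class CategoricalModule:
    """Hypothetical elemental module: p(x2 | z) is a categorical table."""
    def __init__(self, table):
        self.table = np.asarray(table, dtype=float)   # shape (K, n_symbols)

    def message(self, x):
        return self.table[:, x]

def update_shared_node(prior, messages):
    """Combine the prior on z with one message per module and normalize."""
    post = prior * np.prod(messages, axis=0)
    return post / post.sum()

prior = np.array([0.5, 0.5])
vision = GaussianModule(means=[0.0, 3.0])
speech = CategoricalModule(table=[[0.8, 0.2], [0.3, 0.7]])

msgs = [vision.message(2.5), speech.message(1)]
posterior = update_shared_node(prior, msgs)
print(posterior)
```

Because the two observations are conditionally independent given z, the product of the messages yields the same posterior as inference in the undivided model, which is why the internals of each module can remain encapsulated.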
From the viewpoint of the practical implementation of the SERKET framework, there is the question of how to realize the communication protocol. Using a PPL or a library, for example, Pixyz (Suzuki et al., 2021), is a promising approach. In addition, SERKET can reduce developmental efforts to create an integrative cognitive system. For example, Taniguchi et al. (2020c) showed that an unsupervised machine learning system that categorizes raw image data and speech signals simultaneously can be developed quickly and efficiently. In particular, using the Neuro-SERKET framework, a complex learning system can be developed by connecting pre-existing modules (e.g., VAE, Gaussian mixture model, LDA, and automatic speech recognition system). However, there is a huge degree of freedom in designing an integrative cognitive architecture, namely in choosing elemental cognitive modules, connecting them, and enabling them to work and learn together. This is a design problem of cognitive architectures. Since the human brain is an excellent example of an integrative cognitive system that can work in a real-world environment and perform a wide range of complex tasks, learning the human brain architecture is a good approach to reduce the complexity of the design problem. Thus, the WB-PGM should be interpreted from the viewpoint of the brain architecture. Such interpretability allows researchers to identify what is missing and what is implemented. It also facilitates communication between AI and robotics researchers and brain and cognitive science researchers.
Figure 8: Three types of connection of PGMs and their decomposition in SERKET framework (left). In the development phase, each PGM was developed in a distributed manner. In the inference phase, the modules work together by exchanging probabilistic information as an integrated cognitive system. For example, SpCoSLAM (Section 3.4) can be decomposed into four elemental modules (right).

Current status of WB-PGM
This section describes the current status of the development of WB-PGM including elemental modules and integration. Future issues are also discussed.
As mentioned in Sections 2.3 and 2.4, to create a PGM of the entire brain, it is crucial to construct a BIF of the entire brain, design the corresponding HCDs, and then run GIPA to create PGMs. For hippocampal formation, data have been created for the PGM (Taniguchi et al., 2021a). For the interconnections between the neocortex, thalamus, and basal ganglia, we plan to proceed with BIF and HCD data creation based on the existing research results (Yamakawa, 2020). For eye movement, the claustrum, basal ganglia, and cerebellum, we are currently constructing a BIF. To cover the entire brain, other brain regions, including the amygdala, midbrain, and pons, need to be designed as well. Because the BIF describes physical structure, it is expected to converge to stable content once sufficient knowledge of neuroscience is accumulated. Depending on the diversity of human cognitive functions, multiple HCDs will be described on a specific region of BIF data; thus, these HCDs should be integrated as far as possible prior to creating PGMs.

Primitive structure in WB-PGM
The WB-PGM has a primitive function structure, similar to the basic layered structure of the cerebral cortex. As mentioned above, the WB-PGM uses a PGM as a primitive structure (i.e., a cognitive module), which corresponds to the "circuit" in the BIF. From a computational viewpoint, we consider that PGMs can be classified into three generations.
The first generation is a PGM described in the probabilistic generative process, which is the most basic structure. Model learning is a parameter estimation problem for probability distributions and is realized using Gibbs sampling or variational inference (Nakamura et al., 2007).
The second generation is a DPGM represented by VAE, which replaces the parameter estimation of the probability distributions with learning through neural networks. Learning is realized by maximizing the evidence lower bound using a neural network (Kingma and Welling, 2014;Suzuki et al., 2016).
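The evidence lower bound (ELBO) mentioned above can be checked numerically in a toy model where everything is available in closed form. The sketch assumes a prior p(z) = N(0, 1), a likelihood p(x | z) = N(z, 1), and an approximate posterior q(z | x) = N(m, s2); in a VAE the same quantity is maximized with neural networks producing the distribution parameters. The model and the function names `elbo` and `log_evidence` are illustrative assumptions.

```python
import numpy as np

def elbo(x, m, s2):
    """Closed-form ELBO for p(z)=N(0,1), p(x|z)=N(z,1), q(z|x)=N(m, s2)."""
    # Expected log-likelihood E_q[log p(x | z)], using E_q[(x - z)^2] = (x - m)^2 + s2.
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL divergence from the approximate posterior to the prior, KL(q(z|x) || p(z)).
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return expected_loglik - kl

def log_evidence(x):
    """Exact log p(x): marginally x ~ N(0, 2) in this toy model."""
    return -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0

x = 1.3
loose = elbo(x, m=0.0, s2=1.0)        # q equal to the prior: a loose bound
tight = elbo(x, m=x / 2, s2=0.5)      # q equal to the exact posterior N(x/2, 1/2)
print(loose <= log_evidence(x), np.isclose(tight, log_evidence(x)))
```

The bound is loose when q is far from the true posterior and becomes tight when q matches it, which is the sense in which maximizing the ELBO performs approximate inference and learning at once.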
As for the third generation, there is a structure that uses a self-attention mechanism and self-supervised learning. The self-attention mechanism has attracted attention owing to its extremely high performance in natural language processing (NLP) techniques, such as transformer, BERT, and GPT (Vaswani et al., 2017;Devlin et al., 2018;Brown et al., 2020). The application of BERT to multimodal expansion and RL is advancing (Miyazawa et al., 2020). In particular, multimodal BERT may become the basic module of the third generation. Contrastive learning, which is a representative approach of self-supervised learning, enables neural networks to perform representation learning without supervision or explicit definition of the generative process. This approach is used in a variety of tasks, including visual recognition and RL (Chen et al., 2020;Srinivas et al., 2020;Okada and Taniguchi, 2020a).
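As a rough illustration of contrastive learning, the sketch below computes an InfoNCE-style loss on a batch of embedding pairs: each anchor should be more similar to its own positive than to the other items in the batch. The function name `info_nce`, the temperature value, and the random embeddings are assumptions for illustration; the cited works differ in encoders, augmentations, and details of the loss.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `positives` is the positive for row i of
    `anchors`; all other rows in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                         # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                          # embeddings of one "view"
z_aug = z + 0.01 * rng.normal(size=z.shape)           # slightly perturbed second view

loss_matched = info_nce(z, z_aug)                     # positives correctly paired
loss_mismatched = info_nce(z, np.roll(z_aug, 1, axis=0))   # pairs deliberately misaligned
print(loss_matched < loss_mismatched)
```

No generative process is specified anywhere in this loss, which is the sense in which the representation is learned without an explicit definition of the generative process.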

Connection of the building blocks
How is the overall structure constructed by combining primitives (i.e., cognitive modules)? By integrating the modules, a path is opened toward the construction of a large-scale representation learning system that imitates the cerebral cortex. In fact, an integrated system has been achieved by Miyazawa et al. (2019b); however, it is not yet sufficiently large. In that system, modules for vision, speech, and language are incorporated. Furthermore, an RL module and a temporal module for long-term temporal planning are connected via the latent space held by the representation learning module. To connect building blocks, the SERKET framework is used as described in Section 4 (Taniguchi et al., 2020c).
When connecting building blocks to configure the entire structure, time management poses a challenge. To handle information flexibly, it is important to consider the physical characteristics of each signal and the time scale, according to how it is used. Since the brain processes information on a multi-scale and asynchronous basis, it is necessary to realize such an integration mechanism. Recently, Taniguchi et al. (2021a) abstracted a neural activity phenomenon called phase precession into a mechanism called discrete-event queues when constructing the PGM of the HPF. This type of brain-inspired research could help us handle time in PGMs.
In the current prototype of WB-PGM (Miyazawa et al., 2019b), the latent variable for the perception level (i.e., lower level) is integrated by a latent variable at the higher level. Furthermore, the model provides a simple mechanism for executing long-term action plans by modeling the temporal relationship on latent variables at the higher level.

Implementation
Implementation of the WB-PGM is ongoing. Miyazawa et al. (2019b) integrated a representation learning module, a language learning module, and an RL module using Neuro-SERKET, and demonstrated the possibility of a real robot learning language and behavior simultaneously (Fig. 9). The example was developed using the Neuro-SERKET framework (Taniguchi et al., 2020c). As mentioned above, Neuro-SERKET makes it easy to implement complex structures and optimize the entire structure by combining modules. Note that (Neuro-)SERKET allows an integrated cognitive system to be trained as a whole (as mentioned in Section 4.2). This is because SERKET decomposes the Bayesian inference of every latent variable of the integrated cognitive system into intra-module probabilistic inference (i.e., training) and inter-module communication (i.e., message passing). In addition, it can incorporate an existing speech recognition module and deep learning models.
Despite the success of the implementation, we are at a preliminary stage of the development of the WB-PGM. Ideally, the WB-PGM should involve whole-brain cognitive modules and functions; however, current examples of PGM-based integrative cognitive architecture only involve very limited brain functions.
When considering the realization of the WB-PGM, there are problems to be considered in relation to brain functions, such as the range of influence and scheduling of inference based on submodules and hierarchy. Thus, learning human brain architecture can be beneficial in solving these problems as well.

Future perspectives
We have described the current status of the WB-PGM in the previous subsection. However, numerous aspects must be considered.
As described in Section 3, a wide range of cognitive modules have been developed for functions corresponding to brain regions, which can be used to develop an integrative cognitive architecture. However, the exploration of the selection and integration of modules remains a future challenge. Furthermore, the current Neuro-SERKET framework does not support third-generation PGMs (i.e., cognitive modules based on self-supervised learning explained in Section 5.1). Thus, future research should extend Neuro-SERKET to a framework that can integrate all types of PGMs. Regarding the time management discussed in Section 5.2, it is important to study technology through which various modules can process information at different time intervals, asynchronously and non-linearly, and be integrated. One approach would be to incorporate an integrated execution platform (e.g., BriCA) (Takahashi et al., 2015), which factors in the asynchronous nature of the brain, into PGMs.
Through fidelity evaluation (see Section 2.5), the precision of the correspondence between the WB-PGM and BRA should be increased because the resolution of the correspondence between the current brain structure and WB-PGM is still coarse. It is crucial to consider the problem of dividing the area, such as how to manage time in the model and how to infer at a particular time. Further, considering missing elemental modules is also crucial.
In the future, it is expected that BRA-driven development specialized for PGMs will bring about refinements in the evaluation of the implemented brain-inspired software. The GIPA process was introduced to convert and build HCD into PGM, as shown in Figure 2 in Section 2.5. However, PGMs are not considered in fidelity evaluation, which estimates the conformance of the software to HCD. Therefore, it would be useful to additionally evaluate the fidelity of software to the PGM, the fidelity of the PGM to the HCD, and the fidelity of software implemented in the PGM to the HCD.
It is also important to introduce a developmental perspective. For that purpose, it will be necessary to construct a physical body that can constantly work and learn and even grow. In this respect, the use of soft robots is promising and may lead to realizing lifelong learning. This leads us to suggest endowing robots with the following capabilities:
1. learning of causality: generalizability, sample efficiency
2. emotions (maintaining its own body): from self/other to sociality
3. creativity: social value, imagination
4. explainability: communication
5. consciousness: global workspace, meta cognition, qualia
In our future work, we will develop the WB-PGM by addressing these issues.

Conclusions
In this study, we proposed WB-PGM, an approach to develop an integrative cognitive architecture for developmental robots based on brain-inspired PGMs. PGMs and their inferences can learn knowledge from sensory-motor observations without manually crafted annotation/label data. Unlike most modern AI systems, biological cognitive systems, especially the human brain, can acquire a wide range of cognitive capabilities without supervision. We argue that a PGM-based approach is promising for the development of an integrative cognitive architecture. Although previous PGM-based integrative cognitive systems for developmental robots have been proposed, their cognitive capabilities were limited. Furthermore, elemental cognitive modules were introduced or discussed in relation to PGMs. Subsequently, we hypothetically described a prototype of integrative cognitive architecture.
Building a WB-PGM has two advantages. First, it can serve as a reference for brain studies. The PGM describes explicit informational relationships between variables, that is, internal representations. This description provides interpretable guidance from computational sciences to brain science. By providing such information, researchers in neuroscience can provide feedback to researchers in AI and robotics on what the current models lack with reference to the brain. Second, our WB-PGM approach can facilitate discussion and collaboration among researchers in neuro-cognitive sciences as well as AI and robotics.
We acknowledge that world models and the FEP are general and critical ideas for developing next-generation AI, which has to be integrative, autonomous, and developmental. However, current studies related to these theoretical ideas are mostly limited to simple problems and experiments in simulation environments. To study cognition, it is crucial to test hypothetical ideas in a real-world environment. Developing robots and making them perform practical tasks in real-world environments are important processes for the exploration of next-generation AI. To this end, top-down engineering of cognitive architecture is required. A theory-based practical implementation of artificial cognitive systems in robots is crucial.
We argue that referring to the WBA, that is, the structure of the human brain, is beneficial for developing a cognitive architecture. However, naturally, it may be argued that such brain structures should emerge from data learned by a large neural network. Current success in very large language models, for example, GPT-3 and BERT, seems to corroborate this idea. We do not have an answer to this question; however, such a general approach is not feasible for realizing AGI. Given the current technological background, the WB-PGM is a more promising approach to develop a cognitive architecture for a developmental robot. In addition, even if we can develop a meta-cognitive system that can facilitate the emergence of whole brain structures, the BRA and WB-PGM will be a good reference for evaluating the emergent structure.
The emphasis in this paper was on PGMs, which are basically directed graphs. Another type of probabilistic graphical model is the Markov network, which is an undirected graph. We selected PGMs for two reasons. First, latent variable models based on PGMs, for example, LDA, HMM, and VAE, have been successful at modeling cognitive systems and exhibit good properties from engineering and practical viewpoints. Second, in generative models, arrows represent predictions, and this characteristic fits the idea of predictive coding. However, this is not a strict argument. It is acceptable to partially introduce a Markov network into the proposed WB-PGM.
In addition, it may be argued that the human brain has a recursive structure and bilateral connections. However, in a probabilistic graphical model representing a PGM, a directed arrow connecting two variables does not imply the absence of an inverse connection between two nodes, as the arrows represent the generative process and not the physical connections. In the inference process, an inverse information stream should be considered. This point is clearly underscored by amortized inferences connecting nodes inversely in inference networks. In this sense, when we compare probabilistic graphical models and the neural connections in the human brain, it is necessary to differentiate generative processes from inference processes.
Taniguchi, A., Taniguchi, T., Inamura, T., 2016a. Spatial concept acquisition for a mobile robot that integrates self-localization and unsupervised word discovery from spoken sentences. IEEE Transactions on Cognitive and Developmental Systems 8, 285-297.