A Framework and Operational Procedures for Metaverses-Based Industrial Foundation Models

Industrial processes are typical cyber–physical–social systems (CPSSs), where the effective management of employees and the efficient control of machines play important roles. Traditional industries rely heavily on human labor and neglect the development of collection–utilization–transmission integrated information loops, thereby leading to high costs and low efficiency in operational procedures. To facilitate natural interactions and smart operations for humans and machines, industrial foundation models (IFMs) based on metaverses are proposed in this article, serving as the operating systems of industrial parallel machines that provide sustainable data resources and scenarios for management and control experiments. On this basis, IFMs, comprising vision foundation models, language foundation models, and operational foundation models, are constructed to manage resources in industrial parallel machines and provide comprehensive services for industrial procedures. On the one hand, IFMs can efficiently manage various resources, including computing power, digital assets, enterprise resources, and platform I/O, via the proposed CPSS-based competing, sharing, scheduling, monitoring, allocating, and recovering mechanisms. On the other hand, imaginative intelligence, linguistic intelligence, and algorithmic intelligence can be achieved through vivid visualization of vision foundation models, natural conversations of language foundation models, and smart manipulation of operational foundation models. With the proposed IFMs, cyber–physical–social intelligence (CPSI) can be achieved to enhance the efficient management and control of industrial processes.


I. INTRODUCTION
INDUSTRIAL production plays a significant role in the economy and in our daily life. It is a complex system involving the management of human labor, the scheduling of resources, and the control of machines. With the development of new technologies, the industry is gradually moving from automation to intelligence. However, as industrial processes have multiplied and grown more complex, the expansion of industrial production is hampered by a number of issues.
Industrial processes often lack interaction or are difficult to interact with, resulting in inefficient execution. This can lead to production delays, increased costs, and a decline in quality.
Interactions between different parts of the process are essential to optimize the overall flow of work and ensure that tasks are completed in an efficient and effective manner. In order to ensure that industrial processes are executed efficiently, the designer and implementer need to take into account the various interactions between different parts of the process. This is a difficult task, as it requires a great deal of knowledge and experience. As a result, organizations may miss opportunities to improve their operations.
Management of resources has always been exceedingly challenging, particularly when it comes to managing mental labor. Low efficiency of mental labor and management difficulties are brought on by measurement challenges. It is famously difficult to quantify mental labor, which makes it challenging to manage efficiently. This leads to low efficiency and productivity among mental laborers, as they are often not sure how much work is actually expected of them. As a result, it is challenging to efficiently regulate and optimize mental labor. Businesses that significantly rely on mental work may experience decreased productivity and increased costs as a result of this inefficiency. In addition, mental labor is often subjective and difficult to quantify, making it difficult to compare different workers and identify areas in which they can be improved. Employers could struggle to defend the expense of mental labor as a result, and employees might feel undervalued and overworked [1].
With the development of intelligent technologies, some small models have already been applied in specific industrial scenarios, but small models with simple functions fail to meet the needs of enterprises for intelligent interaction and management. First, small models can only handle separate tasks in a decentralized and independent manner. The inability of small models to work collaboratively also means that more efficient and general intelligence is not possible. Second, small models are often not capable and accurate enough to handle large-scale data. The management of various resources requires the processing of large amounts of structured and unstructured data, which cannot be supported by the model capacity of small models.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The emergence of metaverses [2], [3], [4] and foundation models provides an effective solution to improve industrial processes by offering a way to visualize and analyze huge amounts of data in a more immersive and interactive way. With metaverses and foundation models, companies can improve their understanding of complex processes and make more informed decisions. At the same time, the value of mental labor can be measured more accurately through data-driven analysis. Additionally, it is possible to train employees on new procedures in metaverses or to simulate and test improvements in metaverses before implementing them in the real world.
This article proposes a framework and operational procedures for metaverses-based industrial foundation models (IFMs) in cyber-physical-social systems (CPSSs) [5]. IFMs are composed of vision foundation models, language foundation models, and operational foundation models. IFM is capable of coordinating and managing the activities of multiple resources in order to optimize production. It is the foundation of the future industrial economy and will have a profound impact on the way we live and work. With cognitive intelligence, parallel intelligence, crypto intelligence, federated intelligence, social intelligence, and ecological intelligence (6I), IFM aims to achieve the safety, security, sustainability, sensitivity, service, and smartness (6S) of enterprises, and the process is measured by the safety index, security index, sustainability index, sensitivity index, service index, and smartness index [6].

II. BIG AI MODELS
With large-scale datasets, global modeling frameworks, unsupervised learning approaches, and efficient computing platforms, developing big AI models has become possible and popular in various tasks, such as natural language and visual data processing [7]. Overall, the shift of AI models from small-scale ones to big ones is about the extension of generalization ability and the reconstruction of the learning paradigm. First, big models generalize better across application scenarios and recognize a wider range of patterns, thanks to the increasing data scale and the widely adopted Transformer framework with global and dynamic aggregation mechanisms. Second, regarding the reconstruction of the learning paradigm, big AI models have moved model development from traditional integrated and individual model development to the pretraining-finetuning paradigm. Different from the fully supervised manner used in conventional training processes, the pretraining stage takes advantage of self-supervised learning approaches to alleviate the labeling burden and learn intrinsic embeddings from the input itself.

A. NLP Models
Transformer [8], a purely attention-based architecture, was first proposed in the NLP area and has since become the main framework used for constructing big NLP models. Typical structures of big NLP models include encoder-based models, decoder-based models, and encoder-decoder models, which are paired with different training strategies and used for different tasks. Encoder-based NLP models, such as BERT [9], UniLM [10], XLM [11], ELECTRA [12], and so on, simultaneously consider the context features on both sides of each token and adopt an autoencoding (AE) objective to train the model with masked language modeling-based self-supervision. Different from the bidirectional design in encoder-based models, decoder-based models only consider tokens before the current position and use autoregression (AR) objectives during the training process. Typical decoder-based NLP models include GPT [13], GPT-2 [14], GPT-3 [15], ELMo [16], CPM-1 [17], and so on. Encoder-based models have advantages in the modeling of context features and are widely used in language understanding. However, due to the masking operations, there are gaps between the inputs at the pretraining stage and those at the finetuning stage. Decoder-based methods adopt unidirectional designs, which are suitable for generative language tasks. XLNet [18] explores the combination of advantages from both AE and AR by token permutation. Encoder-decoder frameworks combine the representation and task-specific modules in series and are widely used for sequence-to-sequence tasks, such as question answering and machine translation. T5 [19], Switch-Transformer [20], and BART [21] are typical encoder-decoder-based big language models.
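The difference between the AE and AR objectives can be made concrete with a toy example. The sketch below is illustrative only: the token list, the `[MASK]` symbol, the default masking ratio, and all function names are assumptions made for this example and are not taken from the cited models.

```python
import random

MASK = "[MASK]"  # placeholder symbol, assumed for this example


def autoencoding_example(tokens, mask_ratio=0.15, seed=0):
    """AE-style (masked language modeling) training pair.

    Some tokens are hidden; the model would predict them using
    context on BOTH sides of each masked position.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


def autoregressive_examples(tokens):
    """AR-style training pairs: each target sees only the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = ["the", "pump", "needs", "maintenance", "today"]
corrupted, targets = autoencoding_example(tokens)
pairs = autoregressive_examples(tokens)
```

The AE pair also shows the pretraining/finetuning gap mentioned above: the `[MASK]` symbol appears in pretraining inputs but not in downstream inputs.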

B. Vision Models
ViT [22], DETR [23], and other pioneering works extend Transformers to computer vision, which has attracted great attention to the design and application of vision Transformers. Several big vision models, such as ViT-MoE [24] and Swin-v2 [25], have been developed in a supervised manner on large-scale datasets, such as ImageNet [26] and JFT-3B [27]. However, supervised training relies on heavy and expensive annotations, which is unfavorable for the wide application of foundation models [7]. Augmentation-based, cluster-based, and masking-based methods are three of the main self-supervised training strategies for visual data. Augmentation-based methods apply different transformations to the inputs and calculate the loss with consistency constraints [28], [29], [30]. Cluster-based approaches use the clustering results as pseudo labels to supervise a classification task [31], [32]. Recent masking-based methods adopt an idea similar to BERT [9]: they randomly mask a percentage of the image patches in the inputs and use a Transformer-based network to infer the masked patches [33], [34].
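The masking-based strategy can be sketched in a few lines. The following is a minimal illustration, not the procedure of any cited method: the image is a plain nested list, and the patch size, the masking ratio of 0.75, and the helper names are assumptions chosen for the example.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch blocks."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([row[c:c + patch] for row in image[r:r + patch]])
    return patches


def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Randomly hide a fraction of patches.

    Returns (visible, targets): the visible patches would be fed to the
    network, and the hidden ones are what it would be trained to infer.
    Each entry keeps its patch index so positions can be restored.
    """
    rng = random.Random(seed)
    n_mask = int(len(patches) * mask_ratio)
    masked_ids = set(rng.sample(range(len(patches)), n_mask))
    visible = [(i, p) for i, p in enumerate(patches) if i not in masked_ids]
    targets = [(i, p) for i, p in enumerate(patches) if i in masked_ids]
    return visible, targets


# Toy 8 x 8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = split_into_patches(image, 4)
visible, targets = mask_patches(patches, mask_ratio=0.75)
```

No labels are involved at any point: the supervision signal is the hidden part of the input itself, which is what removes the annotation cost discussed above.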

C. Multimodal Models
Emerging progress in big language and vision models has also inspired the development of multimodal models. The multimodal inputs can be images, videos, texts, speech, tabular data, as well as predefined rules. Multiple promising applications have been designed, such as text-to-image generation [35], text-to-video generation [36], text-to-3-D generation [37], image-to-video generation [38], visual commonsense reasoning [39], [40], and so on. In multimodal networks, different Transformer encoders are used as the main backbones to extract features from diverse inputs, and cross-attention mechanisms are adopted for information fusion.
[Figure: Industrial parallel machines based on CPSS: Basic framework, processes, and functionalities modes.]
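To illustrate how cross-attention fuses information from two modalities, the sketch below implements scaled dot-product cross-attention in plain Python. It is a simplified illustration: the learned query, key, and value projections used in real networks are omitted, and the toy feature vectors and function names are assumptions for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Each query (e.g. a text token feature) receives a weighted average of
    the values (e.g. image patch features), weighted by its similarity
    to the corresponding keys.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused


# Toy features: two "text tokens" attend over two "image patches".
text_features = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(text_features, image_features, image_features)
```

In this toy case each text token draws almost all of its fused feature from the image patch it is most similar to, which is the basic fusion behavior the mechanism provides.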

III. INDUSTRIAL FOUNDATION MODELS
IFMs are the operating systems of the industrial parallel machines [41] and also the illusionists of industrial management. They can make bad things better, make fewer things more, make complex things simple, and make the parallel machine present itself to the user in a more intelligent, more convenient, and more powerful way, achieving the 6I and 6S of industrial management. At the same time, the operating systems are based on the principle of parallel management, and they are comprehensive systems that can be used to manage any type of industrial organization, from a single factory to a multinational corporation. In the future, they will be widely used in the automotive, aerospace, electronics, and other industries.

A. Industrial Parallel Machines
Parallel machines are the sublimation of computers into the intelligent world of the future. As shown in Fig. 1, the CPSS-based parallel machines combine the physically designed Newton machines and the software-defined Merton machines into one, bridging the physical space, social space, and cyberspace. With parallel machines, biological employees, robotic employees, and digital employees are deeply integrated to build a new type of parallel employees, combining human and machine, knowledge and action, and virtuality and reality. Among parallel employees, human employees with intuitive rationality, i.e., biological humans themselves, take up only 5% of the workload and are responsible for leading the organization. Robotic employees with adaptive rationality, i.e., intelligent machines, take up about 15% of the workload and dominate physical work. Digital employees with computational rationality, i.e., intelligent programs and information machines, are in charge of the remaining work (roughly 80%) and interact with biological humans and robotic humans in a natural way [1].
The parallel employees are involved in the actual operation as users, hardware, and software of the industrial parallel machines, whereas the IFM acts as the operating system of the industrial parallel machine, controlling the operation of all matters in the machine. In primitive computers, humans controlled the machines directly without an operating system. However, as the complexity of computers grew dramatically, humans were no longer up to the task. Similarly, as computers are upgraded to CPSS-based parallel machines, today's operating systems will certainly be replaced by foundation models.

B. Framework of IFM
IFMs are operating systems for industrial parallel machines and consist of three components: the vision foundation model for imaginative intelligence, the language foundation model for linguistic intelligence, and the operational foundation model for algorithmic intelligence. As shown in Fig. 2, CPSS and industrial metaverses are the cornerstones and scopes of industrial parallel machines and IFMs. By building a virtual world on the basis of the physical world, we can realize the virtual-real symbiosis of the whole life cycle of industries. The intelligent management of enterprises and the efficient closed loop of production and sales are then realized by parallel machines with the IFMs as the operating system.
The vision foundation models are responsible for the strategic planning and organization of the enterprises. Their main users are decision makers such as CEOs of companies. With increasingly intricate social relationships and the sheer scale of industry data, human decision makers are no longer able to analyze all of this information and develop a strategic plan that outlines the company's goals and objectives on their own. Instead, they must rely on vision foundation models that can take into account all of the relevant information, generate the top-K plans, and then make recommendations under the direction of human decision makers. Even though these models have their limitations at the development and evolutionary stage, human decision makers still need their help in order to make the best decisions for their companies and to prevent them from being eliminated by other companies with vision foundation models.
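The generate-then-recommend step described above can be sketched as a simple top-K selection over scored candidate plans. The plan names, the scores, and the `Plan` and `top_k_plans` names below are hypothetical and serve only to illustrate the idea; how a vision foundation model would actually score plans is outside the scope of this sketch.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    score: float  # hypothetical model-assigned score, higher is better


def top_k_plans(plans, k):
    """Return the k highest-scoring plans, best first."""
    return heapq.nlargest(k, plans, key=lambda p: p.score)


# Hypothetical candidate plans with made-up scores.
candidates = [
    Plan("expand capacity", 0.62),
    Plan("retrofit line B", 0.81),
    Plan("outsource assembly", 0.40),
    Plan("new product line", 0.77),
]
shortlist = top_k_plans(candidates, 2)
```

The shortlist, rather than a single answer, is what would be handed to the human decision maker, which keeps the final choice under human direction.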
The language foundation models are responsible for the coordination and management of the enterprises. Their main users are managers such as department heads of companies. The models are designed to help human managers in various ways, such as improving communication between departments, tracking the progress of projects, and managing resources. In particular, with the general trend of industrial automation and digitization, human managers communicate inefficiently with smart devices and digital employees, so they cannot directly manage the actual running of the company. Language foundation models, in contrast, can understand both natural language and machine-intelligent language. Therefore, they are well suited to making better decisions and improving the efficiency of a department by coordinating and analyzing the relevant data obtained from other departments and combining them with the operation of their own department, on the premise of understanding the overall planning of the company provided by the vision foundation model.
The operational foundation models are responsible for the execution of the enterprises. Their main users are workers such as grassroots staff of companies. The jobs of grassroots employees are often relatively repetitive, boring, and sometimes even dangerous. Based on the experience of past social developments, such as the emergence of ironware, textile machines, steam engines, electric motors, and computers, human society is bound to keep moving in the direction of simplicity and efficiency. These essential jobs will inevitably be performed by various robotic humans and digital humans. The operational foundation models are designed to help human workers analyze task information from language foundation models. At the same time, they maximize the productivity of robotic humans and digital humans by helping biological humans to manage and schedule specific work.

C. Operational Procedures of IFM
As shown in Fig. 3, the core task of the IFMs is the management of four kinds of resources: computing power, digital assets, enterprise resources, and platform I/O. The resource management mechanisms of the IFM mainly include a resource competition mechanism, resource sharing mechanism, resource scheduling mechanism, resource monitoring mechanism, resource allocation mechanism [42], resource recovery mechanism, and so on. Every management mechanism is equipped with its own policies, algorithms, rules, etc., that serve particular functions.
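To make the interplay of these mechanisms more tangible, the sketch below models a single resource pool with competition (priority ordering), allocation, monitoring, and recovery. It is a toy illustration under stated assumptions: the `ResourcePool` class, the priority rule, and the task names are hypothetical and do not describe the policies actually used in the IFM.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    capacity: int
    allocations: dict = field(default_factory=dict)  # task name -> units held

    def free(self):
        """Units currently available."""
        return self.capacity - sum(self.allocations.values())

    def allocate(self, requests):
        """Competition and allocation: serve requests in priority order.

        requests is a list of (task, units, priority) tuples; a higher
        priority is served first. Returns the tasks that could not be served.
        """
        waiting = []
        for task, units, _priority in sorted(requests, key=lambda r: -r[2]):
            if units <= self.free():
                self.allocations[task] = self.allocations.get(task, 0) + units
            else:
                waiting.append(task)
        return waiting

    def utilization(self):
        """Monitoring: fraction of the pool that is currently in use."""
        return 1 - self.free() / self.capacity

    def recover(self, task):
        """Recovery: return a task's units to the pool and report how many."""
        return self.allocations.pop(task, 0)


pool = ResourcePool(capacity=10)
waiting = pool.allocate([
    ("inspection", 6, 2),
    ("simulation", 5, 1),
    ("reporting", 3, 3),
])
```

Here the lowest-priority request has to wait until `recover` frees enough units, which is the kind of rule each mechanism would encode in its own policy.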
Computing power is the primary resource. It refers to the processing power of industrial parallel machines running in the metaverses and CPSS, which can be used to handle the data processing, information analysis, algorithm development, and model design involved in industrial processes and company management. As shown in Fig. 4, with the artificial systems, computational experiments, and parallel execution (ACP) method of parallel intelligence as a guide [43], [44], [45], [46], computing management can carry out physically embodied intelligence for the enterprises in the physicspace while also carrying out computationally embodied intelligence for the enterprises in the cyberspace. With the support of sufficient computing power, scene engineering builds diverse scenarios for task generation and scheduling [47], [48], [49], [50], computational experiments, and parallel testing and execution. In particular, the generation and scheduling of tasks through large-scale operations are implemented to achieve descriptive intelligence. Descriptive intelligence helps users see the big picture of where the industry or business is now and where it is going based on massive data, while generating and recommending the most appropriate development strategies or work methods. Then, predictive intelligence is realized by computational experiments on the generated tasks and arrangements in both physicspace and cyberspace. Computational experiments rely on large-scale computing power to reason about and simulate the various strategies generated in descriptive intelligence through multiple experiments in cyberspace and limited experiments in physicspace. Finally, we test and execute the results of the experiments in both spaces by parallel testing and parallel execution with virtual-real interaction, and then prescriptive intelligence can be realized. Prescriptive intelligence can monitor and guide the direction of progress in a closed-loop manner while the work is being executed, enabling the most efficient and smartest task-driven approach.
Digital assets will become the most valuable asset for enterprises in the future. Method innovation, data collection, and knowledge automation are the basis for the evolution of intelligent enterprises. At the same time, these digital assets bring not only direct economic benefits to the enterprise but also attention and trust. These virtual goods, which were originally unquantifiable, will also become new commodities that can be mass produced and circulated in CPSS. Digital assets management is mainly concerned with the production, circulation, transaction, and supervision of digital assets. In the management processes of digital assets, the federalization of digital assets [51] helps ensure their security, reliability, and privacy [52]. In addition, the effective management and utilization of digital assets are of great significance to the self-learning and evolution of IFMs, helping them to further become the core competence of enterprises.
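The descriptive, predictive, and prescriptive stages of the ACP-guided computing management described earlier can be sketched as a small loop. This is a deliberately simplified illustration: the "artificial system" is a made-up production model in which larger batch sizes raise throughput but also raise defect losses, only cyberspace experiments are simulated, and all function names and constants are assumptions for this example.

```python
import random


def describe(batch_sizes):
    """Descriptive step: enumerate candidate strategies (here, batch sizes)."""
    return list(batch_sizes)


def simulate(batch_size, rng):
    """One cyberspace experiment on a toy artificial system (made-up model)."""
    throughput = batch_size * rng.uniform(0.9, 1.1)
    defect_loss = 0.02 * batch_size ** 2 * rng.uniform(0.8, 1.2)
    return throughput - defect_loss


def predict(candidates, runs=200, seed=0):
    """Predictive step: average many simulated runs for each candidate."""
    rng = random.Random(seed)
    return {c: sum(simulate(c, rng) for _ in range(runs)) / runs
            for c in candidates}


def prescribe(expected):
    """Prescriptive step: choose the candidate with the best expected outcome."""
    return max(expected, key=expected.get)


candidates = describe([10, 25, 40])
expected = predict(candidates)
best = prescribe(expected)
```

In a full closed loop, the chosen strategy would be executed in the physical system, and the observed results would be fed back to recalibrate the artificial system before the next round; that feedback step is omitted here.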
Enterprise resources are the basic resources used to produce goods and services, such as land, labor, capital, and so on, which are essential to both traditional and intelligent industries. Close monitoring of the inventory of raw materials and products is essential to the company's capital chain and market potential. Physical and mental labor will always be a necessity for industrial production. Intelligent equipment is capable of efficiently completing production tasks. Enough physical space and cyberspace is needed to ensure the continuous growth of enterprises. By allocating capital reasonably, companies can make up for their shortcomings and develop more effectively. Energy and ecology have been the two main constraints faced by industrial production in recent years. The implementation of new intelligent technologies and the allocation of energy and ecology indicators can help reduce the waste of energy and ecological resources in production processes, making them more efficient. Overall, with the help of IFMs, production management is able to go further than robotic process automation (RPA) and achieve digital process intelligence (DPI) [53].
The platform I/O are necessary components to maintain the healthy operation of the enterprise, including human resources to update the labor, supply chain [54], [55], [56] to update the materials and products, and business agreements to update the enterprise strategy and development plan. I/O of human resources, supply chain, and commercial agreements refer to the onboarding and offboarding of employees, the procurement of raw materials and consumption of products, and the cooperation and termination of agreements, respectively. The IFMs are expandable, and they can connect and interact with each other. Therefore, cooperation and communication with various enterprises, government departments, and other entities can be accomplished with the help of IFMs. At the same time, all enterprises with IFMs can form a decentralized autonomous organization (DAO) [57], and various resources can flow in the organization autonomously and efficiently.

IV. SUPPORTS FOR INDUSTRIAL FOUNDATION MODELS
The explosive emergence and development of all kinds of advanced sciences and technologies have provided increasingly rich and effective technical support for the implementation and realization of IFMs.

1) New Information Technology (Artificial Intelligence): Artificial intelligence [58], [59], [60], [61] is the core support for the IFMs. Vision foundation models, language foundation models, and operational foundation models are all based on the new information technology, artificial intelligence [62].
2) Knowledge Automation: Data-driven AI models contribute to industrial automation and intelligence, but lack interpretability. Moreover, data-driven models are not always efficient and applicable. For example, when dealing with laws and theorems that seem very obvious and simple to humans, data-driven models may require large amounts of data and consume huge amounts of computing power. Knowledge automation [63] encompasses not only the modeling of traditional rules [64], reasoning, and explicit representations but also the modeling of tacit knowledge, pattern recognition, group experience, etc. [65], [66]. Building executable knowledge software systems with the help of intelligent technology and software technology can liberate knowledge workers from repetitive tasks.
3) Cognitive Sciences and Parallel Cognition: To a considerable extent, cognitive science [67] and AI have the same origin, being different statements and different aspects of the same function and goal. Cognitive science is the scientific study of the mind and its processes, specifically, the exploration of the nature, tasks, and functions of cognition in general. Parallel cognition circulates and integrates multidisciplinary and interdisciplinary knowledge in the physical, mental, and artificial worlds, with three consciousnesses in three worlds facilitating the development of IFMs.
4) Interface Technology: The ultimate purpose of IFMs is to serve biological humans, and interaction technology is the technology that humans are closest to and have the most contact with, so it plays a key role in the experience of using IFMs. Human-computer interaction [68] is the focus of interaction technology, and the brain nerves are the most natural and effective means of human interaction. Brain neural recognition technology has made great strides in recent years, while text [69], speech [70], [71], and visual [72], [73], [74], [75], [76], [77] recognition technologies are slowly becoming part of our daily lives. These interaction technologies provide the necessary support for IFMs.
5) Connection Technology: The advancement of high-capacity, high-reliability, low-latency mobile communication technologies [78], [79], such as 5G and 6G, has led to further development and application of technologies such as edge-cloud cooperation and edge computing [80], [81]. By communicating closely and in real time, simple computing based on local small models and complex computing based on remote large models can both be achieved. These supports are necessary for the local-cloud collaborative deployment of IFMs and their interaction.
6) Ecological Technology: Ecological technologies, such as DAO [82] and blockchain [83], [84], [85], maintain the secure [86] and smart operation of the IFMs by building an intelligent ecosystem of enterprises. Related enterprises linked together through DAO and blockchain can interact and share resources safely and securely. At the same time, members of the organization can participate in the rule-making and governance of the organization to achieve optimal development of the enterprise itself as well as self-management and autonomous evolution of the organization.
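The local-cloud collaboration mentioned under connection technology can be illustrated with a simple dispatcher that sends simple requests to a local small model and complex requests to a remote large model. This is a sketch under stated assumptions: the complexity score, the threshold of 0.5, the fallback to the local model when the cloud is unreachable, and both model stubs are hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    complexity: float  # hypothetical estimate in [0, 1] of how hard the task is


def local_small_model(req):
    """Stub standing in for a small model running at the edge."""
    return f"local:{req.name}"


def remote_large_model(req):
    """Stub standing in for a large model served from the cloud."""
    return f"cloud:{req.name}"


def route(req, threshold=0.5, cloud_available=True):
    """Send simple requests to the edge and complex ones to the cloud.

    If the cloud is unreachable, fall back to the local model
    (a design assumption of this sketch).
    """
    if req.complexity <= threshold or not cloud_available:
        return local_small_model(req)
    return remote_large_model(req)


simple = route(Request("read gauge", 0.1))
hard = route(Request("plan schedule", 0.9))
offline = route(Request("plan schedule", 0.9), cloud_available=False)
```

Low-latency, high-reliability links matter here because the decision to route to the cloud is only worthwhile when the round trip is fast and dependable.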

V. CASE STUDIES
In this section, we apply the IFMs to the oil field, manufacturing, and automatic optical inspection (AOI) industries, and propose parallel oil field, parallel manufacturing, and parallel AOI, respectively.

A. Parallel Manufacturing
As the most common component of industries, manufacturing is the main engine of economic growth. With human and social factors being introduced into manufacturing systems, which are characterized by uncertainty, diversity, and complexity, current manufacturing systems are facing several issues: excessive human intervention, lack of flexibility, inability to achieve on-demand production, and inefficient human-computer interaction. As a new paradigm in smart manufacturing, parallel manufacturing provides a feasible solution for tackling those problems. It is comprised of cyber systems for product life-cycle management, physical systems for totally integrated automation, and social systems for manufacturing intelligence. Various manufacturing resources, including materials, equipment, robots, and humans, are integrated and coordinated through the Industrial Internet and the Industrial Internet of Things (IoT) [87], where knowledge automation technology is embedded to form the industrial Internet of Minds (IoM).
Digital humans and robotic humans accomplish the majority of manufacturing tasks in a cooperative manner, whereas biological humans focus more on creative and global tasks with the help of vision foundation models in IFM, as illustrated in Fig. 5. Vision foundation models can design products and build production processes and enterprise blueprints based on social needs. At the same time, the models can deliver them more effectively to digital humans and other foundation models for efficient production.

B. Parallel Oil Fields
With the development and needs of human production and life, the traditional technologies for oil fields have also undergone a transformation. Based on Industry 4.0, many experts and scholars have proposed the concept of smart/intelligent oil fields (IOFs) based on digital oil fields (DOFs), which aims to increase oil production and acquire more economic or political benefits.
However, the proposal of IOFs is based on traditional cyber-physical systems (CPSs), which do not fully consider the role of the social systems [88] in which biological humans are located, and it therefore has certain limitations. Therefore, we applied Industry 5.0 and the IFMs in CPSS to the oil fields to produce a new model of oil field production and operation: the parallel oil fields. As shown in Fig. 6, the role of language foundation models in IFM is highlighted in the coordination and management of tasks, such as production calculation and scheduling, and equipment monitoring and maintenance in parallel oil fields. Dynamometer cards are the most important parameter diagrams in parallel oil fields. Based on the dynamometer cards, the language foundation models can not only obtain information on the operation status of the pumping unit and further issue and arrange maintenance work, but also realize the scheduling and control of production by setting the target dynamometer cards of the pumping units.
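As a small illustration of how a measured dynamometer card might be compared with a target card, the sketch below computes the area enclosed by each load-displacement loop, which corresponds to the work done per stroke, and reports the relative deviation. This is only one simple numeric feature and is not presented as the method used in parallel oil fields: the rectangular card shapes, the units, and the function names are assumptions for the example.

```python
def card_area(points):
    """Area enclosed by a closed loop of (displacement, load) points.

    Uses the shoelace formula; with displacement in m and load in kN
    (assumed units), the result is the work per stroke in kJ.
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def relative_deviation(measured, target):
    """Signed relative difference between measured and target card areas."""
    a_measured, a_target = card_area(measured), card_area(target)
    return (a_measured - a_target) / a_target


# Idealized rectangular cards with made-up values.
target_card = [(0, 20), (3, 20), (3, 50), (0, 50)]
measured_card = [(0, 20), (3, 20), (3, 44), (0, 44)]
deviation = relative_deviation(measured_card, target_card)
```

A large deviation could be one trigger for arranging maintenance or adjusting the target card, although real card diagnosis relies on the shape of the loop and not only on its area.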

C. Parallel AOI
Defect detection is an important part of the industrial production process, used to assess product quality and provide guidance for subsequent repair. LCD panels, IC carrier boards, precision PCBs, etc., have high production costs and strict quality requirements, and defects in the production process can easily lead to the scrapping of the final product or create hidden dangers in the use of the product. Therefore, defect detection for these products is particularly important. Early defect detection relied entirely on manual sorting. With the development of computer vision and automation equipment, AOI equipment has greatly reduced the burden of manual inspection while improving the efficiency and stability of detection. However, in actual PCB defect detection, in order to achieve higher detection accuracy, it is still necessary to manually double-check the PCB after it passes through the automatic defect detection equipment.
This reinspection process currently requires a lot of labor costs and gradually becomes a bottleneck in PCB production. On the one hand, workers need a lot of professional knowledge and practical experience to make a correct judgment on the type and severity of defects. On the other hand, the accuracy and efficiency of manual restoration are very low and cannot meet the needs of modern production.
As shown in Fig. 7, having robotic and digital workers managed by operational foundation models in IFM assist biological workers in completing the reinspection operation can effectively improve efficiency and reduce work difficulty. Each biological worker is equipped with exclusive digital workers that use data and knowledge to understand defect information and restoration operations simultaneously. The defect detection and repair work in the physical world is performed by robotic workers, and only in rare cases does intervention by a biological human become necessary. In this process, IFM coordinates and manages the various resources of the enterprise, monitoring and scheduling the work of robotic and digital humans [89], while making communication between them and biological humans more efficient.
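The division of reinspection work described above, where robotic workers handle most cases and a biological worker intervenes only rarely, can be sketched as a confidence-based triage. The `Detection` fields, the threshold of 0.9, the routing labels, and the sample data are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    board_id: str
    defect_type: str
    confidence: float  # hypothetical detector confidence in [0, 1]


def triage(det, auto_threshold=0.9):
    """Route confident detections to robotic repair, the rest to a human."""
    if det.confidence >= auto_threshold:
        return "robotic_repair"
    return "human_review"


def escalation_rate(detections, auto_threshold=0.9):
    """Fraction of detections that still need a biological worker."""
    if not detections:
        return 0.0
    escalated = sum(1 for d in detections
                    if triage(d, auto_threshold) == "human_review")
    return escalated / len(detections)


# Made-up detections for illustration.
detections = [
    Detection("pcb-001", "short", 0.97),
    Detection("pcb-002", "open", 0.95),
    Detection("pcb-003", "spur", 0.62),
    Detection("pcb-004", "short", 0.99),
]
routes = [triage(d) for d in detections]
rate = escalation_rate(detections)
```

Tracking the escalation rate over time gives a simple indicator of how much of the reinspection bottleneck remains with biological workers.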

VI. CONCLUSION
With the purpose of effective and efficient management and control of industrial processes, the framework and operating procedures of IFMs are proposed in this article to schedule various resources and achieve natural interactions, which lay the foundation for the sustainable development of enterprises. Specifically, three-level foundation models are constructed to achieve smart task comprehension, specification, and planning in industrial processes involving both humans and machines, so as to form a new type of operating system for the parallel machines in CPSS. Three typical applications, including clothing manufacturing, oil exploitation, and optical inspection, are elaborated to explore the potential of IFM. IFM can drive enterprise management to achieve cognitive intelligence, parallel intelligence, crypto intelligence, federated intelligence, social intelligence, and ecological intelligence for realizing the "6S" goals of enterprises, which are safety, security, sustainability, sensitivity, service, and smartness.