Parallel Vision for Intelligent Transportation Systems in Metaverse: Challenges, Solutions, and Potential Applications

Metaverse and intelligent transportation systems (ITS) are disruptive technologies that have the potential to transform the current transportation system by decreasing traffic accidents and improving driving safety. The integration of metaverse and transportation technology, called the metaverse transportation system (MTS), can greatly improve the intelligence of the real transportation system. The digital models built in MTS help to simulate the full life cycle of physical entities, which equips the virtual space with controllability and flexibility. In this article, we concentrate on the field of environment perception, which is the basic function of intelligent vehicles in MTS. To overcome the poor scalability of traditional environment perception methods, we develop the framework of parallel vision for ITS in the metaverse (PVITS), consisting of construction of virtual transportation space, model learning based on computational experiments, and feedback optimization based on parallel execution. This article highlights the opportunities brought by PVITS in terms of model precision and generalization improvement. Then, the challenges of PVITS are discussed, i.e., the distribution difference between the virtual and real transportation spaces, the structure design and theoretical interpretation of vision models, and data security and privacy in the virtual transportation space. After that, we present several solutions to tackle these challenges and fully exploit the superior characteristics of PVITS while attenuating its negative side effects. Some potential applications are also given to demonstrate the effectiveness and reliability of PVITS.


I. INTRODUCTION
THE INTELLIGENT transportation system (ITS) [1], commonly integrating advanced sensing, communication, and information technologies, has emerged as a critical field for promoting the efficiency, effectiveness, and safety of transportation systems and for satisfying transportation development demands [2], [3], [4]. Several different types of sensors have been applied to collect continuously generated traffic information in ITS. It is well known that visual information is applied more frequently than other types of perceptual information in practice. Due to the attractive price-to-performance ratio of image sensors, vision methods have become increasingly essential for ITS, especially for the different autonomous transport devices in the ongoing transportation revolution. They can provide an accurate and timely picture of the traffic situation by applying specific computer vision techniques to the acquired visual information [5], [6], [7]. Taking full advantage of the visual information enables a human or machine to better perceive and understand complex transportation environments.
Although significant progress has been made in computer vision techniques in ITS, researchers from both academia and industry still face several major challenges that hinder further advances in ITS development [8], [9], [10]. First, traditional vision methods have shortcomings in data acquisition, model learning, and evaluation. We usually design and evaluate vision methods only for specific application scenarios or environmental conditions, so it is difficult to ensure their generalization in complex and open traffic environments [11]. Second, it is time-consuming and labor-intensive to collect and label large-scale and diverse datasets from real traffic scenes. Besides, labeling data through human observation is error-prone, especially under low illumination, bad weather, and other adverse conditions [12]. Finally, the real traffic scene is uncontrollable and unrepeatable. It is impossible to vary each component of the scene (such as weather conditions) while keeping the other factors unchanged. Therefore, we cannot thoroughly analyze the impact of each component of the scene on the vision method separately.
The integration of metaverse and transportation technology, called the metaverse transportation system (MTS), provides a potential way to overcome the above challenges and greatly enhances the intelligence of the real transportation system. The metaverse has the potential to extend the physical world using augmented and virtual reality technologies and allows users to seamlessly interact within real and simulated environments using avatars or holograms. Thus, the cyber-physical-social system (CPSS), consisting of three interacting worlds (the physical, mental, and virtual worlds), can be built to provide information for users to support their decisions and enhance their performance. The mental world can be deeply integrated into the virtual and physical worlds. The virtual world in CPSS can realistically mimic aspects of the physical world, including various static and dynamic models. Inspired by this, MTS has two key spaces, i.e., a physical transportation space and a virtual transportation space. The virtual transportation space is constructed by simulating the physical one, and large-scale and diverse data can be generated from the virtual space. This virtual big data can be guided by model learning and concentrated on producing useful and complementary knowledge.
Building upon this, the framework of parallel vision for ITS in the metaverse (PVITS) is developed to guide the accurate perception of road environment information; it consists of construction of virtual transportation space, model learning based on computational experiments, and feedback optimization based on parallel execution. In the virtual transportation space, digital replicas are acquired through a thorough inspection of the physical transportation space [13]. Each replica is linked with the corresponding physical entity in the real space and can capture the most distinctive characteristics of the linked physical entity. Having digital replicas of physical entities brings great benefits. First, since the entire life cycle of physical entities can be simulated in the virtual space, we can flexibly change and control different scene states by setting different variables. Thus, it is convenient to comprehensively evaluate the vision model under various scenes and then optimize it continuously, greatly improving its generalization in complex and open traffic environments. Second, we can automatically generate a large-scale and diverse dataset from the virtual space without time-consuming and labor-intensive manual labeling. Finally, the constructed virtual scene is controllable and repeatable. Each component of the scene (such as weather conditions and vehicle type) can be separated, and the impact of each component on the vision model can be analyzed separately. As for model learning based on computational experiments, we incorporate the virtual big data from the constructed virtual space and the real small data from the physical space to carry out thorough computational experiments, aiming at learning and evaluating the vision models by virtual-real interaction. As for feedback optimization based on parallel execution, we execute the models in parallel in the virtual and physical spaces, so that real-time feedback optimization is carried out to realize intelligent perception and understanding of complex scenes.
Our contributions can be summarized as follows.
1) The PVITS framework is developed to overcome the poor scalability of traditional environment perception methods; it consists of construction of virtual transportation space, model learning based on computational experiments, and feedback optimization based on parallel execution.
2) We summarize and highlight the opportunities brought by PVITS in terms of model precision and generalization improvement. Meanwhile, the challenges of PVITS are also discussed from the perspectives of scene distribution shift, model structure design, and privacy protection.
3) Several solutions are given to tackle the challenges and fully exploit the superior characteristics of PVITS. Extensive experiments on two typical perception tasks (collaborative perception and object detection) demonstrate the effectiveness of PVITS.

The remainder of this article is organized as follows. Section II presents the background of parallel vision and the metaverse. The framework of PVITS is elaborated in Section III. The challenges and solutions are analyzed in Sections IV and V, respectively. Section VI gives some potential applications of PVITS, such as collaborative perception and object detection. Finally, the conclusion is drawn in Section VII.

II. BACKGROUND

A. ACP Theory and Parallel Vision
The ACP theory was proposed for solving modeling and control problems in complex systems [14], [15]; it comprises artificial systems, computational experiments, and parallel execution. ACP theory connects the virtual and physical worlds through parallel management and solves real-world problems by conducting experiments in virtual societies. Its core idea is to take the virtual world as the other half of the problem-solving space, forming a complete complex space together with the physical world. ACP theory has been applied in several fields, such as parallel traffic [16], [17], parallel vision [18], [19], and parallel medicine [20], [21].
Parallel vision, first proposed by Wang et al. [18], is a new theoretical framework established by introducing ACP theory into the field of computer vision, consisting of three stages, i.e., artificial scenes, computational experiments, and parallel execution. It can better overcome problems that exist in traditional computer vision methods, such as data acquisition [22], model learning [23], and model evaluation. Wang et al. [18] elaborated on the concept and basic framework of parallel vision for perception and understanding of complex scenes. However, that work focuses on a detailed description of the theoretical framework and does not extend the theory to practical applications. Some relevant literature based on parallel vision theory has emerged recently. Li et al. [24] constructed a large-scale artificial scene and collected a new virtual dataset, named ParallelEye, for traffic vision research, which only addresses the first stage of parallel vision, i.e., artificial scenes. In order to deal with the distribution mismatch between the synthetic and real domains, Zhang et al. [25], [26] proposed a synthetic-to-real domain adaptation method at the image and region levels, which extracts domain-invariant features effectively [27]. They only concentrate on the second stage of parallel vision, i.e., computational experiments. Wang et al. [28] proposed a theoretical framework named long-tail regularization (LoTR) and a parallel vision actualization system (PVAS), which can regularize long-tail scenarios and search for challenging long-tail scenarios. They applied it to optimize the configuration of the competition tasks of the Intelligent Vehicle Future Challenge of China, a typical application case of overcoming real-world problems.
Different from these works, each of which concentrates on a certain aspect, such as an application case or model performance, we conduct systematic research on parallel vision for ITS. In addition to introducing the framework of PVITS in detail, we also summarize and highlight the opportunities brought by PVITS considering the characteristics of the transportation system. The challenges of PVITS are also raised from the aspects of scene distribution shift, model structure design, and privacy protection. Furthermore, we give several solutions to tackle the challenges and conduct experiments on two typical cases to demonstrate the effectiveness of PVITS.

B. Metaverse Technology
Metaverse [29], [30] is a combination of the prefix "meta" (meaning transcendence) and the suffix "verse" (from "universe"), first coined in the science fiction novel Snow Crash. It describes an interaction between the real and the virtual world. With the development of science and technology, the metaverse is gaining widespread attention. Generally speaking, a metaverse is a digital world of virtual-real interaction that integrates multiple emerging technologies, such as virtual reality [31], augmented reality [32], 5G networks [33], blockchain [34], and digital twins [35].
The metaverse builds a virtual world that not only reflects the physical world but also has the ability to expand infinitely, thus forming an interaction space where the virtual and physical worlds interact and entangle with each other [36]. From an engineering perspective, the metaverse has the representative characteristics of CPSS, which consists of three worlds (the physical, virtual, and mental worlds) and two spaces (the physical and cyber spaces). The development of the metaverse usually contains three phases: 1) digital twins; 2) digital natives; and 3) surreality. The first phase produces virtual, high-fidelity digital twins of the physical world, and the second phase mainly focuses on native content creation. In the last phase, the metaverse turns into a surreal world that assimilates reality into itself. So far, the metaverse has not truly been realized, and its realization still depends on the development and integration of various technologies.
The metaverse also inspires the development of intelligent transportation. Motivated by intelligent vehicles (IVs), which can host a local metaverse thanks to their built-in computing and networking capabilities, Zhou et al. [37] defined a fusion framework for vehicular industries and the metaverse named Vetaverse (Vehicular-Metaverse), which can be used for monitoring and managing large transportation systems. Pamucar et al. [38] considered four alternative metaverses and evaluation metrics and provided a case study to demonstrate the applicability of their metaverse assessment framework. Although the metaverse has promising applications in transportation systems [39], many challenges still exist, such as limitations in software and hardware, security, and network issues.

III. PARALLEL VISION FOR INTELLIGENT TRANSPORTATION SYSTEM IN METAVERSE
Computer vision techniques have made significant progress in ITS. However, there still exist several major challenges that hinder further advances in ITS development [40], such as the generalization of vision methods to various traffic conditions, especially extreme scenes (e.g., sudden accidents, terrible weather, and violations of traffic rules), the effectiveness of collaboration among different autonomous transport devices using computer vision techniques, and the adequate fusion of multisensor perceptual data. PVITS provides an effective solution to overcome these challenges. It is well known that the metaverse aims to build a virtual space that runs in parallel with the real world, realizing the connection between the virtual world and the real world. Building upon this, MTS consists of a physical transportation space and a virtual transportation space. MTS simulates the physical transportation space to generate the virtual space and virtual big data. The virtual big data is guided by model learning and focuses on generating complementary knowledge. Data and knowledge are generated and updated in parallel execution, and the robustness and generalization of the algorithm are constantly improved in this process, as shown in Fig. 1.

The digital model built by the metaverse in the virtual transportation space can simulate real traffic elements and conditions and contains static and dynamic models. The static model restores the static elements related to vehicle driving in the scene, such as traffic roads (highways, urban expressways, urban trunk roads, urban branch roads, residential roads, suburban and industrial areas, rural roads, etc.); natural environment elements (rain, snow, fog, sunshine, daytime, night, etc.); and static traffic elements (traffic signs, street lights, stations, tunnels, surrounding buildings, etc.). The dynamic model includes vehicle function elements (path planning, lane keeping, left and right turning, obstacle avoidance, traffic light response, etc.); types of traffic participants (motor vehicles, such as buses, trucks, cars, and motorcycles, and nonmotor objects, such as adults, children, and animals); and behaviors of traffic participants (going straight, crossing the road, crossing lane markings, standing still, etc.).
In MTS, IVs adopt a variety of sensors (millimeter-wave radar, lidar, cameras, and satellite navigation) to sense the surrounding environment [41], collect data [42], and predict the behavior of static and dynamic objects [42]. Combined with navigation map data, an IV carries out systematic calculation and analysis so as to make drivers aware of possible dangers in advance, effectively increasing the comfort and safety of driving. Therefore, environmental perception is the basic function of an IV, and accurate perception of road environment information is the premise for realizing calculation and analysis. PVITS provides a methodology to enhance the environmental perception ability of IVs. It effectively overcomes the poor scalability of computer vision methods in bad weather, unexpected accidents, and other long-tail traffic scenarios. Besides, the problem that the lack of data magnitude and diversity limits the generalization performance of the vision model is also addressed. PVITS first constructs the virtual transportation space to simulate real scenes with complex challenges and automatically generates large-scale and diverse virtual datasets with detailed and accurate annotation. Then, the virtual big data and the real small data are combined to carry out comprehensive computational experiments, aiming at learning and evaluating the vision models by virtual-real interaction. Finally, the vision models are executed in parallel in the virtual and physical spaces, and real-time feedback optimization is carried out to realize intelligent perception and understanding of complex scenes. PVITS mainly consists of the construction of virtual transportation space, model learning based on computational experiments, and feedback optimization based on parallel execution, as shown in Fig. 2.

A. Construction of Virtual Transportation Space
To overcome the problem that the lack of data magnitude and diversity limits the generalization performance of the vision model, the first component of PVITS, the construction of virtual transportation space, provides an effective solution. First, a lifelike virtual space is constructed according to the real transportation space. It simulates the complex environmental conditions of the real scene and generates virtual big data automatically. The space construction can use open-source simulators or commercial game engines, such as Unity, 3DS MAX, OpenGL, and Google 3-D, which mainly utilize computer graphics [43], virtual reality [44], transportation simulation [45], [46], and other technologies. After completing the space construction, it is necessary to set up virtual sensors for the scene. These virtual sensors simulate the physical parameters of the actual sensors to generate diverse virtual data sequences. The entire stage automatically generates accurate, multilabeled virtual data, which can be applied to various vision tasks, such as object detection, object tracking, semantic segmentation, instance segmentation, panoramic segmentation, and depth estimation.
Specifically, the virtual transportation space is composed of many elements, including static objects, dynamic objects, seasons, weather, light sources, etc. Static objects in the virtual space have appearance properties similar to those in the physical space. Dynamic objects should have the functional properties of the real targets. Season and weather directly affect the rendering effect, which is required to be consistent with the physical laws of the physical space. For example, plants bloom in spring, the light source during the day is mainly sunlight, and the light sources at night are mainly street lights and car lights. Ensuring the fidelity of the virtual space can greatly reduce the domain gap between the two parallel spaces from the perspective of simulation.
Compared to the real data, the generated virtual data has the following advantages.
1) Virtual data is easier to collect, and its annotation information is more comprehensive.
2) Virtual data is easier to expand in scale and diversity. By setting different physical models and parameters, it is possible to obtain unlimited diverse data, which is more conducive to training and evaluating vision models.
3) The physical space is usually not repeatable, but the virtual one can restore various scenes with fixed parameters, so as to evaluate the vision models from various angles.
4) Visual data cannot be obtained in some real scenes, such as high-speed rail line fault detection and battlefield data detection. Vision models can be designed and validated by generating virtual data in the constructed virtual space.

B. Model Learning Based on Computational Experiments
Model learning based on computational experiments is to design, train, and evaluate vision models by combining virtual and real datasets. Due to the complex and changeable traffic environment, it is tough to collect real data covering all traffic scenes. Existing vision models therefore cannot be fully trained and evaluated under various complex traffic conditions and are only validated effectively in limited scenes. Such models may produce unexpected results in practical applications. In order to enhance the robustness and generalization of vision models, comprehensive and sufficient experiments need to be carried out in complex and changeable environments. Compared with experiments based on the physical space, the virtual space can simulate a variety of complex and changing traffic conditions in a controllable way. The models can be flexibly evaluated and updated under various traffic scenes, even under long-tail extreme cases.
Computational experiments contain three steps: 1) model design and synchronization; 2) model prediction and optimization; and 3) experimentation and evaluation. We first perform model design and synchronization in the virtual and real transportation space. Then, the vision models are employed to transform the big data into complementary knowledge and optimize it under various complex virtual traffic scenes. After that, experimentation and evaluation are conducted to improve the robustness and generalization of vision models.

1) Model Design and Synchronization:
In this step, we design novel vision models to improve their accuracy, robustness, and generalization in the actual traffic environment. Currently, common vision models are trained on public datasets, such as KITTI and MS COCO. Due to the limited categories and quantity, models trained on these datasets are difficult to apply directly in practical traffic scenes. The virtual space can simulate various complex and changing environmental conditions, such as sudden traffic accidents, severe weather, and low-light conditions. The generated large-scale and diverse virtual data can be regarded as an effective supplement for designing and learning vision models. Besides, the vision models are synchronized between the virtual and real transportation spaces.
2) Model Prediction and Optimization: In the model prediction and optimization process, we can first train on the virtual data and then transfer to the real data for fine-tuning, or we can mix the virtual and real data in proportion to train the vision model. Due to the common problem of data offset, that is, the data in the source domain (virtual space) and the data in the target domain (physical space) having different distributions, it is necessary to conduct domain adaptation research, such as constructing a shared latent feature space between the source and target domains. The features from this space satisfy a consistent distribution, guiding the model to effectively obtain and utilize the latent shared information between the two domains. After this step, we can realize an unbiased transfer of vision models from the virtual space to the physical space.
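As a concrete illustration, the following PyTorch sketch shows the two training regimes just described: pretraining on virtual data followed by fine-tuning on real data, and proportional mixing of the two domains. The datasets `virtual_set` and `real_set` and all hyperparameters are hypothetical placeholders, not part of the original framework.

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def pretrain_then_finetune(model, virtual_set, real_set, optimizer, loss_fn,
                           pretrain_epochs=10, finetune_epochs=3, device="cuda"):
    """Regime 1: pretrain on large auto-labeled virtual data, then
    fine-tune on the small labeled real dataset."""
    for dataset, epochs in [(virtual_set, pretrain_epochs),
                            (real_set, finetune_epochs)]:
        loader = DataLoader(dataset, batch_size=32, shuffle=True)
        for _ in range(epochs):
            for images, labels in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(images.to(device)), labels.to(device))
                loss.backward()
                optimizer.step()

def mixed_loader(virtual_set, real_set, real_ratio=0.5, batch_size=32):
    """Regime 2: draw virtual and real samples in a fixed proportion."""
    mixed = ConcatDataset([virtual_set, real_set])
    # Weight samples so real data occupies `real_ratio` of each batch on average.
    weights = ([(1 - real_ratio) / len(virtual_set)] * len(virtual_set) +
               [real_ratio / len(real_set)] * len(real_set))
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed))
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```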
3) Experimentation and Evaluation: In this step, we can perform a virtually unlimited number of verification experiments in the virtual space and then perform one-step verification in the physical space. This is a cost-saving and simple approach, since the virtual space is controllable and reproducible, and directly modifying different parameters can rapidly produce various validation scenes. Furthermore, auxiliary evaluation in the virtual space can fully control various environmental conditions (such as lighting, weather, and roads), object appearance, and motion status. This evaluation method can also greatly reduce the uncontrollable factors in practice and increase the generalization and robustness of the vision models.
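A minimal sketch of such a controlled evaluation sweep is given below; `simulate_scene` and `evaluate` are hypothetical helpers standing in for the virtual-space renderer and the task metric (e.g., detection mAP), and the factor lists are illustrative.

```python
import itertools

WEATHERS = ["sunny", "rain", "fog", "snow"]
TIMES = ["day", "dusk", "night"]
ROADS = ["highway", "urban", "rural"]

def sweep_conditions(model, simulate_scene, evaluate):
    """Evaluate the model over every combination of controlled scene
    factors, holding all other variables fixed (same random seed) so
    each factor's impact can be analyzed separately."""
    results = {}
    for weather, time, road in itertools.product(WEATHERS, TIMES, ROADS):
        val_set = simulate_scene(weather=weather, time=time, road=road, seed=0)
        results[(weather, time, road)] = evaluate(model, val_set)
    # Sort ascending to surface the weakest conditions for targeted retraining.
    return sorted(results.items(), key=lambda kv: kv[1])
```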

C. Feedback Optimization Based on Parallel Execution
On the basis of computational experiments, the trained vision models are executed in parallel in the physical and virtual spaces. Through evaluations under various complex conditions in the physical space, the problems of the vision models emerge. We can then fine-tune and update the models in the virtual space according to the problems that arise. Through this virtual-real interaction and feedback, the vision models are evaluated and optimized effectively and continuously. This feedback optimization based on parallel execution enables the vision models to perform effective visual perception and understanding in complex traffic environments.
The typical characteristic of parallel execution is combining the physical space and the virtual space closely to realize feedback optimization. The virtual space can simulate various complex traffic conditions, including not only general scenes but also scenes that rarely occur and are tough to collect in the physical space. Optimizing the model in these scenes can significantly improve its robustness and generalization. In some scenes, a wrong prediction by the vision model may lead to serious consequences. Taking autonomous driving as an example, wrong detections may lead to a severe traffic accident, and in unseen cases, the vision model may be completely unpredictable. The virtual space can simulate these long-tail and unknown scenes in advance and improve the model performance more safely and effectively. Meanwhile, there may be some actual situations not considered in the virtual space, so that the virtual space cannot completely cover the full distribution of the physical space. For these situations, based on the problems and prediction errors encountered when the model is applied in the physical space, real-time feedback can be given to model learning and optimization in the virtual space. In this way, we can continuously improve the robustness and generalization of vision models.

IV. CHALLENGES

A. Distribution Difference Between Virtual and Real Transportation Space
Deep learning methods can automatically extract high-level semantic features from raw data and meet end-to-end requirements in practical applications. However, traditional deep learning methods belong to standard supervised learning. There are two necessary conditions for the effectiveness of supervised learning: one is to rely on a large amount of labeled training data, and the other is to assume that the training data and test data obey the same distribution. Both conditions are indispensable. In the virtual transportation space, we can automatically generate large-scale and diverse data with detailed annotation. If a vision task needs to be implemented in a brand-new scene, a large amount of training data needs to be collected and labeled. This labeling process can be rapidly achieved in the virtual transportation space, which saves a great deal of human resources and time. However, the second condition does not necessarily hold in PVITS.
Due to the limited accuracy of the rendering engine and the complexity of the external environment, the distribution of the data changes with time, weather, and viewpoint, so that the distributions of the virtual and real data do not match. For example, the geometry and surface texture of objects in the virtual space should be consistent with those in the physical space; when virtual pedestrians move in the virtual space, their speed and gait should be similar to real pedestrians; when virtual vehicles drive on virtual roads, they should follow the driving behavior of real vehicles. However, all virtual models are simplifications of physical models, requiring an appropriate compromise in the level of detail. If too little detail is included, the fidelity of the virtual data is too low; conversely, if too much detail is included, the model becomes too complex to process. In addition, improving the accuracy of the rendering engine requires a fully accurate sampling and rendering process, which is technically infeasible and computationally intractable. Therefore, the generated virtual data always have a distribution difference from the real data. As for vision models, even though they perform well on virtual data, their performance on real data usually degrades considerably.

B. Structure Design and Theoretical Interpretation of Vision Models
Many exciting research results have emerged in the field of computer vision since 2012. For example, the performance of face recognition, object recognition, and classification has approached or even surpassed the human visual system, which is mainly due to the development of deep learning technology. However, the application of deep learning models to vision tasks lacks sufficient theoretical support, and the interpretability of the learned models is weak. The proposed PVITS framework simulates the dynamically changing information of the outside world through the artificial scenes constructed in the virtual transportation space and evaluates the influence of changing environmental conditions on the parameters of the vision model. Although this can increase the interpretability of the model to a certain extent, many limitations still exist. Basic issues, such as how to choose a model, how to determine the depth of the model, and the nature of deep learning, have not been well explained. This problem is particularly prominent for autonomous driving in intelligent traffic scenes. It is difficult to understand the essence of vision models based on deep learning, and there are potential safety hazards, especially regarding the attribution of liability after an accident. Moreover, inputs with adversarial perturbations can mislead the perception model of a self-driving vehicle, causing it to misclassify road signs, which may lead to catastrophic consequences. For example, Tencent Keen Security Lab found that placing a few stickers at a specific spot on the road can cause a car in autonomous mode to merge into the opposite lane. Therefore, deep learning theory needs to be further improved to provide guidance for designing model structures, accelerating model training, and improving model performance, causality, and interpretability.

C. Data Security and Privacy in Virtual Transportation Space
Virtual-real interaction is the core of PVITS, but a large number of virtual-real interactions inevitably lead to sensitive data leakage and privacy violations, since the process of virtual-real interaction further broadens the attack surface for illegal network activities. Currently, such problems mainly concern two aspects. The first aspect is the privacy problem brought about by the social function of the virtual space itself. Social interaction in the metaverse needs to map all kinds of information about real people or vehicles to their counterparts in the virtual space, so a virtual counterpart becomes a collection of personal private information, and attackers can use social engineering methods or network attack technologies to steal it. Meanwhile, some behavioral activities of the virtual counterparts in the metaverse also inadvertently leak a large amount of sensitive information, including driving behavior data from daily life. The second aspect is that the large number of information collection devices in the metaverse increases the potential for data leakage and abuse. In order to improve the user experience in the metaverse and realize high-level virtual-real interaction, metaverse applications use VR, AR, and supporting equipment to collect a large number of biometric signals, such as users' fingerprints, voiceprints, and faces. All identity information in the metaverse can be verified, and all actions leave digital trails. In a metaverse with rich scenes and prominent immersive experiences, intimate sensitive information is more likely to be leaked, and public network security issues become more prominent.

V. SOLUTIONS

A. Transfer Learning for Reducing Distribution Difference
The virtual transportation space uses advanced computer graphics, virtual reality, microsimulation, and other technologies to construct virtual scenes that simulate and represent complex and challenging actual scenes. Virtual data with detailed annotation can be generated automatically through rendering. Since virtual data and real data come from different spaces, there is a gap in distribution (i.e., data offset), and a model learned from virtual data cannot perform well when directly applied to real data. There are two main solutions for reducing the distribution difference between the virtual space and the physical space.
The first solution is image style transfer. It refers to the process of imitating the style of one image (the style image) and transferring it to another image (the content image). The goal of style transfer is to obtain a new image whose artistic effect is similar to the style image but whose content is consistent with the content image. With style transfer technology, the existing real data can be used to generate more realistic virtual data with different lighting, time periods, weather, and seasons. The most widely used technique in style transfer is the generative adversarial network (GAN) [47], [48]. Isola et al. [49] utilized conditional generative adversarial networks [50] to solve the image style transfer task; however, this requires a large amount of pairwise labeled training data. Meanwhile, the idea of dual learning [51], [52], originally applied in the field of natural language processing, has been introduced into GANs so that they can be trained in an unsupervised manner, as in CycleGAN [53], DiscoGAN [54], and DualGAN [55]. The cycle-consistency loss function was first proposed in CycleGAN. The basic idea is that dual learning generates a reconstructed image, and the input image naturally becomes the label of the reconstructed image, so that a loss can be calculated to replace the content loss proposed in previous works. The cycle-consistency constraint has demonstrated its efficiency in style transfer tasks in many studies. The UNIT style transfer framework proposed by Liu et al. [56] introduces the concept of a latent shared space and combines the variational autoencoder and the generative adversarial network to further improve the effect of unsupervised style transfer. AugGAN [57] extends CycleGAN with a semantic segmentation network, adopting different weight-sharing strategies at the head and tail of the decoder. It can greatly improve the training effect of the generated images on the object detection task.
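For concreteness, a minimal PyTorch sketch of the cycle-consistency loss described above is given below, assuming two hypothetical generators `G_v2r` (virtual to real) and `G_r2v` (real to virtual); the weighting factor follows common practice rather than any specific setting from the cited papers.

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_v2r, G_r2v, virtual_batch, real_batch, lam=10.0):
    """CycleGAN-style cycle loss: an image translated to the other domain
    and back should reconstruct the original, so the input image acts as
    the label of its own reconstruction (L1 distance)."""
    # virtual -> real -> virtual: reconstruction should match the input
    rec_virtual = G_r2v(G_v2r(virtual_batch))
    # real -> virtual -> real
    rec_real = G_v2r(G_r2v(real_batch))
    return lam * (F.l1_loss(rec_virtual, virtual_batch) +
                  F.l1_loss(rec_real, real_batch))
```

This term is added to the usual adversarial losses of the two generator-discriminator pairs during training.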
The other solution is the domain adaptation strategy. There are two basic concepts in the domain adaptation problem: 1) the source domain, which differs from the domain of the test samples but has rich supervision information and 2) the target domain, where the test samples are located, with unlabeled data or only a few labels. The source and target domains tend to have the same class categories but different distributions. In our problem, the source domain corresponds to the virtual space and the target domain corresponds to the physical space. According to the number of target domain labels, domain adaptation can be divided into two major areas: semi-supervised [58], [59] and unsupervised domain adaptation [60], [61]. For semi-supervised domain adaptation, the target domain data is partially labeled. In contrast, for unsupervised domain adaptation, samples from the target domain do not have any annotations. In recent years, deep domain adaptation algorithms have been proven to achieve a better adaptive effect by adding a domain adaptation layer and have gradually become a key research area; they consist of pretraining methods, sample weight adaptation methods, and feature transformation adaptation methods. The main idea of the pretraining methods [62], [63] is to fine-tune an existing pretrained model to realize knowledge transfer across domains. Besides, according to the difference between the two domains, the number of layers to freeze or fine-tune can be chosen freely, which provides a certain degree of adjustability. The sample weight adaptation methods [64], [65] mainly measure the importance of the data, assigning high weights to cross-domain data with high similarity and low weights to data with low similarity. The feature transformation adaptation methods [66], [67] need to find a suitable feature transformation, that is, a mapping through which the data of the two domains can be transformed into a feature space where their distributions are closer. There are two kinds of feature transformation methods: 1) explicit measurement and 2) implicit alignment. Explicit measurement methods directly reduce the cross-domain difference, such as MMD [68], MK-MMD [69], JMMD [66], and CORAL [70]. Implicit alignment methods integrate the game theory of the generative adversarial network into deep domain adaptation and narrow the distribution distance across domains through the mutual confrontation between a feature extractor and a discriminator [67], [71], [72].
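The following sketch illustrates one explicit measurement from the list above: a (biased) squared MMD estimate with an RBF kernel between virtual and real feature batches. The single bandwidth `sigma` is a hypothetical simplification of the multikernel variants cited.

```python
import torch

def rbf_mmd2(source_feats, target_feats, sigma=1.0):
    """Squared MMD with an RBF kernel: an explicit measurement of the
    distance between virtual (source) and real (target) feature
    distributions, typically minimized alongside the task loss."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then Gaussian kernel.
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    k_ss = kernel(source_feats, source_feats).mean()
    k_tt = kernel(target_feats, target_feats).mean()
    k_st = kernel(source_feats, target_feats).mean()
    return k_ss + k_tt - 2 * k_st
```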

B. Theoretical Understanding of Neural Networks
Due to the lack of understanding and analysis of the internal mechanism of neural networks, a deep neural network is usually regarded as a black-box model. Users can only observe the decision results of the model but cannot understand the reasons for its decisions, which greatly limits the effectiveness of model structure design and its development. If a breakthrough in theoretical interpretation research can be achieved, it will be of great significance to the further development and application of deep learning. On the one hand, based on the internal mechanism of deep neural networks, we can design model structures that are more suitable for specific tasks and set better parameters instead of relying on empirical tuning. On the other hand, we can overcome the vulnerability of deep neural networks so that they resist adversarial attacks.
Interpretability research has two main directions [73]: 1) intrinsic explanation and 2) post-hoc explanation. The purpose of the former is to make the model itself interpretable, allowing humans to understand the process and basis of its decisions without additional information. Li et al. [74] introduced part segmentation and an attention mechanism into fine-grained classification and adopted image labels to train an interpretable model with high accuracy. The interpretability of the model is manifested in its predicted segmentation maps and heat maps: the segmentation maps accurately encode the concepts appearing in the image, while the heat maps highlight the regions important for prediction. Chen et al. [75] designed a framework for image classification in a manner similar to human reasoning. By predefining some image patches in the training set as concept prototypes, the model learns the feature space of these concept prototypes during the training stage. Classification is performed in the testing stage by calculating the similarity between patches of the input image and all prototype patches. The interpretability of the model lies in the fact that the similarity matrix with the prototype patches can be upsampled back to the original image size for visualization, and the linear mapping of similarity scores itself provides a clear classification basis. The latter direction designs interpretability algorithms for trained models to explain the model's behavioral logic and decision-making basis. LIME [76] divides the original image into superpixels and obtains multiple perturbed images by random sampling. It then uses these images to learn a linear model to approximate the prediction results and obtain the importance score of each superpixel; the top-ranked superpixels are displayed as the interpretation result. RISE [77] randomly samples many masks to occlude the original input image and multiplies the masks with the original image to obtain the prediction scores of these perturbed images. The masks are weighted by the prediction scores to obtain the final interpretation result. The above methods offer ideas for exploring model interpretability, which can be used to guide the design of more effective model structures.
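As an illustration of such post-hoc methods, the sketch below implements a simplified RISE-style saliency estimate (random coarse masks, upsampled, weighted by the model's class score). The full RISE method additionally shifts masks randomly and normalizes by the expected mask value, which is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def rise_saliency(model, image, target_class, n_masks=1000, grid=7, p_keep=0.5):
    """Simplified RISE: occlude the input with random low-resolution
    masks and weight each mask by the model's score for the target
    class; the weighted average highlights important regions."""
    _, h, w = image.shape  # image: CHW tensor
    saliency = torch.zeros(h, w)
    with torch.no_grad():
        for _ in range(n_masks):
            # Coarse random binary mask, upsampled to image resolution.
            coarse = (torch.rand(1, 1, grid, grid) < p_keep).float()
            mask = F.interpolate(coarse, size=(h, w), mode="bilinear",
                                 align_corners=False)[0, 0]
            masked = image * mask  # occlude the image
            score = model(masked.unsqueeze(0)).softmax(dim=1)[0, target_class]
            saliency += score.item() * mask
    return saliency / n_masks
```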

C. Protecting Security and Preserving Privacy
Virtual-real interaction is the basis for various applications in PVITS, but a large number of virtual-real interactions inevitably lead to sensitive data leakage and privacy violations. The process of virtual-real interaction depends on cellular vehicle-to-everything technologies [78], [79], which further expands the attack surface for illegal network activities. Therefore, building a trusted ecosystem is a necessary and critical consideration in the development of metaverse technologies. Such trusted ecosystems can build algorithms, structures, frameworks, regulations, and policies into the development cycle of hardware and software to address the different elements embedded in the technology, such as security, privacy, and assurance. Specifically, in response to the privacy issues brought by the social functions of the metaverse, users can create multiple avatars to confuse and hinder aggressive behavior, or create separate copies of a space to isolate themselves from the surrounding environment. In response to the problem of data leakage during information collection, it is necessary to design an effective encoder-decoder model to improve the confidentiality, anonymity, and concealment of the collected data [80], [81].
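As one hypothetical instantiation of such an encoder-decoder model, the PyTorch sketch below compresses sensor frames into a compact latent code before transmission. It illustrates the architecture only; it is not a complete privacy mechanism and implies no encryption or anonymization guarantees.

```python
import torch.nn as nn

class PrivacyAutoencoder(nn.Module):
    """Minimal encoder-decoder sketch: sensor data is transmitted only
    as a compact latent code, and the decoder is held by the trusted
    consumer, so an eavesdropper on the channel never sees raw frames."""
    def __init__(self, in_ch=3, latent_ch=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_ch, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)          # compact code sent over the network
        return self.decoder(z), z    # reconstruction and latent code
```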

VI. POTENTIAL APPLICATIONS

A. Application in Collaborative Perception
The autonomous driving system uses onboard sensors, such as cameras and lidar, to detect objects around the vehicle to assist the driver in driving safely, but the detection range and accuracy of a single vehicle are limited. Especially, when the objects are severely occluded or have very small sizes, the detection performance will degrade rapidly. Such situations are very common in the real world, but also very dangerous. These blind spots are extremely difficult to deal with for an autonomous vehicle. Collaborative perception provides a way to solve the above problems. It can intelligently select and transmit the environment data among vehicles, which improves the accuracy and reliability of environment perception. Compared with the individual perception and multisource information fusion method, it has a wider detection range and higher accuracy, as shown in Fig. 3.
However, there exist many challenges in fully verifying the effectiveness of collaborative perception methods in actual traffic scenes. First, the economic problem of autonomous driving has not been well solved. To verify collaborative perception methods, the autonomous vehicle needs to be equipped with a variety of onboard sensors, such as cameras, millimeter-wave radar, and lidar. The hardware cost is high, and it is difficult to ensure the economy of the vehicle. Second, there are many limitations on verifying the methods in actual scenes. To achieve full verification, numerous scenarios need to be simulated, which is difficult and complex. Finally, safety is still the key factor that affects the actual deployment of automated driving. The inability to effectively deal with highly challenging traffic scenarios, such as bad weather conditions, small objects, and heavily occluded objects, would lead to serious traffic accidents, which may cause irremediable damage. Benefiting from PVITS, we construct the virtual space and generate a large-scale virtual dataset with detailed annotation from it to achieve comprehensive verification of collaborative perception methods. We use the CARLA and SUMO [82] simulation tools to achieve this. CARLA provides open digital assets (urban layouts, buildings, and vehicles) that are created for the validation of autonomous driving systems and can be used freely. The simulation platform supports flexible specifications of sensor suites, environmental conditions, full control of all static and dynamic actors, and map generation. SUMO can simulate the motion behavior model of vehicles to realize path regulation functions, such as planning and obstacle avoidance.
Specifically, we simulate various scenes in the simulation software to represent challenging real-world driving environments. A virtual transportation scene is built by adding traffic objects (roads, buildings, trees, flowers and plants, traffic signs, pavement markings, pedestrians, nonmotor vehicles, motor vehicles, etc.) to the road network and setting their physical attributes. Agents are used to represent objects and environmental factors in the virtual space. Each agent has its own attributes, and multiagent simulation is conducted according to physical laws. We can put static agents at the corresponding positions of the virtual traffic road network and let dynamic agents (pedestrians, vehicles, etc.) move in the virtual space. A communication mechanism exists between related agents. Dynamic agents have a motion behavior model to realize path planning, obstacle avoidance, and other functions. In different areas and periods, virtual scenes have different levels of congestion, just like actual traffic conditions. Furthermore, there are about three connected vehicles on average, with a minimum of two and a maximum of seven vehicles per frame. Each autonomous vehicle is equipped with four cameras that can simultaneously cover a 360° field of view. It is also equipped with a 64-channel LiDAR, with 1.3M points per second and a 120-m capturing range, and a GPS/IMU sensor, with 20-mm positional error and 2° heading error.
We display some simulated virtual data in Fig. 4. It is easy to simulate different weather conditions in the virtual space, such as foggy days, sunny days, rain, and night. In addition, the natural laws of the virtual space are consistent with those of the physical space. For example, in the first column of Fig. 4 there are shadows on a sunny day, the third column shows that objects are blurred on a foggy day, and the road is wet after the rain in the last column. Besides, different actual traffic objects can be realistically simulated in the virtual space, such as cars, trucks, motorcycles, and pedestrians, as shown in Fig. 5. No matter how bad the lighting and weather conditions are, or how blurred the image details are, it is convenient and rapid to automatically obtain detailed and accurate annotation information. We can design different annotation standards according to the vision task. In general, the information that can be labeled includes object bounding boxes (2-D/3-D), object regions, category types, motion trajectories, image semantic segmentation, depth, optical flow, etc.
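For reference, here is a minimal sketch against the CARLA Python API (assuming a CARLA server on localhost:2000 and a recent 0.9.x release) showing how weather conditions can be varied independently and how one camera of the rig described above can be attached; all parameter values are illustrative.

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Controllable environmental conditions: fog, rain, and sun angle can be
# varied independently while everything else in the scene is held fixed.
world.set_weather(carla.WeatherParameters(
    cloudiness=80.0, precipitation=60.0, precipitation_deposits=40.0,
    fog_density=20.0, sun_altitude_angle=15.0))

# Spawn an ego vehicle and attach one RGB camera of the four-camera rig.
bp_lib = world.get_blueprint_library()
vehicle = world.spawn_actor(bp_lib.filter("vehicle.*")[0],
                            world.get_map().get_spawn_points()[0])
cam_bp = bp_lib.find("sensor.camera.rgb")
cam_bp.set_attribute("image_size_x", "1280")
cam_bp.set_attribute("image_size_y", "720")
camera = world.spawn_actor(cam_bp,
                           carla.Transform(carla.Location(x=1.5, z=2.4)),
                           attach_to=vehicle)
# `image.frame` is the frame counter in CARLA >= 0.9.5.
camera.listen(lambda image: image.save_to_disk("out/%06d.png" % image.frame))
```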

B. Application in Object Detection
It is an attractive way that we can employ the constructed virtual space to simulate and reproduce various actual challenging traffic environments. However, a vision model trained on virtual data cannot be directly applied in the real traffic space, since there is a domain gap between the virtual space (source domain) and the physical space (target domain). To address this problem, we propose a synthetic-to-real adaptive learning method (S2RAL) for cross-domain object detection inspired by PVITS. We first review the problem formulation. Suppose that, due to time and labor cost constraints, the actual traffic scene does not have any labeling information; that is, the virtual data $D_v = \{(I_v, B_v, C_v)\}$ has precise bounding box and category labels, while the real data $D_r = \{I_r\}$ is without any annotation. $B_v$ represents the bounding box annotations and $C_v$ denotes the relevant category labels in the virtual space. The ultimate goal of the proposed S2RAL is to achieve a domain-invariant detector by utilizing $D_v$ and $D_r$.
In particular, we initialize two parallel identical detectors, i.e., a model in the virtual space (VM) and a model in the physical space (PM), as shown in Fig. 6. The VM is trained by standard gradient updates, and the PM is optimized as the exponential moving average (EMA) of the weights of the VM. The VM utilizes the available virtual data with precise annotations $D_v = \{(I_v, B_v, C_v)\}$ to optimize the model. Since the real data $D_r$ does not have any annotations, the PM is relied upon to generate pseudo-labels for the physical space so that the VM can also be trained on real data. The PM predicts the pseudo-labels to optimize the VM, while the VM transfers the extracted knowledge back to the PM via EMA. Iteratively, this enables the PM to predict more precise pseudo-labels.
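A minimal sketch of the EMA update described above, assuming `vm` and `pm` are two structurally identical PyTorch detectors (buffers such as batch-norm statistics are omitted for brevity):

```python
import torch

@torch.no_grad()
def ema_update(pm, vm, decay=0.999):
    """Update the physical-space model (PM) as an exponential moving
    average of the virtual-space model (VM) weights: the VM is trained
    by gradient descent, while the PM is updated only through EMA."""
    for p_pm, p_vm in zip(pm.parameters(), vm.parameters()):
        p_pm.mul_(decay).add_(p_vm, alpha=1.0 - decay)
```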
Since the above process mainly utilizes the available virtual data (even the pseudo-label learning on the physical space primarily exploits the knowledge that the VM learned from the labeled virtual data), the VM and PM are both easily biased toward the virtual space. To deal with this, we use an adversarial learning strategy to align the distributions across the two spaces. A domain discriminator is added after the feature extractor E (shown in Fig. 6) of the VM and is trained to discriminate which domain a feature comes from (virtual or real). A gradient reversal layer [83] is used between the feature extractor and the discriminator to achieve adversarial learning: in the forward pass, the input is passed through unchanged, and in the backward pass, the gradient is negated. Thus, the discriminator aims to minimize the discriminator loss while the feature extractor aims to maximize it. This virtual-real interaction allows the VM to reduce the domain shift and benefits the PM in generating more accurate pseudo-labels.
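The gradient reversal layer [83] admits a compact implementation. The sketch below follows the standard formulation (identity in the forward pass, negated and scaled gradient in the backward pass) rather than any S2RAL-specific code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated (and scaled) gradient in
    the backward pass, so the feature extractor is trained to maximize
    the domain discriminator's loss while the discriminator minimizes it."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: domain_logits = discriminator(grad_reverse(E(images)))
# followed by a binary cross-entropy loss over virtual/real domain labels.
```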
To validate the proposed S2RAL, we use the virtual SIM10k as the source domain and the real Cityscapes as the target domain. SIM10k [84] is a simulated dataset collected from the video game Grand Theft Auto. Cityscapes [85] is a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities. The training data contains the virtual SIM10k dataset (10k images) with accurate annotations and the real Cityscapes dataset (2975 images) without any annotations.

VII. CONCLUSION
With the support of metaverse technology, ITS can comprehensively enhance traffic operation efficiency and promote traffic safety. Built upon MTS, the proposed PVITS aims to overcome several major challenges in environment perception, such as the generalization of vision methods to various traffic conditions, the effectiveness of collaboration among different autonomous transport devices using computer vision techniques, and the adequate fusion of multisensor perceptual data. It mainly consists of construction of virtual transportation space, model learning based on computational experiments, and feedback optimization based on parallel execution. Several technical challenges need to be overcome to ensure that PVITS is reliable and practical. Besides, we present some potential solutions and applications to tackle these technical challenges and to fully exploit the superior characteristics of PVITS. Further research is needed to apply PVITS to many other vision tasks in order to evaluate its benefits.