On the Exploitation of 5G Multi-Access Edge Computing for Spatial Audio in Cultural Heritage Applications

This work presents a service for improving cultural heritage experiences that exploits the advantages of the 5G paradigm. In a scenario where many users must be served by a real-time solution that, in turn, has to work on different devices, the potential of 5G technology is well suited. In particular, moving the computation to the edge of the network makes the resources needed for binaural spatial audio rendering available independently of the client device; at the same time, it guarantees real-time availability of the data, since the core network, with its impairments, is not involved. This work demonstrates how 5G could be a critical enabler for delivering low-latency services at guaranteed levels, data-centric services, differentiated customer experiences, improved security and reduced costs to the users.


I. INTRODUCTION
The 5th generation of mobile networks promises great changes that may lead to a different way of understanding mobile communications, moving from the need to connect people to the need to connect their worlds. The business ecosystem around this paradigm involves many different vertical industries in various fields.
Three categories of 5G services can be identified, [1]:
• enhanced Mobile Broadband (eMBB), which aims to support services demanding high bandwidth;
• ultra Reliable and Low Latency Communications (uRLLC), introduced to cope with safety and mission critical services by guaranteeing high reliability and low latency;
• massive Machine Type Communications (mMTC), an enabler for IoT services that require a high density of connected devices.
Vertical services exploited by 5G trials should include media and entertainment, public safety, e-health, automotive, transport and logistics, and Cultural Heritage (CH), [2]. The latter, in particular, has been significantly affected by the progress of digital information, especially concerning its dissemination [3], which offers new technological possibilities for developing, e.g., the market of tourist services [4]. CH organizations have to address new users' needs by creating innovative applications [5], such as those based on Augmented and Virtual Reality (AR/VR). AR technology gives a different perception of reality, as it enriches reality with a computer-generated layer containing visual, audio, and tactile information, while a "virtual" representation of a classic museum allows access to aspects of the artifact that may otherwise be hidden [6].
During the past years, the main aim of AR/VR applications for CH has changed from a mere virtual recreation of objects to display to the creation of an entire virtual environment able to disseminate and teach culture. The idea is the opposite of a "dead museum": users must not be exposed to an accumulation of 3D heritage objects, but feel and understand another culture through those items. An important aspect is the relation between AR/VR and education. This new way of presenting culture enhances the learning process, encouraging students and researchers through stimulating methods of presentation of archival materials and historical events. Users can therefore travel through space and time without moving from their home [7]. Numerous AR/VR applications exist for CH or for tourists' enjoyment of places with a rich past, allowing a realistic navigation of environments that no longer exist or that may be inaccessible, [8], [9], [10], [11], [12].
Most of the previously cited experiences do not take into account the advantages that may arise from a proper exploitation of sounds together with visual effects, except for audio content presentation purposes. By contrast, an acoustic guide, properly placed in the virtual space, may drive the user in a certain direction, or the acoustic landscape of a specific historical period could be reproduced to improve the virtual experience. To the authors' knowledge, only a few experiments have been carried out in this context.
The presented service implements spatial audio rendering through 5G in order to exploit possible advantages such as the reduction of the computational requirements for the local devices and the increase in the number of users that may request the same service on smaller and lighter AR/VR devices.

A. MOTIVATION
5G technology promises impressive figures, such as a guaranteed latency of a few milliseconds and a throughput higher than 1 Gb/s, [13], which may perfectly fit the requirements coming from the exploitation of AR/VR solutions for CH.
To be more specific, a binaural sound rendering solution that must act in real time as users move their heads or change their position, together with the addition of video streaming to the audio, would require ultra-low latency and high throughput for a proper experience.
Moreover, since the computational requirements of real-time adaptation of both audio and video streaming are high, the scenario under analysis may also benefit from a recently introduced paradigm in the context of 5G networks, known as Multi-Access Edge Computing (MEC). A MEC approach in such a context also has the advantage of allowing the multimedia elaboration to be offloaded to the edge of the network, making the user devices less complex and less power hungry.
MEC enables services and applications to be hosted 'on top' of the mobile network, i.e. above the network layer. These services and applications can benefit from being in close proximity to the users and from receiving local connectivity, enabling new business opportunities. In malls, [14], university areas, [15], or museums, [16], which are filled with high-value users, 5G MEC can provide value-added services, such as local cache service, location service, and targeted advertising. At business campuses, factories, and seaports, 5G MEC can provide enterprise-level services, such as virtual private networks, service hosting, and dedicated applications. Thanks to the possibility of drastically reducing the experienced latency and offloading computation to the edge, MEC has emerged as an enabler for a wide range of novel services, including Industrial IoT [17], low-latency mission critical applications [18], vehicular communications [19], and multimedia services [20].
Referring to the proposed scenario, by adopting edge computing, the limit of 40 ms between head movement and spatialized sound, indicated in [21] as the limit above which a processing/transmission delay is perceivable, can also be exploited for improving signal processing algorithms, given that the communication takes place on a network conforming to the uRLLC paradigm.
The advantages of exploiting 5G for AR/VR in CH thus need to be investigated further, because 5G CH AR/VR could represent a possible killer application for this communication technology.
Finally, some experiences related to exploitation of 5G for AR/VR applications were conducted by the authors within the 5G trial carried out in L'Aquila, [22].

B. PREVIOUS WORKS
Real-time binaural rendering for AR/VR enjoyment of CH through 5G is a topic involving various research aspects. In this section, a brief state of the art of the most relevant topics involved in the project is presented.

1) Spatial audio for AR/VR CH applications
As previously stated, the importance of sounds in the context of AR/VR for CH is not yet fully understood and the research literature is scarce.
Among the few exceptions are the work in [23], where the authors present a signal processing method for fast real-time binaural synthesis whose main target application is the enjoyment of cultural heritage, and the work in [24], which presents a smart headphone set that remotely acquires the orientation of the listener's head and generates an audio output to attract the tourists' attention toward specific points of interest in the 3D space. An interesting analysis of hardware and software requirements for this purpose is presented in [25], without references to real applications.
In [26], the authors propose an interaction system for attracting the visitor toward specific cultural attractions through 3D audio. Most of the paper focuses on the design and development of a system for gathering head orientation in real time, while, at the end, interesting tests are carried out to evaluate the advantages of spatial sounds with respect to stereo ones and the need for video to improve audio source localization. The Ghost Orchestra is an interesting project involving the exploitation of binaural spatialization and VR headsets (i.e. Oculus) for cultural heritage, as described in [27].
The interest of the research community in audio spatialization has its origins in a paper written in the 60s by Schroeder, [28], introducing the idea of artificial reverberation based on digital signal processing, [29]. Indeed, reproducing the behavior of acoustic waves propagating indoors ideally requires spatializing each reverberation, i.e. reproducing at the ears the sensation of a 3D space. Since then, applied research and the market have moved toward different applications, ranging from games, [30], to electroacoustic music, [31], and soundscape design, [32], up to sonification, [33]. All these scenarios have in common the initial exploitation of the stereophonic approach, which provides a multi-channel reproduction system going from the traditional two-channel stereo to the modern configurations with five, seven or more loudspeakers. This can be considered a channel-based approach that ties the position of the sound to the signal relations between loudspeakers, [34], and it is the one still currently used for cinema, home theater and pure audio content. The need to properly preserve the spatial cues of an auditory scene has led to the separation, in coding, of source signal and source location and to the variety of spatialization techniques partially listed in the paragraph below, which have in turn significantly increased the potential of video games, audio games, musical expression, etc. Some remarkable results of applying spatial audio in these contexts are Minecraft by Mojang AB for games and Cosmic Pulses by Karlheinz Stockhausen for music.

3) Binaural sound rendering solutions
Binaural sound reproduction through headphones allows full control of sound synthesis and binaural cues to be guaranteed, at the expense of requiring users to wear devices that may be considered intrusive, especially when noise cancellation and ear occlusion are required by the application. In order to achieve auralisation through headphones, head-related transfer function (HRTF) filters are commonly used for the left and right ears because, with headphones, the effect of the head and the pinna (with earplugs) is bypassed. They are then convolved with an anechoic sound signal for audio rendering. Most of the literature focuses on the exploitation of non-personalized HRTFs, recorded using a dummy head (e.g. the KEMAR manikin, [35]) on a discrete spatial grid in both azimuth and elevation, [36]. Given the apparent impossibility of overcoming problems of unnatural coloration of the frequency spectrum and localization degradation, recent literature is instead focusing on the exploitation of personalized HRTFs, which still require a long evaluation procedure (see e.g. [37]). It is worth noting that the availability of HRTFs coming from the last decade of research has led to a standardization process known as the Spatially Oriented Format for Acoustics (SOFA), [38]; a personalized version is claimed to be available in [39]. In both cases, given the discrete points at which HRTFs are measured/computed, interpolation techniques have to be employed, [40].
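As a minimal sketch of the two steps just described (interpolating between measured directions and convolving with an anechoic signal), the following Python fragment blends two neighbouring head-related impulse responses linearly and applies the result to a mono signal. The two-tap impulse responses, the 30-degree grid and all function names are invented placeholders, not measured data or part of any library; time-domain linear blending is only the simplest possible interpolation and is assumed here for illustration.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution; output has len(signal) + len(kernel) - 1 samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def interpolate_hrir(hrir_a, hrir_b, weight_b):
    """Linear blend of two equal-length impulse responses from neighbouring directions."""
    return [(1.0 - weight_b) * a + weight_b * b for a, b in zip(hrir_a, hrir_b)]


# Hypothetical toy impulse responses on a two-point azimuth grid (degrees).
HRIR_LEFT = {0: [1.0, 0.0], 30: [0.5, 0.5]}
HRIR_RIGHT = {0: [1.0, 0.0], 30: [0.0, 1.0]}


def render_binaural(mono, azimuth_deg, grid=(0, 30)):
    """Return (left, right) ear signals for a source between the two grid points."""
    low, high = grid
    weight = (azimuth_deg - low) / (high - low)
    h_left = interpolate_hrir(HRIR_LEFT[low], HRIR_LEFT[high], weight)
    h_right = interpolate_hrir(HRIR_RIGHT[low], HRIR_RIGHT[high], weight)
    return convolve(mono, h_left), convolve(mono, h_right)


left, right = render_binaural([1.0, -1.0], azimuth_deg=15)
```

In a real system the impulse responses would typically be read from a measured set and be a few hundred taps long, so a fast (FFT-based) convolution would replace the direct form used here.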
Most of the available tools for binaural audio reproduction that move the position of one or more objects are based on procedures for going from surround system solutions to binaural sound. Each of them relies on the exploitation of static, dynamic and environmental cues, [41]. Static cues are given by the HRTF or the head-related impulse response (HRIR), involving all the issues related to the physical characteristics of each individual; dynamic cues are related to the motion of the listener; and environmental cues are given by the room transfer function (RTF) or room impulse response (RIR).
For instance, with the purpose of downmixing 5.1 to binaural, a solution exploiting virtual loudspeakers and HRIRs, also taking into account the room response and head tracking data, is proposed in [42].
The basic procedure for Ambisonic binaural rendering defines a virtual loudspeaker layout, where the output of each virtual loudspeaker is computed as a linear combination of the B-format channels. HRTFs are then applied to each virtual loudspeaker; the resulting left-ear signals are summed together, as are the right-ear signals, and the two sums feed the left and right channels of the headphones, respectively, [43]. An optimized solution has been recently proposed in [44], where binaural decoding of Ambisonic soundfields is achieved based on pre-computed, spherical harmonic-encoded binaural filters. The authors of [45] presented a vector base amplitude panning (VBAP) implementation for 3D head-tracked binaural rendering, where the binaural implementation of VBAP is achieved in the same way as virtual Ambisonics.
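The virtual-loudspeaker procedure described above can be illustrated with a small, horizontal-only, first-order sketch in Python. Everything specific in it is an assumption made for illustration: the B-format convention (W carrying a 1/sqrt(2) gain), the basic decoder weights, the square loudspeaker layout, azimuths measured counter-clockwise from the front, and the single-tap placeholder impulse responses. It is not the method of [43] or [44].

```python
import math


def convolve(signal, kernel):
    """Direct-form linear convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def encode(mono, azimuth):
    """Encode a mono signal into horizontal B-format (W, X, Y); W has a 1/sqrt(2) gain."""
    w = [s / math.sqrt(2.0) for s in mono]
    x = [s * math.cos(azimuth) for s in mono]
    y = [s * math.sin(azimuth) for s in mono]
    return w, x, y


def decode(w, x, y, speaker_azimuths):
    """Basic decoder: each virtual loudspeaker feed is a linear combination of W, X, Y."""
    n = len(speaker_azimuths)
    return [
        [(math.sqrt(2.0) * wi + 2.0 * (xi * math.cos(a) + yi * math.sin(a))) / n
         for wi, xi, yi in zip(w, x, y)]
        for a in speaker_azimuths
    ]


def to_binaural(feeds, hrirs):
    """Convolve each feed with its (left, right) impulse responses and sum per ear.

    All impulse responses are assumed to have the same length.
    """
    length = len(feeds[0]) + len(hrirs[0][0]) - 1
    left, right = [0.0] * length, [0.0] * length
    for feed, (h_left, h_right) in zip(feeds, hrirs):
        for n, value in enumerate(convolve(feed, h_left)):
            left[n] += value
        for n, value in enumerate(convolve(feed, h_right)):
            right[n] += value
    return left, right


# Four virtual loudspeakers on a square around the listener.
SPEAKERS = [math.radians(d) for d in (45, 135, 225, 315)]
# Placeholder single-tap responses: loudspeakers on the left are louder in the left ear.
HRIRS = [([1.0], [0.5]), ([1.0], [0.5]), ([0.5], [1.0]), ([0.5], [1.0])]

w, x, y = encode([1.0, 0.0], math.radians(90))  # source directly to the left
left, right = to_binaural(decode(w, x, y, SPEAKERS), HRIRS)
```

With these toy values the left-ear signal is stronger than the right-ear one, as expected for a source on the left. Head rotation could be handled by rotating the B-format channels before decoding, which keeps the loudspeaker layout and impulse responses fixed.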
With the purpose of reducing computational requirements, various solutions that simplify the HRTFs and RIRs have also been investigated, see for instance [46], [47] and references therein, most of them at the expense of perceived quality.
Finally, referring to [44] and [45], in the light of the main topic of this work, it is worth noting that the authors propose opposite solutions regarding the device in charge of the spatial sound computations. The first reference indeed suggests implementing the VBAP rendering on an embedded Linux device that is placed locally. Advantages and drawbacks of the solution were discussed in terms of proper reproduction of the virtual source and response of the system to head-tracking data. It is interesting to observe that the authors suggest the exploitation of second and third order Ambisonics to increase accuracy, stating that this would require computational resources not yet available on the CPU of an embedded system. On the other hand, the authors in [45] investigate the opportunity of exploiting the distributor side. This choice was mainly justified by economic reasons, and drawbacks were considered only in terms of the absence of information about the listener environment. Problems arising in both cases could be overcome by exploiting the architecture proposed in this paper.

4) 5G enabled AR/VR
According to the recent Molex State of 5G survey, [48], AR and VR applications top the list of primary use cases for 5G technology in consumer applications.

VOLUME XX, 202X
AR and VR experiences introduce many technical challenges related to the need to combine and synchronize the real or virtual world with the user's motions. This requires high computational resources for rendering, which can benefit from moving the computational tasks, partially or totally, to the edge. As a consequence, introducing 5G in this scenario would make it possible to satisfy the users' quality of experience (QoE) by guaranteeing very low latency and thus a realistic experience. The exploitation of the concept of private network is the solution proposed by Ericsson [49], according to which VR services are delivered either through enterprise dedicated private networks or through a logical network slice created on top of the existing physical public network.
Despite the large amount of informative articles available on the web about the advantages 5G could bring to AR/VR applications (e.g. [50], [51], [52]), scientific papers or available products exploiting these two paradigms together are quite limited.
An interesting discussion about the advantages of 5G and MEC in the context of Mobile Augmented Reality (MAR) is presented in [53], where the authors also discuss a possible application to tourism, noting that these applications are still in their infancy. In [54], the important role of 5G networking for VR games is demonstrated by moving game servers without service interruption. An interesting project is described in [55], where public trials demonstrating the advantages of 5G for smart tourism are reported. The authors in [56] and [57] discuss in detail the potential of 5G and Beyond 5G (B5G) cellular networks for realizing mobile web augmented reality, presenting encouraging results.

C. PROBLEM STATEMENT AND MAIN CONTRIBUTIONS
This paper focuses on the exploitation of an audio spatialization service, supported by the 5G network architecture, for improving AR/VR experiences for Cultural Heritage. The main topics involved in the scenario under analysis, which are discussed in the remainder of the paper, are:
• exploitation of Resonance Audio for a CH VR application scenario;
• definition of a 5G-based MEC architecture for VR support;
• analysis of the advantages achievable through the proposed solution.
It is worth noting that, in this phase of the research, we assume a perfect and reliable behavior of the eventual head orientation and localization system, i.e. we do not address head tracking or the issues it may raise, such as the adaptation of the binaural sound to head movements. We are nevertheless confident of the proper behavior of the orientation/localization system on the specific device we are using for AR. Moreover, we do not face the problem of active noise cancellation for excluding sounds from the real world, since our aim is to exploit lo-fi devices.

II. SCENARIO
We assume as a general scenario a museum where multiple users exploit AR devices for moving around and getting dedicated information as a function of what they are observing and of their preferences, possibly expressed in advance. The reference scenario for each user is sketched in Figure 1. The service consists of a bidirectional communication between a client, which gets environmental information using different sensors (e.g., cameras, gyro, beacons, etc.), and a MEC application able to process the information in order to produce a proper spatialized stream. As a result, the client application can reproduce a personalized sound built in real time based on the user's position in the museum space.

III. 5G SUPPORTING SPATIAL AUDIO AT EDGE
Audio applications dealing with binaural sounds require very fast computation and playback of sounds to fulfill users' overall expectations. This can be achieved either by computing audio on the user's device or by delegating computation to a remote node. The former option can offer the most reliable solution, but it requires energy to perform the computation. More energy means more weight to be carried by the end user (e.g., batteries) or reduced battery life. The latter option relaxes power constraints but introduces a transmission delay to move data through the network. The time needed to reproduce a valid sound sample is the time to compute the sound plus twice the network delay, needed to move information forward and backward between the end user device and the computational node.
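The delay composition just described (computation time plus two network traversals) can be written down as a small Python check against the 40 ms perceptual limit quoted from [21]. The function names and the numerical values in the example are illustrative assumptions, not measurements from this work.

```python
def motion_to_sound_ms(compute_ms, one_way_network_ms):
    """Total delay: processing time plus uplink and downlink network traversals."""
    return compute_ms + 2.0 * one_way_network_ms


def within_budget(compute_ms, one_way_network_ms, budget_ms=40.0):
    """True if the total delay stays within the perceptual limit (40 ms, from [21])."""
    return motion_to_sound_ms(compute_ms, one_way_network_ms) <= budget_ms


# Illustrative values only: same processing time, different network distance.
edge_delay = motion_to_sound_ms(compute_ms=10.0, one_way_network_ms=5.0)
cloud_delay = motion_to_sound_ms(compute_ms=10.0, one_way_network_ms=25.0)
```

Under these assumed figures the edge deployment leaves 20 ms of headroom, which is the margin that could be spent on heavier signal processing, while the remote deployment already exceeds the limit.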
With respect to previous mobile communications systems, 5G offers an unprecedented level of flexibility to fulfill service-specific requirements in terms of throughput, latency, and reliability. Such flexibility is achieved through novel techniques for the radio transmission and by novel architectural approaches. Two enabling technologies from the architectural viewpoint are: Software Defined Networking (SDN), which allows control and user plane information to be separated and the behavior of the network to be flexibly adapted via software; and Network Function Virtualization (NFV), which allows traditional network functionalities to be implemented and orchestrated over virtualized infrastructures. Due to the virtualized nature of the 5G architecture, the core network is realized as a set of services, among which the main ones are: (i) the Access and Mobility Management Function (AMF), responsible for handling connection and mobility management tasks; (ii) the Session Management Function (SMF), responsible for managing connections and session contexts; (iii) the User Plane Function (UPF), the anchor element between the mobile and data networks. Figure 2 shows the considered 5G architecture for the spatial audio service. Normally, the traffic generated by the users has to traverse the core network in order to leave the mobile network and reach the data network. However, this may result in a delay too high to be compatible with the service under consideration (red line in Figure 2). In order to overcome this issue and fully enable MEC capability, an instance of the UPF is deployed at the edge of the network. Since it is possible to separate control and user planes, the control plane is still redirected to the 5G Core to perform signaling and control, while the intelligent UPF (I-UPF) at the edge is used as an anchor towards the data network for the user plane.
This way, the user plane traffic can leave the mobile network immediately after reaching the 5G base station (i.e., gNodeB) and flow towards the elaboration server at the edge, thus reducing latency (green line in Figure 2).
The computing infrastructure comprises the computing resources placed at the edge of the network (i.e. in proximity of the gNodeBs) allowing MEC, and computing resources available in a remote cloud. On top of the network and computing infrastructure is a Service and Orchestration layer which is responsible for the management of the deployment and life-cycle of the spatial audio service.
The Service and Orchestration Layer is composed of the MEC Platform Manager and the DASH Service Platform. The MEC Platform Manager provides the upper layer with a uniform northbound interface API, decoupling the Service Layer from the Computing Resources Layer. Its functions include:
• Life-cycle Management: this process takes care of the entire MEC application life-cycle, treating each MEC application instance as a Virtual Network Function (VNF) instance and performing health checks and auto healing for high availability;
• Policy Enforcement: it manages network and application connectivity, according to the network policies;
• Inter-host Management: this block is responsible for enabling networking and communications between different MEC hosts, exposing them to the Service Layer.

IV. REFERRED SPATIALIZATION TOOLS
Bringing rich, dynamic audio environments into AR/VR experiences without affecting performance can be challenging.
There are often few CPU resources allocated for audio, especially on mobile, which can limit the number of simultaneous high-fidelity 3D sound sources for complex environments. Various tools implementing binaural spatialization were investigated, and their advantages and disadvantages discussed, before making a choice, namely Resonance Audio (Google), 3D Tune-In Toolkit, [58], Audio Spatializer (Oculus, Facebook), Steam Audio (Valve Corporation), SOFAlizer [59], and Spatial Audio Framework [60].
The choice fell on Resonance Audio, [61]. Resonance Audio is a multi-platform spatial audio SDK, delivering high fidelity at scale. The Resonance Audio SDK uses highly optimized digital signal processing algorithms based on higher order Ambisonics to spatialize hundreds of simultaneous 3D sound sources, without compromising audio quality. The SDKs run on Android, iOS, Windows, MacOS and Linux platforms and provide integration for Unity, Unreal Engine, FMOD, Wwise and DAWs. Native APIs for C/C++, Java and Objective-C are also provided, together with the full source code of the C++ library.
As part of the open source project, a reference implementation of YouTube's Ambisonic-based spatial audio decoder is provided. Using this implementation, developers can easily render Ambisonic content in their VR media and other applications, while benefiting from the open source, royalty-free model of Ambisonics. The project also includes encoding, sound field manipulation and decoding techniques, as well as head related transfer functions (HRTFs) used to achieve rich spatial audio that scales across a wide spectrum of device types and platforms. Lastly, the entire library of highly optimized DSP classes and functions is available: this includes resamplers, convolvers, filters, delay lines and other DSP capabilities.

V. SPATIALIZATION SERVER APPLICATION
In order to demonstrate the proposed service, a custom application has been developed. The application, written in C++, uses a custom Linux build of the Resonance Audio library and some utility libraries, such as the Boost C++ Libraries, in order to handle the TCP socket communication between client and server in the easiest way, [62]. Figure 3 summarizes the simple behavior of the developed service: a client streaming application (e.g., an AR mobile application) connects to the spatialization server, which waits for spatialization processing requests from connected clients.
The sample application starts with the generation of a stereo sine tone at 500 Hz and stores the generated samples in a buffer. The duration and the sample_rate of the generated samples are expected as parameters. Algorithm 1 shows the sample generation process. Once a socket endpoint has been set up, the main loop of the spatialization server application acts as summarized by Algorithm 2.
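For illustration, the sample generation step can be sketched in a few lines of Python. This is not the authors' Algorithm 1 (the original application is written in C++); the function name, the interleaved left/right buffer layout and the unit amplitude are assumptions, while the 500 Hz frequency and the duration and sample_rate parameters follow the description above.

```python
import math


def generate_stereo_sine(duration, sample_rate, frequency=500.0, amplitude=1.0):
    """Return an interleaved stereo buffer [L0, R0, L1, R1, ...] holding a sine tone.

    duration is in seconds and sample_rate in Hz; both channels carry the same signal.
    """
    num_frames = round(duration * sample_rate)
    buffer = []
    for n in range(num_frames):
        value = amplitude * math.sin(2.0 * math.pi * frequency * n / sample_rate)
        buffer.extend((value, value))  # left sample, then right sample
    return buffer


# 10 ms of tone at 44.1 kHz: 441 frames, i.e. 882 interleaved samples.
tone = generate_stereo_sine(duration=0.01, sample_rate=44100)
```

A buffer of this kind can be generated once at start-up and then read block by block in the main loop.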
The main loop of the application waits for audio source position parameters sent through the socket by the connected client and computes the spatialized streams. When a set of spatialized samples is ready, it is transmitted to the client. The flow diagram shown in Figure 3 summarizes the behavior of the spatialization server application.
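The request/response loop described above can be sketched with Python's standard socket module. This is a stand-in for illustration, not the authors' Algorithm 2: the original server uses Boost and Resonance Audio in C++, whereas here the wire format (three little-endian 32-bit floats per request), the block size, the mono source buffer, all function names and, above all, the constant-power panner used in place of the real binaural renderer are assumptions.

```python
import math
import socket
import struct

POSITION = struct.Struct("<3f")  # hypothetical request: source x, y, z in metres
FRAMES_PER_BLOCK = 256           # hypothetical number of frames returned per request


def recv_exact(conn, size):
    """Read exactly size bytes, or return None if the peer closed the connection."""
    data = b""
    while len(data) < size:
        chunk = conn.recv(size - len(data))
        if not chunk:
            return None
        data += chunk
    return data


def spatialize(mono_block, x, y):
    """Placeholder renderer: constant-power left/right panning from the source azimuth.

    x points to the listener's right and y to the front; sources behind the
    listener are simply clamped to the nearest side.
    """
    azimuth = math.atan2(x, y)
    pan = max(-1.0, min(1.0, azimuth / (math.pi / 2.0)))
    theta = (pan + 1.0) * math.pi / 4.0  # 0 = fully left, pi/2 = fully right
    gain_left, gain_right = math.cos(theta), math.sin(theta)
    stereo = []
    for sample in mono_block:
        stereo.extend((sample * gain_left, sample * gain_right))
    return stereo


def serve_client(conn, mono_source):
    """Wait for a position, spatialize the next block of the source, send it back."""
    cursor = 0
    while True:
        request = recv_exact(conn, POSITION.size)
        if request is None:  # client disconnected
            break
        x, y, _z = POSITION.unpack(request)
        block = [mono_source[(cursor + i) % len(mono_source)]
                 for i in range(FRAMES_PER_BLOCK)]
        cursor = (cursor + FRAMES_PER_BLOCK) % len(mono_source)
        stereo = spatialize(block, x, y)
        conn.sendall(struct.pack("<%df" % len(stereo), *stereo))


def run_server(host, port, mono_source):
    """Accept one client at a time on a TCP endpoint and serve it until it leaves."""
    with socket.create_server((host, port)) as listener:
        while True:
            conn, _address = listener.accept()
            with conn:
                serve_client(conn, mono_source)
```

A client would send one 12-byte position message and then read FRAMES_PER_BLOCK × 2 × 4 bytes of interleaved stereo samples for each block it wants to play.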

VI. PERFORMANCE ANALYSIS & RESULTS
In order to offer a performance analysis of the advantages introduced by MEC, in contrast to what happens with a cloud deployment as shown in Figure 4, in the proposed spatial audio use case we adopt the model proposed in [63].
The key performance indicator for the overall system is the achievable data rate available to serve the spatial audio users, which can be expressed as follows:

$$R = \min\left(BR,\ \frac{c \cdot MSS}{RTT \times \sqrt{PL}}\right) \tag{1}$$

where $c$ is a constant depending on the specific TCP implementation, ack strategy, loss mechanism and congestion avoidance algorithm, [64], whose typical values lie between 0.9 and 1.2; $MSS$ is the maximum TCP segment size; $RTT$ is the round trip time between the users and the spatial audio elaboration node; $PL$ is the packet loss ratio; and $BR$ is the available data rate, i.e. the network capacity, which in the following we assume to be 1 Gbps per radio access point.

FIGURE 4: MEC vs Cloud service deployment

As can be noticed from Eq. 1, the min operator models a possible bottleneck effect of the network. However, in the considered scenario with private dedicated indoor 5G coverage and high data rate, this is not likely to happen. The term $RTT \times \sqrt{PL}$ shows the inverse relation between the experienced throughput and both round trip time and packet loss. In this respect, the advantages deriving from MEC are twofold:
• thanks to the possibility of performing the elaboration closer to the users, $RTT$ is reduced;
• the packets exchanged are not sent over the internet service provider network, where drops due to congestion can occur and generate packet loss; thus MEC also has the advantage of reducing $PL$.
Given the above, and assuming that each user generates one TCP flow, all the flows have the same size, and all the flows are terminated at the same location, the number of supported spatial audio users can be expressed as:

$$N_U = \frac{R}{TH_u} \tag{2}$$

where $R$ is given by Eq. 1 and $TH_u$ is the throughput per single user. $TH_u$ depends on the selected audio quality and can be calculated as:

$$TH_u = W \times S \tag{3}$$

where $W$ is the sample bit-width, whose typical values are 16, 24 and 32 bits, and $S$ is the sample rate, which we assume equal to 44100 Hz.
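A numerical sketch of this model in Python may help to fix the orders of magnitude. It assumes that Eq. 1 takes the usual TCP throughput form, i.e. the minimum between BR and c·MSS/(RTT·√PL), that the number of users is the achievable rate divided by the per-user throughput, and that the per-user throughput is the product of bit-width and sample rate. The 1460-byte MSS, c = 1, the RTT value in the example, the `channels` parameter and the rounding down to whole users are assumptions introduced here, not values taken from this work.

```python
import math


def achievable_rate_bps(rtt_s, loss_ratio, mss_bytes=1460, c=1.0, link_rate_bps=1e9):
    """TCP throughput bound c*MSS/(RTT*sqrt(PL)), capped by the available link rate."""
    tcp_bound = c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_ratio))
    return min(link_rate_bps, tcp_bound)


def per_user_throughput_bps(bit_width, sample_rate_hz=44100, channels=1):
    """Raw audio bit rate of one user; channels is an added knob (2 for a stereo stream)."""
    return bit_width * sample_rate_hz * channels


def supported_users(rtt_s, loss_ratio, bit_width, **rate_options):
    """Whole number of users that fit in the achievable rate."""
    rate = achievable_rate_bps(rtt_s, loss_ratio, **rate_options)
    return int(rate // per_user_throughput_bps(bit_width))


# Illustrative case: 10 ms round trip time, 1% packet loss, 16-bit audio.
users_16bit = supported_users(rtt_s=0.010, loss_ratio=0.01, bit_width=16)
```

With these assumed inputs the TCP bound is about 11.7 Mbps, far below the 1 Gbps link rate, so the min operator does not bind and the result is driven entirely by RTT and PL.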
Figure 5 shows the number of supported spatial audio users for different audio qualities (16 bit, 24 bit, and 32 bit, respectively) and different network conditions, represented by a packet loss ratio varying from 0.01 to 0.001. As expected, the number of supported users decreases as the quality of the audio increases, due to the higher required throughput per user. Furthermore, it has an inverse relation with the PL. It is worth mentioning that variations of PL may be related both to the wireless channel conditions (fading, path loss, presence of obstacles, crowdedness of users in the area) and to congestion in the transport network. In this respect, MEC has the twofold advantage of reducing the RTT and the PL related to congestion in the transport network.
The results presented in Figure 5 may provide a framework for adopting adaptive strategies for the dynamic placement of audio contents or for audio quality adjustment based on the packet loss experienced by the users and the targeted number of users (as happens, for example, in Dynamic Adaptive Streaming over HTTP (DASH), [65]). However, this aspect is out of the scope of this work and represents a future direction.
In order to quantify the advantage in terms of supported users for the MEC scenario with respect to the cloud case, we introduce the supported users gain metric $\Gamma$, which is given by

$$\Gamma = \frac{N_{U,MEC}}{N_{U,CLOUD}} \tag{4}$$

where $N_{U,MEC}$ and $N_{U,CLOUD}$ are the numbers of users supported in the MEC and cloud case, respectively, calculated according to Eq. 2 and Eq. 1 by assuming specific values of $RTT$ and $PL$ for the two scenarios. Thus, when the available data rate $BR$ is not the bottleneck, Eq. 4 can be expressed as follows:

$$\Gamma = \frac{RTT_{CLOUD} \times \sqrt{PL_{CLOUD}}}{RTT_{MEC} \times \sqrt{PL_{MEC}}} = \beta \sqrt{\alpha} \tag{5}$$

where $PL_{MEC}$ and $PL_{CLOUD}$ are the $PL$ values for MEC and cloud, respectively, and $RTT_{MEC}$ and $RTT_{CLOUD}$ are the $RTT$ values for MEC and cloud, respectively. $\alpha = PL_{CLOUD}/PL_{MEC}$ and $\beta = RTT_{CLOUD}/RTT_{MEC}$ represent the ratios between cloud and MEC PL and RTT, respectively. Figure 6 shows the performance gain for different values of the parameters $\alpha$ and $\beta$. Results show that the gain of such a system in terms of supported spatial audio users grows more rapidly with the reduction of round trip time than with a reduction of packet loss. In other words, moving the function responsible for the audio spatialization to the edge of the network is more convenient when the edge elaboration significantly reduces the experienced latency. When the elaboration at the edge introduces a tenfold reduction of both the latency and the packet loss, a MEC approach is able to support a number of users up to 30 times larger than a cloud one.
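The dependence of the gain on the two ratios can be checked with a one-line Python function. It assumes the TCP throughput bound is inversely proportional to RTT·√PL and that the link rate is not the bottleneck, so that the gain reduces to β·√α; the function name is hypothetical.

```python
import math


def users_gain(alpha, beta):
    """Supported-users gain of MEC over cloud.

    alpha = PL_cloud / PL_mec and beta = RTT_cloud / RTT_mec; valid when the
    available link rate is not the bottleneck in either deployment.
    """
    return beta * math.sqrt(alpha)


both_tenfold = users_gain(alpha=10.0, beta=10.0)  # about 31.6
rtt_only = users_gain(alpha=1.0, beta=10.0)       # 10.0
loss_only = users_gain(alpha=10.0, beta=1.0)      # about 3.16
```

The three cases make the asymmetry explicit: a tenfold RTT reduction alone yields a tenfold gain, whereas a tenfold packet loss reduction alone yields only about a threefold gain, and the two together give the roughly 30-fold figure quoted above.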
As α and β can be derived by observing the network behavior and user performance, the results shown in Figure 6 can be used by the MEC platform manager to deploy the spatial audio service at the edge based on achievable gain and targeted performance.

VII. CONCLUSION
This paper presented a possible sound spatialization application for cultural heritage enjoyment exploiting many advantages of the 5G architecture. The proposed solution makes it possible to guarantee almost real-time sound spatialization as a function of users' movements for the most generic client device. This can be achieved by exploiting the MEC paradigm, which moves the computational load from the client to the edge without involving the core network. Results also demonstrate that this scenario is able to serve an increased number of users w.r.t. a cloud-based one, thus making this solution even more appealing for the Italian CH scenario, which comprises many big museums with a high daily number of visitors. From a more general point of view, the presented application is another demonstration of the fundamental importance of the MEC paradigm as an enabler of new services and business models.