Design and implementation of a VoIP PBX integrated Vietnamese virtual assistant: a case study

ABSTRACT As digitization is integrated into daily life, media are increasingly transferred over the Internet. Voice-over-Internet Protocol (VoIP), the most popular media transfer technology, is attracting many researchers and investments. The application of Artificial Intelligence (AI) technology to the Private Branch Exchange (PBX) has played a pivotal role in enhancing the customer experience and is able to unite employees in any company. One technology application used to optimize customer experience in a call centre is an automatic PBX integrated with a Virtual Assistant (VA), which lets customers interact directly with the PBX by voice, in multiple languages, without any keystrokes. The Interactive Voice Response (IVR) module forwards the customer's call to an operator or supports automatic processing. This solution can help businesses to handle thousands of calls per day with optimal performance, thus creating a customer care campaign that quickly reaches many users. A PBX integrated with Vietnamese Virtual Assistants (VVA) on an AI technology platform will also help businesses to cut down on operator costs with automated calls. Through comparison with a traditional PBX, this article analyzes, evaluates and optimizes an automatic PBX system with integrated VVA, thereby offering efficient solutions for interested companies.


Introduction
Voice-over-Internet Protocol (VoIP) comprises a set of software and hardware technologies for making voice calls that use a data network instead of a traditional Public Switched Telephone Network (PSTN) system. VoIP is widely used in corporate environments, and the adoption of this technology by businesses is expected to continue growing in the coming years (Packer & Reuschel, 2018). The main reason for the popularity of this model is cost saving. Both large and small companies acknowledge that deploying and managing separate data and voice networks is expensive. In contrast, converged voice and data networks enable unified communications services while reducing costs. Furthermore, the costs associated with traditional phone calls are usually higher than those associated with VoIP calls (Karapantazis & Pavlidou, 2009).
One essential component of a VoIP network is a PBX. In the context of corporate websites, Asterisk has become the most convenient solution. Asterisk is an open-source Linux-based PBX gaining momentum in the VoIP industry (Martin et al., 2018; Montazerolghaem et al., 2016; Nuño et al., 2020; Senthil Kumar et al., 2015; Suwannaraj & Boonkrong, 2014). Asterisk enables development of an inexpensive rich IP telephony service (Dabbebi et al., 2013) supported by standard and widely used protocols such as the Session Initiation Protocol (SIP) and Real-time Transport Protocol (RTP). Asterisk facilitates communication between terminals with one or more lines connected to the PSTN. In this way, the terminals can establish calls over the Internet or with other terminals in the PSTN. Other Asterisk functions include voicemail, call scheduling, and Interactive Voice Response (IVR).
Many VoIP applications have promising prospects, for example in the area of multiplayer gaming (Chakraborty et al., 2019;Nugent, 2015). Today, games offer players realistic models, credible backgrounds, and the ability to use the Internet to connect with millions of other gamers worldwide. The chance for multiplayer gamers to cooperate and play has been enabled by VoIP, which has made games more enjoyable despite its shortcomings. Its main advantage is the seamless integration of gaming software with VoIP applications, which allows users to rapidly immerse themselves without switching to separate windows to share information with other players. VoIP also allows users to invite players to a game during runtime without interruption, thus maintaining engagement and growing the popularity of game applications.
Multi-conference applications also rely on VoIP, so that professionals from across the business and service sectors can instantly connect 'on the fly'. In addition to the cost-saving benefits derived from VoIP conferencing, users can also take advantage of add-ons such as checking voicemail over the Internet, attaching messages to emails, and sharing files while chatting. Furthermore, social networking applications have also grown significantly with the arrival of Facebook, LinkedIn, and Twitter (Chakraborty et al., 2019), while other popular Web-based services such as Google Meet and Zoom allow people worldwide to connect and share ideas. These sites invite users with registered accounts to meet others with similar interests and interact via text, audio, and video chats. Since most of these service providers operate without collecting any money from users, VoIP is the optimal choice for implementing the necessary utilities.
In the last two years, as the COVID-19 pandemic hit globally and hundreds of millions of people were infected, hundreds of countries closed their borders. During this time, technology applications such as VoIP have played a significant role in all fields, including online education, conferencing, and buying and selling on Internet technology platforms.
Integrating developing technology trends into customer care is considered an inevitable future development. The number of calling customers consistently overloads traditional PBXs, and the traditional automated handling and interaction systems typically used to control simple interactions and long calls are not very efficient (Brambilla & Molinelli, n.d.;Guzman, 2019). However, with the use of Artificial Intelligence (AI) complex queries can be resolved more quickly. A Virtual Assistant (VA) can automatically assist customers by solving frequently asked questions and customer problems according to available scenarios. In such a scenario, a VA can easily access base warehouse information and internal data to find answers for customers without consulting multiple sources.
One of the most visible advantages of businesses using VAs is to save staff training costs and meet the ever-growing needs of customers while optimizing operating costs. Many companies already delight customers by providing quick access to required information while reducing their costs. In fact, recruiting and training employees consumes a lot of time and money for customer service companies and departments. In addition to building a VA to save training costs, many businesses have now switched from using a traditional PBX to a VoIP PBX built on the Cloud to save setup and transport costs. As a result, it is easier than ever to operate, use, expand, and troubleshoot (Cortés-Mendoza et al., 2016). To the best of the authors' knowledge, this article is the first to build and develop a VoIP PBX Vietnamese Virtual Assistant (VVA) in Vietnam, which has been tested in a natural environment at the Ton Duc Thang University. In addition, the transcription of call audio recorded by the VA provides meaningful context about incoming calls to a user when the phone rings.
The major contributions of this study are as follows:
- The article presents one of the first solutions that applies AI technologies to a VoIP PBX in Vietnam. Currently, the use of AI technologies to create VAs and integrate them into a Cloud-based VoIP PBX has been rigorously researched or implemented by only a few companies. As such, combining these different techniques and technologies on a VoIP PBX represents a great challenge.
- This study uses an open-source VoIP PBX (specifically Asterisk) built on the Google Cloud platform. This brings the benefits of saving costs and increasing the flexibility, safety, and stability of real radio compared to designing a complex infrastructure and operating and maintaining the system.
- The article evaluates and selects a Speech to Text voice recognition system that supports Vietnamese with high accuracy, even with many types of environmental noise such as interference from the street, rain, or other people. The voice received from the customer by the VoIP PBX is thus transferred to the voice recognition system for processing.
- The developed VVA aims to integrate the PBX with voice input data and build a PBX on the Cloud platform, which are new and innovative features. This system should provide an enhanced customer experience and improve business, particularly since calls handled by VAs can cost less than calls made directly to an operator.
- The difference between the solution proposed in this article and Mobifone's mAICallCenter is the application of the product (https://itc.mobifone.vn/giai_phap/maicallcenter/). Our novel solution is intended for 'calling in' to the PBX according to customer needs. The call-in scenario is unpredictable, while the mAICallCenter only executes 'call out' campaigns and is scripted in advance.
- The results of this study demonstrate that the model is only suitable for Vietnamese speech and text processing in the context of returning results found in the database.
Integrating the VVA Penny into the PBX of Ton Duc Thang University will increase the rate at which simultaneous calls to the PBX are handled while ensuring the expected response. Therefore, the VVA can replace traditional support agents (reducing the total personnel by two-thirds), thereby helping the University to save operating costs compared to the use of a traditional PBX. The VVA works well with pre-prepared and trained questions and statements. However, the accuracy of the VVA needs to be further improved to be able to recognize more complex queries and commands.
The remainder of the paper is organized as follows. Section 1 introduces the motivation and relevance. Related works are presented in Section 2. Section 3 briefly describes the background and basic concepts of VoIP. A system architecture is designed and developed in Section 4. Experimentation and evaluation are treated in Section 5, before Section 6 discusses the work. Finally, Section 7 offers concluding remarks and outlines future work.

Related works
Most traditional media, including telephones, music, movies, and television, have in recent years been digitized for transmission over the Internet. VoIP refers to a group of technologies, communication protocols, and transport techniques for delivering voice and multimedia over an Internet protocol network. Since Skype was released, many people have come to realize the convenience of voice and data transfer over the network. Indeed, since the mid-1990s, telephone equipment manufacturers have been adding IP capabilities to their existing PBX telephone switches. VoIP phones now offer an alternative to traditional telephony. As VoIP technology matures, achieving better quality of service (QoS) with VoIP has been increasingly studied (Chakraborty et al., 2019).
than the other methods. Fourth, a novel scheme was presented to unveil encrypted network traffic and identify tunnelled and anonymous network traffic (Islam et al., 2021). Deep learning was also used to identify anonymous network traffic and extract the voice over VoIP. The experimental results showed that the scheme was considerably more robust than a virtual private network and onion router. More recently, literature reviews have presented the history and applications of such technologies, including Rasa chatbot technology (Adamopoulou & Moussiades, 2020; https://rasa.com). Indeed, VoIP PBX combined with Rasa technology is a new trend chosen by companies for deploying PBX phone systems.
VA-integrated AI technology has been applied in many fields, including education, insurance, technology, and trading, on all technology platforms. In recent work, several VA-integrated AI technology systems have been proposed; e.g. (Arora et al., 2021; Rawassizadeh et al., 2019; Shanthini et al., 2020). One review paper (Rawassizadeh et al., 2019) discusses opportunities, challenges, and forthcoming trends as well as a future concept for VA in daily life. Meanwhile, Shanthini et al. (2020) demonstrate an AI bot for which the authors present a natural language user interface. A combination of several AI algorithms and open-source tools is used to design an intelligent personal assistant. The results of the paper show that the open-ended requests than the specific tasks in multiservices to artificial brain. The research presented in Arora et al. (2021) studied and analyzed the working models and efficiency of different VAs available on the market, such as Apple's Siri, Google Assistant, and Amazon's Alexa. The experimental results provide a comparative analysis of the traffic and message communication with length of conversation over approximately three days.
After a systematic review of several relevant studies, we believe that this solution is the first to be used in Vietnam and within the framework of a university. In addition, it is superior to the mAICallCenter PBX, which also uses AI technology (https://itc.mobifone.vn/giai_phap/maicallcenter/). Overall, this research aims to fill the gap in existing work by linking together the topics addressed in previous studies.

Introduction
VoIP is a technology used to transmit the human voice over a computer network with the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol set (Giambene, 2014; Karapantazis & Pavlidou, 2009; Packer & Reuschel, 2018). It uses IP data packets (on the Local Area Network (LAN), Wide Area Network (WAN), and the Internet) to transmit information encoded from sound. This technology is based on packet switching, which replaces the circuit (channel) switching used by previous voice transmission technology. It multiplexes multiple compressed voice channels onto a single transmission line. These signals are transmitted over the Internet, thus reducing costs. VoIP allows calls to be made using a broadband connection instead of an analogue phone line, and many VoIP services only allow users to call others using the same service. However, some services allow users to call other people using local, long-distance, mobile, or international numbers. While some services only work through computers, others use a traditional phone through an adapter. The principles of VoIP operation include:
- Digitizing the voice signal.
- Compressing the digital signal.
- Splitting the packets if needed.
- Transmitting the packets over the network to the destination where they are reassembled.
At the destination, the packets are decoded in message order to restore the original analogue voice signal. To do so, IP telephones, often with built-in signalling protocols such as SIP or H.323, connect to the IP PBX of the business or service provider. An IP phone can be a regular phone (instead of connecting to a telephone network via an RJ11 communication line, an IP phone connects directly to the LAN via an Ethernet cable, RJ45 communication) or voice software (softphone) installed on a computer.
- Gateway: This component helps to convert analogue signals into digital signals (and vice versa).
- VoIP gateways: These act as a bridge between the regular telephone network PSTN and the VoIP network.
- VoIP server PBX: The central server has the function of routeing and securing VoIP calls. In an H.323 network, it is called the gatekeeper. In a SIP network, servers are known as SIP servers.
- End-user equipment, i.e. a softphone and personal computer: This includes a headphone, software, and an Internet connection. Popular free software includes Skype or MS Teams.
- Phone communicating through an IP adapter: To use a VoIP service, a regular phone must be attached to an IP adapter to connect to the VoIP server. An adapter is a device with at least one RJ11 port (for attachment to the phone), an RJ45 port (for attachment to the Internet or PSTN line), and one power port.
- IP phones: Phones used exclusively for VoIP networks. IP phones do not need a VoIP adapter because the adapter function is built in, enabling direct connection to the VoIP server.

Using VoIP
When the user speaks into a headset or microphone in VoIP, the voice produces an electromagnetic signal, which is an analogue signal (Giambene, 2014; Gohel & Lakhtaria, 2010;Karapantazis & Pavlidou, 2009;Martin et al., 2018;Montazerolghaem et al., 2016;Packer & Reuschel, 2018;Senthil Kumar et al., 2015;Suwannaraj & Boonkrong, 2014). Analogue signals are converted into digital signals with the use of a unique algorithm. Different devices, such as VoIP phones or softphones, use different conversion methods. If using a regular analogue phone, a telephone adapter (TA) is needed. The digitized voice is then encapsulated and sent over the IP network.
The basic steps to make a call in VoIP are as follows:
- The caller determines where to call (e.g. country code, province code) and dials the number to call.
- Connections are established between the caller and the receiver.
- When the caller speaks into a headset or microphone, the voice produces an electromagnetic signal, which is an analogue signal. This is converted into a digital signal using a specific encoding algorithm. The digitized voice is then encapsulated and sent over the IP network. During the process, a signalling protocol such as SIP or H.323 controls the call, for example by setting it up, dialling, or disconnecting. The Real-time Transport Protocol (RTP) carries the voice packets and adds sequence numbers and timestamps, which allow the receiver to detect loss and restore packet order; RTP by itself does not guarantee delivery or quality of service.
- Data are transferred over the initially established connection.
- Data containing the spoken sounds are converted back into sound that can be understood by the listener.
- Finally, the spoken sound is played on the receiver's side.
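The encapsulation step above can be illustrated with a minimal sketch that wraps a voice frame in the 12-byte fixed RTP header (version, marker, payload type, sequence number, timestamp, SSRC) and uses the sequence number to restore packet order at the receiver. The function names, the SSRC value, and the 160-byte frame size are assumptions made for this illustration; they are not taken from the system described in this article.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # static RTP payload type for G.711 mu-law audio


def build_rtp_packet(seq, timestamp, ssrc, payload, payload_type=PT_PCMU, marker=False):
    """Prefix a voice payload with the 12-byte fixed RTP header."""
    byte0 = RTP_VERSION << 6  # version 2, no padding, no extension, CSRC count 0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_packet(packet):
    """Split a packet into its header fields and the voice payload."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": byte0 >> 6, "marker": bool(byte1 >> 7),
            "payload_type": byte1 & 0x7F, "seq": seq,
            "timestamp": timestamp, "ssrc": ssrc, "payload": packet[12:]}


def reorder(packets):
    """Restore sending order by sequence number (16-bit wrap-around is ignored here)."""
    return sorted((parse_rtp_packet(p) for p in packets), key=lambda p: p["seq"])


# Hypothetical usage: three 20 ms frames (160 bytes each at 8 kHz) arriving out of order.
frames = [bytes([i]) * 160 for i in range(3)]
packets = [build_rtp_packet(100 + i, 160 * i, 0x1234, f) for i, f in enumerate(frames)]
received = [packets[2], packets[0], packets[1]]
voice = b"".join(p["payload"] for p in reorder(received))
```

In practice such packets are usually sent over UDP, so a receiver can use the sequence numbers to detect gaps but cannot recover lost frames from the header alone.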
The process of digitizing analogue signals: Representing analogue signals in digital form is a difficult task. Since sound itself, like a human voice, is analogue, many digital values are required to represent amplitude, frequency, and phase. Consequently, converting those values into binary form (0 and 1) is very difficult. Performing this conversion requires a coder-decoder (codec) device, that is, an encoder and a decoder. The analogue signal is applied to the input of this unit and converted into binary digital sequences at the output. The process is then reversed at the receiving end by converting the binary sequences back into an analogue signal. There are four steps involved in digitizing an analogue signal: sampling, quantization, encoding, and voice compression. Having summarized and reintroduced the definitions and principal components of VoIP technology used in this article, the next section introduces the proposed system design and development.
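To make the sampling, quantization, and encoding steps concrete, the following minimal sketch samples a test tone at 8 kHz and applies mu-law companding with 8-bit quantization. It uses the continuous mu-law formula with mu = 255 and is an illustrative assumption only: it is not bit-exact G.711 and is not the codec configuration of the system described here.

```python
import math

MU = 255            # companding parameter used by mu-law telephony codecs
SAMPLE_RATE = 8000  # samples per second, typical for narrowband telephony


def sample_tone(freq_hz, duration_s, rate=SAMPLE_RATE):
    """Sampling: read a sine wave at discrete, evenly spaced instants."""
    n = int(round(duration_s * rate))
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def mu_law_encode(x):
    """Compression and quantization: map a sample in [-1, 1] to an 8-bit code."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))


def mu_law_decode(code):
    """Decoding: expand an 8-bit code back to an approximate sample value."""
    y = 2 * code / 255 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


samples = sample_tone(440.0, 0.02)              # 20 ms of a 440 Hz tone
codes = [mu_law_encode(s) for s in samples]     # one byte per sample
decoded = [mu_law_decode(c) for c in codes]
max_error = max(abs(a - b) for a, b in zip(samples, decoded))
```

Because the companding curve is logarithmic, quiet samples are quantized more finely than loud ones, which is why eight bits per sample are sufficient for intelligible speech.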

System design and development
The main contribution of this paper is to design and implement a VoIP PBX (Asterisk)-integrated VVA built on the Google Cloud platform (https://www.asterisk.org; Dodda & Nulu, 2018;Google Cloud, Quotas & Limits, 2020). The article uses AI technologies such as Speech to Text (STT), Text to Speech (TTS), and a Rasa chatbot (Dodda & Nulu, 2018;Rasa, 2020;Google Cloud, Quotas & Limits, 2020), which is continuously trained using Vietnamese conversation samples to convert natural language into structured data. The meaning of the contribution is explained below, and the conceptual architecture is shown in Figure 2.
This work will use a Rasa Voicebot called 'Penny.' Penny receives the input text that the STT engine produces from the voice of a customer calling the call centre. Penny then decides on the PBX branch and either redirects the call to the IVR branches for an operator to receive, or interacts with the customer via the TTS engine, which converts the text of Penny's output into voice. The resulting audio from the TTS engine is then transferred to the Asterisk PBX to interact with customers.
Step 3: The STT engine converts the customer's voice in the call received from the Asterisk PBX into text. This input text provides information for Penny to process according to the script:
- If the customer's needs lie within Penny's interactive responsiveness, go to step 4.
- If the customer needs to reach the operator for support, Penny will ask if the customer wants to meet the operator. If the customer confirms 'Yes,' go to step 5.
Step 4: Penny produces a textual response, which the TTS engine converts to voice to interact with customers.
Step 5: For customers wishing to meet the operator, the BOT Supervisor analyzes customers' needs based on keywords and transfers them to the correct support branch, so that the operator can receive calls without the customer having to press keys as with a regular PBX.
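The decision made in steps 3 to 5 can be sketched as a small routing function. The keyword table, branch names, and function names below are hypothetical placeholders; the actual BOT Supervisor rules are not given in the text.

```python
from dataclasses import dataclass

# Hypothetical keyword-to-branch table (assumption, for illustration only).
BRANCH_KEYWORDS = {
    "admissions": ("admission", "entrance exam", "score"),
    "finance": ("tuition", "fee"),
}
DEFAULT_BRANCH = "general"


@dataclass
class Decision:
    step: int
    action: str
    target: str


def supervisor_branch(text):
    """Step 5: choose a support branch from keywords in the transcribed request."""
    lowered = text.lower()
    for branch, keywords in BRANCH_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return branch
    return DEFAULT_BRANCH


def route_call(text, penny_can_answer, wants_operator=False):
    """Steps 3 to 5: answer by voice, ask for confirmation, or transfer to an operator."""
    if penny_can_answer:
        return Decision(4, "reply_via_tts", "penny")
    if wants_operator:
        return Decision(5, "transfer", supervisor_branch(text))
    return Decision(3, "ask_operator_confirmation", "penny")


decision = route_call("I have a question about tuition fees",
                      penny_can_answer=False, wants_operator=True)
```

Keeping the keyword table separate from the routing logic means that new support branches can be added without changing the call flow itself.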

Speech to text/Text to speech technology
STT is the conversion of spoken language into text (Dodda & Nulu, 2018;Google Cloud, Quotas & Limits, 2020), also known as 'Automatic Speech Recognition' (ASR), 'Voice to Text,' 'Voice recognition,' or 'computer speech recognition.' The application software performs voice recognition by measuring a set of numbers that represent the voice signal. The signals can be divided into sections that contain different words or phonemes. In each segment the voice signal is characterized by the density of energy in various frequency bands. Although the details of signal representation are outside the programme's scope, it is possible to represent the signal with a set of actual values. Meanwhile, TTS is a type of technology that supports reading aloud digital text (Dodda & Nulu, 2018;Google Cloud, Quotas & Limits, 2020), sometimes referred to as 'read aloud' technology. TTS can take text from a computer or other digital device and convert it into audio, and works with almost any personal digital device, including computers, smartphones, and tablets. All types of text files can be read aloud, including Word and Pages documents, in addition to online web pages. Speech in TTS is computer-generated and read at normal speed, and can be accelerated or decelerated. Voice quality varies, but some voices appear human depending on the algorithm and machine learning training. Some computer-generated voices even sound like children. Some TTS tools also incorporate Optical Character Recognition (OCR) technology, which allows TTS engines to read aloud text from images.
Currently, there are many Application Programming Interface (API) resources for STT/TTS technology on the market from which users can choose. However, when it comes to audio files, especially the audio data of call centres, the task of applying STT/TTS technology to the PBX becomes more challenging. Rather than waiting for the call to end and then running STT/TTS on the recording, a VA for the PBX must process the data in real time. If a customer service call has an approximate duration of 10 min, continuous voice-to-text/text-to-voice conversion is required to ensure that the VA can continuously understand and process the data and make the right decisions. For such a scenario, only a handful of API resources available on the market can handle this type of data (e.g. Google, Amazon, IBM, Microsoft, Nuance, and Rev.ai). In this work, Google Speech to Text API is chosen because of its low cost and good support for languages, including Vietnamese.
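The real-time requirement means that call audio has to be handed to the recognizer in small pieces while the call is still in progress. The sketch below shows only this local chunking step; the 8 kHz sample rate, 16-bit samples, and 100 ms chunk length are assumptions chosen for illustration, and the call to the speech API itself is deliberately left out.

```python
SAMPLE_RATE_HZ = 8000   # assumed narrowband telephony sample rate
BYTES_PER_SAMPLE = 2    # assumed 16-bit linear PCM
CHUNK_MS = 100          # hypothetical chunk duration


def chunk_size_bytes(sample_rate=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Number of bytes of audio in one chunk."""
    return sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000


def stream_chunks(audio, size=None):
    """Yield successive chunks of raw audio, as a streaming recognizer would consume them."""
    size = size or chunk_size_bytes()
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


# A 10-minute call (600 s of silence here) split into 100 ms chunks.
ten_minutes = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 600)
chunks = list(stream_chunks(ten_minutes))
```

Under these assumptions a 10-minute call produces 6000 chunks of 1600 bytes each, which gives a sense of the sustained request rate a streaming STT service has to support per call.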

Chatbot natural language processing technology
Chatbots are attracting strong attention from the technology world (Adamopoulou & Moussiades, 2020; Delhi, 2019). Now considered a future platform of AI, chatbots touch most aspects of life and are widely used. Many large technology companies are currently developing chatbots, including IBM (Watson), Facebook (Wit.ai), Google (Dialogflow), Microsoft (Azure), and Amazon, and each company is doing so within its own framework. The study aims to construct a chatbot capable of understanding users' intentions, interacting intelligently, performing actions if users request, providing an effective learning mechanism, and avoiding the use of any paid service. Many online services, including Facebook's Wit.ai, offer good natural language understanding (NLU) (Bird et al., 2009) features but charge for traffic. Google's Dialogflow also offers highly efficient and appreciated NLU but does not support Vietnamese. This leads to the desire to find an open-source application offering full independence and complete control over building bots. The chatbot is intended for private company use, so data should not need to be shared. However, most tools available today are Cloud-based and provided as software as a service: data must be submitted to third parties, and the tools cannot be run internally without an Internet connection. The chatbot framework developed by Rasa is a highly suitable application, as it meets all of the above requirements. That is also why this study chose to use Rasa to build the first chatbot called 'Penny.'

Rasa chatbot technology
Rasa is an open-source application with a machine learning framework for building chatbots or AI assistants, offering simple customization and complete control over building bots (https://rasa.com; Rasa, 2020). With Rasa the user can create, deploy, or host Rasa locally on their server or environment with full control. Rasa also supports external triggers to integrate into applications such as Google API or REST API. Rasa comprises two components:
- Rasa NLU: A library for NLU that classifies intents and extracts entities from user input, and helps the bot to understand what the user is saying.
- Rasa Core: A chatbot framework for machine learning-based conversation management that takes structured input from the NLU and predicts the next best action using a probabilistic model like a long short-term memory (LSTM) neural network.
Rasa NLU and Core are entirely independent of each other; it is possible to use NLU without Core and vice versa, although Rasa recommends using both.

Rasa NLU
This section focuses primarily on NLU, which includes intent classification, entity extraction, and producing a structured output that can be passed to Rasa Core (https://rasa.com). As summarized above, to teach the chatbot to understand the most basic messages, it is necessary to train the NLU model with inputs in a simple text format and extract structured data. To achieve this, it is essential to define the intents and provide a few patterns through which the user can express them.
It is thus necessary to create some files like the following:
- NLU training file: Contains training data about user input and the mapping of the intents and entities included in each example. The more examples are provided, the better the chatbot's NLU capabilities become.
- Stories file: Contains sample interactions that users and chatbots will have. Rasa (Core) creates a pattern of possible interactions from each story.
- Domain file: Lists all intents, entities, actions, templates, and other information. The templates are sample chatbot responses that can be used as actions.

Create Bucket for storage on GCP
To store voice recording files when a customer calls the PBX system on the Cloud, it is necessary to create a bucket (Dodda & Nulu, 2018) large enough to store hundreds or even thousands of recording files to serve the needs of training voice recognition chatbots in the future.

VVA training 'Penny'
As mentioned in the previous section, Rasa is a very suitable application for practical use and fully meets users' needs. Therefore, this article chose to use Rasa to build the chatbot, which is called Penny. In this sub-section, we introduce the steps taken to build the Penny VVA.

Rasa NLU setup
Rasa NLU is an open-source natural language processing tool for intent classification and entity extraction in chatbots (https://rasa.com; Rasa, 2020). The recommended installation steps are as follows: perform docker and docker-compose setup, train NLU and dialogue models with a command line, and create a virtual environment using the command shown in Table 1. Screenshots of these three steps are shown in Figures 3 and 4.

Train NLU and dialogue models
To train and simulate the NLU and dialogue models in the Rasa environment, the command line is deployed as shown in Table 2 and the screenshot in Figure 4. It will be processed in the training data.

Test Rasa NLU
After training the NLU and dialogue models, we continue to perform the Rasa NLU test via the command line and the screenshot shown in Table 3 and Figure 5, respectively.

Rasa architecture
After systematically analyzing the technology components needed to design and build a VoIP PBX (Asterisk) with a VVA, we continue to the deep structure of Rasa, which is the most important feature for system design in this paper.
Figure 6 shows a simplified summary of how Rasa processes a message. The user's input (the message) is passed to the interpreter (Rasa NLU), where the intent and entities are extracted. This data is added to the tracker, which keeps track of the current state of the system. The following step is to invoke the policies that choose which action to perform next. The tracker is updated accordingly, and the message is output to the user. Since Rasa is open-source, all Rasa modules are extensible and interchangeable. One can add custom steps to the Rasa NLU pipeline or define custom policies for Rasa Core. Rasa also uses a friendly Yet Another Markup Language (YAML) format for training the AI.
Step 1: The message is received and passed to the Interpreter, which converts it into a dictionary that includes the original text, the intent, and any found entities. The NLU handles this part.
Step 2: The message is transferred from the Interpreter to the Tracker, an object that tracks the status of the conversation.
Step 3: The current status of the Tracker is sent to each policy.
Step 4: Each policy chooses which action to perform next.
Step 5: The tracker records the selected action.
Step 6: A response is sent to the user.
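The six steps above can be mirrored in a toy message loop. This is a simplified sketch and not Rasa's actual API: the keyword-based interpreter, the two policies, and the response texts are hypothetical stand-ins for trained models and real templates.

```python
class Tracker:
    """Keeps the conversation state as an ordered list of events."""

    def __init__(self):
        self.events = []

    def update(self, kind, data):
        self.events.append((kind, data))

    def latest_intent(self):
        for kind, data in reversed(self.events):
            if kind == "user":
                return data["intent"]
        return None


def interpret(text):
    """Stand-in for the NLU interpreter: keyword rules instead of a trained model."""
    lowered = text.lower()
    if "hello" in lowered:
        intent = "greet"
    elif "score" in lowered:
        intent = "look_up"
    else:
        intent = "out_of_scope"
    entities = {"school": "Ton Duc Thang University"} if "ton duc thang" in lowered else {}
    return {"text": text, "intent": intent, "entities": entities}


def rule_policy(tracker):
    """Return (action, confidence) from a fixed intent-to-action table."""
    table = {"greet": ("utter_greet", 1.0), "look_up": ("action_lookUp_score", 0.9)}
    return table.get(tracker.latest_intent(), ("utter_default", 0.3))


def fallback_policy(tracker):
    """Always propose the default reply with moderate confidence."""
    return ("utter_default", 0.5)


RESPONSES = {
    "utter_greet": "Hello, this is Penny.",
    "action_lookUp_score": "Looking up the requested score.",
    "utter_default": "Sorry, could you repeat that?",
}


def handle_message(text, tracker, policies=(rule_policy, fallback_policy)):
    parsed = interpret(text)                                   # Step 1
    tracker.update("user", parsed)                             # Step 2
    predictions = [policy(tracker) for policy in policies]     # Steps 3 and 4
    action, _ = max(predictions, key=lambda p: p[1])
    tracker.update("action", action)                           # Step 5
    return RESPONSES[action]                                   # Step 6


tracker = Tracker()
first = handle_message("Hello Penny", tracker)
second = handle_message("I want the entrance exam score for Ton Duc Thang University", tracker)
```

Selecting the action with the highest confidence across policies is the design choice that lets a low-confidence prediction fall through to a default reply.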

Build intent/entity
NLP was built to identify the user's intent or the intent of the message and extract the entities contained in that message. For example, in the sentence: 'I want to look up the scores for the entrance exam for Ton Duc Thang University,' the information returned by the NLU module is as shown in Table 4. The training data for the NLU model can be defined in the JavaScript Object Notation (JSON) or markdown format, and the training data has the following form as shown in Table 5.
Here four intents are defined for each user intent that the chatbot can support, in addition to one entity called product name. If products are referred to by multiple names, the Entity Synonyms syntax is used for mapping as shown above.

Custom tokenization for Vietnamese
This work is also quite simple: a tokenizer function for Vietnamese is placed in the file vi_tokenizer.py in the directory rasa/nlu/tokenizers of the Rasa library and registered in /rasa/nlu/registry.py. File contents of vi_tokenizer.py (Righ Now, 2010).
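Vietnamese needs a custom tokenizer because a single word often spans several space-separated syllables. The contents of vi_tokenizer.py are not reproduced here, so the following is only a minimal sketch of one common approach, dictionary-based longest matching; the tiny lexicon and the function name are hypothetical and do not represent the tokenizer used in this work.

```python
import unicodedata


def _nfc(text):
    """Normalize to NFC so that composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical mini-lexicon of multi-syllable words; a real tokenizer would need
# a full dictionary or a trained word segmenter.
LEXICON = {
    tuple(_nfc(word).split())
    for word in ("tra cứu", "điểm thi", "đại học", "tôn đức thắng")
}
MAX_WORD_SYLLABLES = max(len(word) for word in LEXICON)


def tokenize_vi(text):
    """Greedy longest-match segmentation of space-separated syllables into words."""
    syllables = _nfc(text).lower().split()
    tokens, i = [], 0
    while i < len(syllables):
        longest = min(MAX_WORD_SYLLABLES, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in LEXICON:
                tokens.append("_".join(candidate))  # join syllables of one word
                i += n
                break
    return tokens


tokens = tokenize_vi("Tôi muốn tra cứu điểm thi đại học Tôn Đức Thắng")
```

With this lexicon the example sentence is segmented into six tokens, with multi-syllable words such as "tra_cứu" and "điểm_thi" kept together so that the downstream intent classifier sees them as single features.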

Sentiment analysis module
This sub-section demonstrates how Penny assigns positive or negative sentiment ratings to user messages. First, following the instructions from the Rasa docs, a sentiment_analysis.py file is initialized as shown below in Table 6. The next step is to edit the config.yml file to use this custom component. Finally, we train and obtain the results by using Rasa shell with the command line 'rasa train nlu' as shown in Table 7.
These workflows can be challenging without a user interface tool. Rasa X collects chats with users so that progress can be reviewed and decisions made on how to improve the VA. For example, in the case of a chat where the assistant performs exceptionally well, the chat can be saved directly to the stories training data. Conversely, if an intent has been misclassified, the classification error can be corrected and the annotation data saved in the training file. However, not every type of update should be done directly in Rasa X. While reviewing previous chats, it may be concluded that two intents should be merged into one or that the behaviour of a custom action requires change. In these cases, Rasa X is a valuable tool for determining which changes need to take place.

Table 4. Information returned by the NLU module for the example sentence.

    {"intent": "Look up", "entities": {"find_type": "Scores", "school": "Ton Duc Thang University"}}

Table 5. Command line for training the NLU model.

Table 6. The sentiment_analysis.py file (excerpt).

    from rasa.nlu.components import Component


    # The class header and __init__ signature are inferred from the super() call.
    class SentimentAnalyzer(Component):

        def __init__(self, component_config=None):
            super(SentimentAnalyzer, self).__init__(component_config)

        def convert_to_rasa(self, value, confidence):
            """Convert model output into the Rasa NLU compatible output format."""
            entity = {"value": value,
                      "confidence": confidence,
                      "entity": "sentiment",
                      "extractor": "sentiment_extractor"}
            return entity

        def process(self, message, **kwargs):
            # sentiment() is not defined in this excerpt; the confidence is fixed at 0.5.
            key, confidence = sentiment(message.text), 0.5
            entity = self.convert_to_rasa(key, confidence)
            message.set("entities", [entity], add_to_output=True)

Table 7. Command line for sentiment analysis model (config.yml) and Rasa train NLU.
The installation is started with the following command:
curl -s get-rasa-x.rasa.com | sudo bash
In this article, the Rasa chatbot algorithm model is applied as shown in Figure 8. In steps 1 and 2, on receiving a message from Google STT, the chatbot checks whether it is a greeting corresponding to the intent greet, and forwards the subsequent information to action_lookUp_score, which identifies the relevant information provided by the user and compares it with the database. If the provided data matches the existing database, the chatbot moves to step 5 and passes the output text to Google TTS to respond to the user. If the provided data does not match the existing database, the chatbot moves to step 4 and asks the user to repeat the information.
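The flow in Figure 8 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' Rasa implementation: the in-memory SCORE_DB, the function handle_message, the candidate ID format, and the reply strings are hypothetical, and only the intent name greet and the action name action_lookUp_score come from the text.

```python
# Hypothetical stand-in for the admissions database (not from the source).
SCORE_DB = {"TDT001": 24.5, "TDT002": 27.0}


def action_lookUp_score(candidate_id):
    """Return the stored score for a candidate, or None when there is no match."""
    return SCORE_DB.get(candidate_id)


def handle_message(intent, candidate_id=None):
    """Map a recognised intent to the text that would be passed to TTS."""
    if intent == "greet":
        # Steps 1-2: a greeting was detected, so ask for the lookup details.
        return "Hello, please tell me your candidate ID."
    score = action_lookUp_score(candidate_id)
    if score is None:
        # Step 4: no match in the database, so ask the user to repeat.
        return "Sorry, I could not find that. Please repeat your candidate ID."
    # Step 5: a match was found; this text would be sent to TTS.
    return f"Your score is {score}."
```

In a Rasa deployment the same branching would typically live in a custom action rather than in a standalone function; the sketch only makes the match and no-match paths explicit.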

Build Penny VA
For module testing, the user calls the Asterisk PBX number (*299). A text file is generated and sent to the TTS and STT modules to obtain text or audio sources. Requests are then sent to the server through its API. The log files are written and stored temporarily on the system.
In this work, the procedure for testing the TTS module proceeds as shown in Figure 9. First, the user calls the PBX and hears the greeting, then responds by voice. The start time is written to the log file. The application checks whether it can connect to the server; if it cannot, it displays an error message and ends the test. Otherwise, it sends the input text to the server and waits for the server to complete the text-to-speech conversion. It then receives the processed audio file and replays the voice message to the user through Asterisk.
The user then responds by voice while the call is held. The audio file is sent to the STT application. If the confidence index returned by the application is greater than 0.5 (this threshold can be raised or lowered depending on the purpose), the recognised text is fed to the Rasa chatbot. The Rasa chatbot returns its result to the TTS module, which produces the final audio output for the user. If the confidence value is ≤0.5, the procedure returns to the STT module. Finally, the end time, the confidence metric, and the Rasa chatbot response are recorded in the log and audio files, and testing is complete.
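The confidence gate in this procedure can be sketched as a small loop. The sketch below is illustrative only: the callables recognise, chatbot, and synthesise are hypothetical stand-ins for the STT module, the Rasa chatbot, and the TTS module, and the max_attempts limit is an assumption added so that the loop terminates (the text does not state a retry limit).

```python
CONFIDENCE_THRESHOLD = 0.5  # value from the text; adjustable per use case


def run_turn(recognise, chatbot, synthesise, max_attempts=3):
    """Repeat STT until confidence exceeds the threshold, then query the bot.

    recognise() -> (text, confidence); chatbot(text) -> reply text;
    synthesise(reply) -> audio. All three are hypothetical callables.
    Returns (audio, log), or (None, log) if every attempt was rejected.
    """
    log = []
    for _ in range(max_attempts):
        text, confidence = recognise()
        log.append((text, confidence))  # confidence metric kept for the log file
        if confidence > CONFIDENCE_THRESHOLD:
            reply = chatbot(text)
            return synthesise(reply), log
    return None, log
```

A result with confidence exactly 0.5 is rejected, matching the ≤0.5 condition in the text.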

Software development
The front-end languages used to create graphical user interfaces (GUI) in this article are hypertext markup language (HTML), cascading style sheets (CSS), and JavaScript (JS).
The Bootstrap framework and the jQuery library are also used due to their customizability, speed of development, and ease of use. For back-end development, the Python and Perl languages are used. Asterisk's Monitor library is used to record voice commands and encode them into a .wav file before any analysis is performed. Rasa's chatbot modules were chosen because they are free and support the Python language. Additionally, contacting Rasa for support is easier and faster than consulting a forum of many members worldwide. The tool also has the advantage of supporting local languages, including Vietnamese, unlike similar products provided by AWS or Microsoft.

Experimental results
In this section we first report the results of the experimental performance test and evaluate the VVA PBX system availability, to determine the capacity of concurrent calls and bandwidth consumption depending on different hardware configurations. Secondly, Penny the chatbot is assessed with an example user call to the operator comprising a conversation about looking up test entry scores to the University, to check whether Penny understands or not. Penny is also tested with off-topic conversations to see how it responds.

Analyze and evaluate system usability
This sub-section presents the performance tests and evaluates the PBX system's availability, in order to determine the capacity of concurrent calls and the bandwidth consumption under different hardware configurations on GCP. These tests require two virtual machines to be installed on GCP. It should be noted that the virtual machines must have different capacity configurations, because a machine with too little capacity cannot generate enough load to stress the system to its maximum capacity.
Note: This system test should not last more than 10 min due to the length of the audio file. Figures 10 and 11 show a virtual machine that does not use any codecs and does not record calls, under different settings. In total, up to 80 concurrent calls were achieved using 94% of Central Processing Unit (CPU) capacity and only 27% of memory, indicating that Asterisk relies most heavily on CPU resources when handling calls. Accordingly, CPU parameters have the most significant impact on the system. When call recording is added, call capacity falls by 25%, from 80 to 60 concurrent calls. The most obvious difference arises with the G.729 codec: capacity is reduced by 50%, and by up to 60% when call recording is added with the G.729 codec. In conclusion, the use of a codec affects the performance of the PBX to an extent. It is not advisable to use codecs, as the bandwidth provided by Internet Service Providers (ISPs) is now far greater than in previous decades, when compression was needed.
Figure 12(a) shows that as the number of concurrent calls increases to a high level, the percentage of CPU load also increases. With 150 concurrent calls, the CPU Core = 2Gb configuration reaches %CPU = 153%, and when expanded to 4Gb the load falls to 91%. Based on this analysis, the CPU Core = 4Gb RAM configuration is suitable for the current system. Furthermore, Figure 12(b) shows a difference when the CPU Core is changed, whereas changing the RAM does not affect the number of concurrent calls. Finally, we can conclude that memory is not essential for concurrent calls in Asterisk; however, additional memory is necessary in some cases, since Asterisk is not the only application executed on the GCP virtual machine.
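The capacity figures reported above can be checked with a few lines of arithmetic. The sketch assumes that all three reductions are measured against the 80-call baseline; the text states the resulting call count only for the recording case (60 calls), so the other two values are derived here rather than reported by the authors. The names BASELINE_CALLS, REDUCTIONS, and remaining_calls are illustrative.

```python
BASELINE_CALLS = 80  # concurrent calls with no codec and no recording (from the text)

# Fractional capacity reductions as stated in the text, assumed relative
# to the baseline above.
REDUCTIONS = {
    "recording": 0.25,
    "g729": 0.50,
    "g729_recording": 0.60,
}


def remaining_calls(reduction, baseline=BASELINE_CALLS):
    """Concurrent calls left after a fractional capacity reduction."""
    return round(baseline * (1 - reduction))


capacity = {name: remaining_calls(r) for name, r in REDUCTIONS.items()}
print(capacity)
```

Under this assumption, the G.729 configurations would support roughly half or less of the baseline call volume on the same virtual machine.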
Therefore, for a PBX system built on the GCP cloud platform, RAM and CPU parameters can be quickly increased or decreased during peak or quiet periods of business activity, as the number of calls to the call centre rises or falls, helping businesses to save labour costs and improve system operation.
To apply the system in practice, it is assumed that Ton Duc Thang University hires the PBX service of Minh Phuc Telecom for five customer service (CS) agents on a 12-month contract. During the enrolment season for the new academic year, the number of PBX consultants is doubled to 10 to respond to the high number of calls, which doubles the cost of operating the system. At off-peak times, however, not all the CS agents are needed, so part of the running cost of the school's call centre is wasted. In this case, building the PBX on a cloud platform can meet the shortage of system resources during peak hours simply by upgrading the CPU and RAM on GCP (at a cheaper cost), and resources can be scaled down during off-peak hours without much impact. Because the system is built on the widely used GCP platform, network problems and latency are not evaluated in this article.

Evaluation of Penny chatbot
The Rasa chatbot was chosen for this research because it is free, open-source software that supports the Python language. Furthermore, contacting Rasa for support is easier and faster than consulting a forum. The tool also has the advantage of supporting local languages, namely Vietnamese, unlike similar products provided by AWS or Microsoft.

Dataset
- Module testing: For TTS module testing, texts are randomly selected from the Zing-News newspaper site to cover three cases: short texts (words or sentences), medium-length texts (short clauses), and long texts (paragraphs). For STT module testing, as shown in Table 8, audio files available in .wav format are used.
- Voicebot testing: The user calls the operator and holds a conversation about looking up test scores to check whether voicebot Penny understands it. Penny is also tested with off-topic conversations to see how it responds.
[...] (Anh, 2017) to accurately separate Vietnamese words, including the separation of compound words. The dataset collected from this source includes many acronyms and misspellings, so the data must be standardized by building dictionaries to handle this problem. Intents are then labelled, namely hello, price, quantity, payment, and review, along with entities, namely product code and address. After preprocessing and manual labelling, the data are used to train the model with Rasa, following the modular testing algorithm shown in Figure 13. Note that in this paper we experiment with classifying intents and extracting the described information, but we do not describe this in detail here.
Figure 14 shows the Voicer Server configuration, which is integrated into the Penny Voicebot and was developed to speed up communication between the VoIP PBX and Penny. Latency sensitivity is also addressed to give users the feeling of talking with an operator rather than a machine. To this end, an additional voicer server is built on the AGI server (Bryant et al., 2013) of the Asterisk PBX in the Perl language.
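The dictionary-based standardization of acronyms and misspellings mentioned in this sub-section could look roughly like the following. This is a sketch under stated assumptions: the paper does not list its dictionary entries, so the three abbreviations below are merely common Vietnamese chat shorthand chosen for illustration, and the normalise helper and the layout of the labelled example are hypothetical. Only the intent label price comes from the text.

```python
# Hypothetical normalisation dictionary: common Vietnamese chat abbreviations.
# The entries actually used by the authors are not given in the source.
NORMALISATION = {
    "ko": "không",
    "dc": "được",
    "sp": "sản phẩm",
}


def normalise(text, mapping=NORMALISATION):
    """Lower-case the text and replace whole tokens found in the dictionary."""
    tokens = text.lower().split()
    return " ".join(mapping.get(token, token) for token in tokens)


# One labelled example in a simple, illustrative structure.
example = {
    "text": normalise("Sp này giá bao nhiêu"),
    "intent": "price",
    "entities": [],
}
```

Normalising before word segmentation keeps the tokenizer from treating each misspelling as a separate vocabulary item, which is presumably the motivation for the dictionaries described above.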

Voicer server setup
Figure 15 includes 'utterance', the variable that returns the recognised text from the STT module (googleasr.agi). Meanwhile, 'confidence' is the variable that returns the accuracy of the STT result from the Google API. All of these results are stored temporarily at /var/spool/asterisk/monitor/result_call.txt.
Figure 13. The algorithm of a modular module test.

Confidence impact
The test above aims to classify the intents of the input sentence and extract meaningful information. While an ideal model would never miss a prediction, this is not always possible. Over time, as we improve the Voicebot's training data (and hence its confidence index), the orange columns shown in Figure 9 begin to converge along the left side of the chart, with lower confidence levels shown for failed tests.
Moreover, low confidence can be correlated with the likelihood of a wrong prediction. The Voicebot's confidence threshold can help to handle such cases: for example, when the model is less confident in its prediction, it can ask users to rephrase their questions.
In this article, Google STT and TTS were used to develop and test a Voicebot that supports queries about university admission results. The test program was built on the Rasa platform to classify intents and extract information. The test shows that the rate of calls handled with a correct response is 91.67%. This rate is based on the total number of end-user calls to the switchboard in the real environment and counts the admission responses that match the information in the database. The inaccurate response rate was 8.33%, due to low confidence in the feedback (see Figure 15) and to high user-side noise and interference that produced no result. In these cases the STT output is inaccurate, and the Voicebot cannot correctly interpret the erroneous text, which leads to the above errors.
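The two reported rates are complementary, which can be made explicit as follows. The symbols N_match (calls whose admission response matches the database) and N_total (all end-user calls) are introduced here for illustration only; the source does not report the underlying call counts.

```latex
\[
R_{\text{correct}} = \frac{N_{\text{match}}}{N_{\text{total}}} \times 100\% = 91.67\%,
\qquad
R_{\text{incorrect}} = 100\% - R_{\text{correct}} = 100\% - 91.67\% = 8.33\%.
\]
```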
As seen from the results in Figure 15 and Table 9, the model is only suitable for Vietnamese speech and for speech processing in a context that returns results found in the database. Thus, when Penny VA is integrated into the PBX of Ton Duc Thang University, it is tested on a dataset of 200 records (200 calls daily), where Penny VA supports information related to enrolment results. This increases the rate of simultaneous calls to the PBX while ensuring the expected results. Therefore, the VA on the PBX can replace the traditional support operator (reducing the entire staff by two thirds), helping the school to save operating costs compared with the traditional PBX. The Voicebot works well with pre-prepared questions and statements following training. However, its accuracy needs to be improved in the future to recognize more complex questions and commands.
Figure 15. The confidence of a test case.

Discussion
These simulation results show the feasibility of the solution, with accuracy equal to or greater than that of traditional approaches, based on a comparison of throughput before and after applying speech recognition and integration technology. Table 10 compares the results of our solution (VoIP PBX with VVA) with those of a traditional PBX. The performance results thus assess the effectiveness of the proposed VoIP PBX with VVA against existing traditional PBXs that are popular on the market.

Conclusions
VoIP has become a popular technology in modern communications. Many technology platforms such as Skype, Google Talk, and Facebook Messenger provide high-quality free phone services, and have become indispensable in the challenging circumstances of the COVID-19 pandemic. As services, companies, and schools are closed, and working and learning are taking place online, meeting through modern technology platforms such as Google Meet, Zoom, and Cisco Webex is increasingly popular. Indeed, technology platforms using VoIP technology have brought many benefits to individuals and companies.
The explosive development of AI technology has led to its application in numerous fields. In particular, VAs built on AI technology have been applied to PBXs around the world. This research follows that trend by integrating Vietnamese VAs into a PBX using AI technology. The results show that our solution is considerably more effective than the previous product and is pioneering in its practical application. Continuing this research to improve the accuracy and increase the effectiveness of this solution will involve re-evaluating studies and making use of larger, multi-regional datasets to meet the reality, rigours, and high demands of the market.