A Modified Multiview Video Streaming System Using 3-Tier Architecture

In this paper, we present a modified inter-view prediction Multiview Video Coding (MVC) scheme from the perspective of the viewer's interactivity. When a viewer requests some view(s), our scheme leads to a lower transmission bit-rate. We develop an interactive multiview video streaming system that exploits the modified MVC scheme. Conventional interactive multiview video systems require high bandwidth because redundant data is transferred. With real test sequences, the proposed interactive multiview video system shows clear improvements over competing ones in terms of the average transmission bit-rate and the storage size of the decoded (i.e., transferred) data, with comparable rate-distortion performance.


Introduction
Multiview video consists of video sequences of the same scene captured time-synchronously by multiple closely spaced cameras from different observation viewpoints [1]. Multiview Video Coding (MVC) [2] has been used to encode multiview video signals using various schemes that include both temporal and inter-view predicted frames (i.e., frames are predicted not only from the temporally neighboring frames, but also from the corresponding frames in adjacent views). MVC typically focuses on increasing the Rate-Distortion (RD) performance for the compressed frames of all views, as shown in [3], [4]. Since users do not need all of the views at the same instant, transmitting the whole set of frames wastes bandwidth resources. Moreover, decoding the compressed multiview video at the user side requires high computational cost and storage space.
An Interactive Multiview Video Streaming (IMVS) system [5] provides the aforementioned multiview video service efficiently and flexibly, enabling a viewer to freely interact with the multiview video data. The IMVS system has the advantage of reducing the bandwidth usage, since only the requested subset of the multiview video data is transmitted. However, the primary challenge in an IMVS system is to design a structure that encodes the multiview video data with a reasonable compression efficiency [6] (i.e., a reduced transmission bit-rate), an increased RD performance, and a reduced storage size of the encoded multiview video data.
Readers are referred to [7], [8], [9], [10], [11] for more details on IMVS systems. In [7], the IMVS system encodes the multiview video data in a simulcast mode. In such a mode, each view is encoded and transmitted independently, and each client receives as many of the needed views as the channel bandwidth allows. Although such an IMVS system increases the interactivity between the user and the requested view(s), redundant data is transferred at the expense of the quality of the transferred video when the channel bandwidth is limited.
In [8], a client-driven multiview video streaming system is presented that allows a user to watch 3D video interactively with significantly reduced bandwidth requirements by transmitting a small number of views selected according to the viewer's head position. That system combines MVC and scalable video coding concepts to obtain improved compression efficiency. However, a base layer and enhancement layers of two selected views are additionally transmitted.
In [9], an IMVS system similar to that in [7] is designed to encode the multiview video data with a simulcast coding method, where the multiview video data is sent as two separate streams transported over separate Internet Protocol (IP) channels. However, simulcast encoded video still contains a large amount of inter-view redundant data and needs to be synchronized during inter-view switching. When a client switches from the currently received view to another, the two views may have different end-to-end delays. Such a discontinuity would negatively impact the viewing experience of end users.
In [10], an IMVS system uses a successive view motion model that classifies all frames into potential and redundant ones to be encoded and transmitted to the client. However, the performance of this system depends on a Kalman filter-based predictor. If there are no prediction errors, high-quality streams are displayed. Since the predictor is not perfect, a prediction error causes only the base layer (low quality) to be displayed, which yields a poor user experience.
In [11], an encoding structure is presented that enables each view to be transmitted over a multicast group formed by the clients requesting the same view. An optimal rate allocation algorithm is proposed to deliver the views selected by a client according to the network conditions. However, the decoding complexity of an IMVS decoder should be kept low, because the terminal devices used by different interactive clients vary in processing capability.
The IMVS system has the advantage of using a reduced bandwidth, since only the requested data subset is transmitted. The primary challenge in an IMVS system is to design a structure that encodes the multiview video data with a good compression efficiency, so that the transmission bit-rate is appropriately traded off against the storage size.
In [12], an MVC scheme that encodes the requested multiview video subset data is presented. In that scheme, the inter-view prediction is performed only for the key frames to provide P-frames for both even and odd camera views, whereas the non-key frames of each Group of Pictures (GoP) are predicted with hierarchical B-frames in the temporal direction. In this paper, we extend that work by embedding it as the first step of a proposed interactive multiview video streaming system with a 3-tier architecture inspired by [13] (i.e., client, application server, and database server). The proposed IMVS system is compared to state-of-the-art IMVS systems in terms of transmission bit-rate (in kb·s⁻¹) and pre-encoded data storage size (in kByte).
The rest of the paper is organized as follows. The proposed IMVS system, including the MVC scheme used, is described in Section 2. The implementation setup, the data set sequences used, and the experimental results are presented in Section 3. Finally, conclusions are given in Section 4.

The Proposed IMVS System
A typical IMVS system consists of 5 successive steps: capture, encode, store, transmit and decode. First, the multiview video data is encoded using an encoding scheme. Then, the encoded multiview video data is submitted to a central server, called the application server, in order to be stored in a video database that is available at the MVC database server.
The application server only needs to prepare and transmit the video stream to each client once its request has been received. The video stream is prepared by splitting the requested multiview video subset data from the whole multiview video set. The database management system at the MVC database server fetches the prepared video stream and submits it to the application server, which returns it to the client(s). The application server can also reduce the resolution of the video stream to adapt to the available transmission bandwidth. The resolution reduction is obtained by decreasing the number of video frames in the time domain. Finally, at the client side, a standard video decoder decodes the retrieved multiview video subset data. The proposed IMVS system is based on a 3-tier architecture. Its MVC encoding scheme, application server, and MVC database server are described in Section 2.1, Section 2.2 and Section 2.3, respectively. The cost of splitting views and that of random access are presented in Section 2.4 and Section 2.5, respectively.

The MVC Scheme Used
This subsection describes the MVC scheme used in the proposed IMVS system to encode the retrieved multiview video subset data.
The captured sequence is encoded by an MVC encoder that generates one merged stream. The generated bit-stream is submitted to the application server to be stored at the MVC database server. Figure 1(a) shows an example of the prediction structure of the proposed MVC scheme [12], with the number of views, $N$, set to 8 and the GoP length, $M$, set to 8. With the base view set to $S_4$, the inter-view prediction is performed only for the key frames at $T_0$ and $T_8$ to provide P-frames for the even camera views ($S_2$, $S_0$ and $S_6$) as well as the odd camera views ($S_3$, $S_1$, $S_5$ and $S_7$), whereas the non-key frames of each GoP are predicted with hierarchical B-frames in the temporal direction, as shown in [14]. The temporal scaling shown in Fig. 1(b) can be applied to any multiview video with more than two views.

The Application Server
In this subsection, the role of the application server is presented. The 3-tier architecture of the proposed IMVS system is shown in Fig. 2. The client selects a multiview video data subset stored in the MVC database to be decoded and displayed. The selection process acts as a request from the client to the application server.
The control module at the application server receives and schedules the clients' requests, then asks the MVC database server to retrieve the requested view(s) from the MVC database to be transferred to the client. The scheduling process is performed according to the requested view(s), the client's code, and the available transmission bandwidth. If there is more than one request for the same view, the control module transmits that view over a multicast group formed by the clients requesting that view. The client can randomly switch between frames in both the temporal and view-wise directions.
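The multicast rule described above can be illustrated with a minimal sketch. It assumes that requests arrive as (client, view) pairs; the function names and the "unicast"/"multicast" labels are hypothetical and are not taken from the proposed system, and bandwidth-aware scheduling is deliberately left out.

```python
from collections import defaultdict


def group_requests_by_view(requests):
    """Group client identifiers by the view they request.

    `requests` is an iterable of (client_id, view_id) pairs. Both names are
    illustrative assumptions, not identifiers from the paper.
    """
    groups = defaultdict(list)
    for client_id, view_id in requests:
        # Ignore a repeated request from the same client for the same view.
        if client_id not in groups[view_id]:
            groups[view_id].append(client_id)
    return dict(groups)


def plan_transmissions(requests):
    """Return one (view_id, mode, clients) entry per requested view.

    A view requested by more than one client is sent once to a multicast
    group; otherwise it is sent to the single requesting client.
    """
    plan = []
    for view_id, clients in sorted(group_requests_by_view(requests).items()):
        mode = "multicast" if len(clients) > 1 else "unicast"
        plan.append((view_id, mode, clients))
    return plan


if __name__ == "__main__":
    demo = [("c1", 4), ("c2", 2), ("c3", 4)]
    for entry in plan_transmissions(demo):
        print(entry)
```

Grouping by view before transmission means that each requested view is fetched from the database once per group rather than once per client.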
The cost of such a random access is discussed in Section 2.5. The control module checks the available transmission bandwidth. In case of insufficient bandwidth, the requested view(s) accumulate at the application server. Thus, a stream delay occurs, which yields a buffer overflow. In such a case, the control module passes the video stream through the stream adaptation module. This module reduces the video stream resolution using temporal scaling in the time domain, by decreasing the number of video frames within each GoP. Figure 1(b) shows the proposed temporal scaling at the hierarchical B-frames. It can be seen that the B-frames labeled $B_3$ are not used as reference frames to encode other frames. Thus, those frames can be discarded to reduce the number of frames within one GoP before transmission, in order to adapt to the available transmission bandwidth.
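The temporal scaling step can be sketched as follows, under the assumption of a dyadic hierarchical B-frame structure (GoP length a power of two, as in the $M = 8$ example of Fig. 1). The helper names are hypothetical, and the sketch does not cover the GoP lengths of 12 and 15 used in the experiments.

```python
def hierarchical_level(t, gop_length):
    """Temporal level of frame t (0 <= t <= gop_length) in a dyadic GoP.

    Key frames (t = 0 and t = gop_length) get level 0; the middle frame gets
    level 1, and so on. Assumption: gop_length is a power of two.
    """
    if gop_length < 2 or gop_length & (gop_length - 1):
        raise ValueError("this sketch assumes a power-of-two GoP length")
    if t % gop_length == 0:
        return 0
    depth = gop_length.bit_length() - 1          # log2(gop_length)
    trailing_zeros = (t & -t).bit_length() - 1   # largest power of two dividing t
    return depth - trailing_zeros


def drop_top_level(gop_length):
    """Frame indices kept after discarding the highest-level B-frames.

    The highest level is never used as a reference, so removing it leaves
    every remaining frame decodable.
    """
    depth = gop_length.bit_length() - 1
    return [t for t in range(gop_length + 1)
            if hierarchical_level(t, gop_length) < depth]


if __name__ == "__main__":
    print(drop_top_level(8))  # frames kept for a GoP of length 8
```

For a GoP of length 8, the odd-indexed frames are the level-3 B-frames, so discarding them halves the number of non-key frames to transmit.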

The MVC Database Server
This subsection describes the role of the MVC database server in splitting the multiview video subset data in response to the client's requests.
The video database typically provides video preprocessing for content representation and indexing, storage management for video, and continuous video streaming [15], [16]. The MVC database has the ability to split a requested view from the whole set of views to be transmitted to the client. The MVC extraction engine retrieves the requested view from the MVC database server by splitting that view, together with its references, from the whole set of views. The output of the MVC extraction engine forms an MVC sub-stream that is submitted to the application server before it is transmitted to the client. The cost of the view splitting step is discussed in the following subsection.
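The idea of extracting a view together with its references can be sketched as below. The sketch assumes that each view's key frames are P-frames with a single reference view, and that reference views only contribute their key frames (since inter-view prediction is restricted to key frames). The reference map in the demo is a toy example, not the structure of Fig. 1(a), and all names are hypothetical.

```python
def views_to_extract(requested_view, key_frame_ref):
    """Return the requested view followed by the chain of its reference views.

    `key_frame_ref` maps each view index to the view its key frames are
    predicted from, or to None for the base view (illustrative assumption).
    """
    chain = []
    view = requested_view
    while view is not None:
        if view in chain:
            raise ValueError("cyclic reference map")
        chain.append(view)
        view = key_frame_ref[view]
    return chain


def extraction_plan(requested_view, key_frame_ref):
    """Describe which frames of each view go into the sub-stream."""
    plan = {}
    for view in views_to_extract(requested_view, key_frame_ref):
        if view == requested_view:
            plan[view] = "all frames"
        else:
            plan[view] = "key frames only"
    return plan


if __name__ == "__main__":
    # Toy reference map: view 0 is the base view.
    toy_refs = {0: None, 1: 0, 2: 1, 3: 0}
    print(extraction_plan(2, toy_refs))
```

Under these assumptions, the size of the sub-stream grows with the length of the key-frame reference chain between the base view and the requested view.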

Cost of Splitting Views
As shown in Section 2.1, the MVC prediction structure consists of one base view, $S_b$, and multiple enhanced views, $S_e$. The base view $S_b$ is normally coded by single-view coding, and acts as a reference to encode the frames of the other views $S_e$. For some view, $S_n$, the splitting process is performed by extracting its GoP series from each group of GoP stream. The number of extracted frames for one GoP can be generally formulated as in Eq. (1), where $E(\cdot)$ denotes the cost function for extracting frames, $b$ denotes the base view number, $n$ denotes the view number to be encoded, $\alpha$ denotes the style of inter-view prediction at key frames, with $\alpha \in \{1, 2\}$ (1 for the standard style, referred to as HBP, and 2 for the sequential style), $\beta$ denotes the number of reference views for non-key frames, with $\beta \in \{0, 1, 2\}$, $R(\cdot)$ denotes a function that determines the number of key frames in the inter-view prediction from $S_b$ through $S_n$, $G(\cdot)$ denotes a function that determines the number of non-key frames related to $S_n$ in the scheme, and $l$ denotes the number of non-key frames in that GoP.
The function $R(b, n, \alpha)$ in Eq. (1) can be cast as in Eq. (2), where $b, n \in \{0, 1, 2, \ldots, N - 1\}$ and $N$ denotes the total number of views. As well, the function $G(l, \beta)$ in Eq. (1) can be written as in Eq. (3).
Generally, the cost, $\mathrm{Cost}_E$, of the extracted frames for splitting all GoPs can be cast as in Eq. (4). To improve the view extraction performance, the $\mathrm{Cost}_E$ in Eq. (4) of each GoP has to be minimized. Therefore, the $\mathrm{Cost}_E$ in Eq. (4) can be reformulated for a given encoding scheme, $\tau$, as the minimization problem in Eq. (5), where each $\tau$ has its own parameters $S_b$, $\alpha$ and $\beta$.
To solve the minimization problem in Eq. (5), a suitable encoding scheme has to be chosen for the proposed IMVS system. This choice is made by determining the $\mathrm{Cost}_E$ for all candidate encoding schemes; the best scheme is the one that yields the lowest $\mathrm{Cost}_E$ value.
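The selection step described above amounts to an exhaustive search over the candidate schemes. A minimal sketch is given below; the function name, the candidate labels, and the cost values in the demo are placeholders chosen purely for illustration, and the actual cost function is the one defined by the equations of this subsection.

```python
def select_scheme(candidates, cost_fn):
    """Return the name of the cheapest scheme and the cost of every candidate.

    `candidates` maps a scheme name to its parameters; `cost_fn` maps those
    parameters to a cost in frames. Both are supplied by the caller.
    """
    if not candidates:
        raise ValueError("no candidate schemes given")
    costs = {name: cost_fn(params) for name, params in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    # Placeholder parameters and a placeholder cost, NOT values from the paper.
    toy_candidates = {"tau_1": {"frames": 12}, "tau_2": {"frames": 9}}
    best, costs = select_scheme(toy_candidates, lambda p: p["frames"])
    print(best, costs)
```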

Cost of Random Access
It is worth noting that random accessibility is the first step in interactivity. The user can access any single frame in either the temporal or the view-wise direction when watching a multiview video program [17]. Random Access (RA) can be defined as the cost of accessing any frame in one video sequence. Therefore, RA can be considered as a performance evaluation metric for a candidate prediction structure of an encoding scheme.
The RA performance is measured by the number of frames that are needed to decode a specific frame in one GoP. In turn, the best encoding scheme should yield a minimum Accumulative Sum of the Reference Frames (ASRF), which can be formulated as in Eq. (6).
In Eq. (6), $A(\cdot)$ denotes the ASRF, $b$ denotes the base view number, with $b \in \{0, 1, 2, \ldots, N - 1\}$, $n$ is the randomly selected view number, with $n \in \{0, 1, 2, \ldots, N - 1\}$, $F_{n,t}$ denotes the frame at view $S_n$ and time $t$, $l$ denotes the number of non-key frames in a GoP, and $\alpha$ and $\beta$ are as defined for Eq. (1). The function $P(F_{n,t}, \beta)$ determines the number of reference frames for the frame $F_{n,t}$ and can be formulated as in Eq. (7).
In Eq. (7), $\Lambda \equiv \{1, 2, \ldots, l\}$ and $\vartheta$ denotes the level of the non-key frame $F_{n,t}$ in an encoding scheme. The function $H(b, n, \alpha)$ determines the number of frames in the inter-view prediction and is defined as in Eq. (8).
As well, the function $D(F_{n,l})$ determines a constant value according to the location of the frame in an encoding scheme and can be written as in Eq. (9). For instance, in the proposed encoding scheme, shown in Fig. 1(a), the subscript of the I-, P-, and B-frames denotes the level $\vartheta$. For a certain $S_b$, the cost of random access, $\mathrm{Cost}_R$ (in frames), can be determined as in Eq. (10). To improve the random accessibility performance, the $\mathrm{Cost}_R$ in Eq. (10) of each GoP has to be minimized. Therefore, the $\mathrm{Cost}_R$ in Eq. (10) can be reformulated for a given encoding scheme, $\tau$, as the minimization problem in Eq. (11), where each $\tau$ has its own parameters $S_b$, $\alpha$, and $\beta$.
To solve the minimization problem in Eq. (11), a suitable encoding scheme has to be chosen for the proposed IMVS system. This choice is made by determining the $\mathrm{Cost}_R$ for all candidate encoding schemes; the best scheme is the one that yields the lowest $\mathrm{Cost}_R$ value.
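The frame-counting notion of random access used in this subsection can be illustrated with a small sketch. It does not reproduce Eqs. (6) to (11); it simply counts, for a single view, how many frames must be decoded to display a given frame. The assumptions are a dyadic hierarchical B-frame GoP, an intra-coded opening key frame, and a closing key frame predicted from the opening one; all names are hypothetical.

```python
def hierarchical_refs(gop_length):
    """Reference map for one view with dyadic hierarchical B-frames.

    Assumptions: frame 0 is intra-coded, frame `gop_length` is predicted
    from frame 0, and each B-frame references the two frames that enclose it
    at the next coarser level. gop_length must be a power of two.
    """
    if gop_length < 2 or gop_length & (gop_length - 1):
        raise ValueError("this sketch assumes a power-of-two GoP length")
    refs = {0: (), gop_length: (0,)}
    step = gop_length
    while step > 1:
        half = step // 2
        for left in range(0, gop_length, step):
            refs[left + half] = (left, left + step)
        step = half
    return refs


def frames_needed(target, refs):
    """Number of frames to decode in order to display `target` (inclusive)."""
    needed = set()
    stack = [target]
    while stack:
        frame = stack.pop()
        if frame in needed:
            continue
        needed.add(frame)
        stack.extend(refs.get(frame, ()))
    return len(needed)


def accumulated_cost(gop_length):
    """Sum of frames_needed over every frame index 0..gop_length."""
    refs = hierarchical_refs(gop_length)
    return sum(frames_needed(t, refs) for t in range(gop_length + 1))


if __name__ == "__main__":
    refs = hierarchical_refs(8)
    print({t: frames_needed(t, refs) for t in range(9)})
    print(accumulated_cost(8))
```

In a multiview setting, the same traversal would also follow the inter-view references of the key frames, which is where the choice of base view and inter-view prediction style affects the cost.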

Experiments & Results
The data sequences used are described in Section 3.1. The implementation setup of all experiments is given in Section 3.2. Finally, the results are shown and discussed in Section 3.3.

Data Sets Description
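The test video sequences (Ballroom, Exit, Vassar, and Breakdancers) [18], [19] are described in Tab. 1.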

Implementation Setup
Our implementation runs on a personal computer with a 2.4 GHz Core i3 processor and 2 GB of RAM. For the application server, we installed the Live555 media server as the video transmission software [20].
In this paper, we use the joint multiview video coding software (v.8.5) [21] to encode the data sets and to extract the MVC sub-stream at the MVC extraction engine. The quantization parameter is set to 24, 28, 32, and 36. The search mode is set to fast search with a search window of 96 × 96 pixels. The GoP length, $M$, is set to 12 for the Ballroom, Exit, and Vassar video sequences, and to 15 for the Breakdancers video sequence.
The performance of competing schemes is evaluated by three metrics:
• the RD performance (in dB·kb⁻¹·s⁻¹), the higher the better,
• the cost of splitting views (in frames), the lower the better,
• the cost of random access (in frames), the lower the better.
The RD performance, measured in dB·kb⁻¹·s⁻¹, describes the trade-off between the video quality and the bit-rate of the video stream. The higher the RD value, the better the MVC encoding scheme. Figure 3 shows the RD performance of the competing MVC schemes applied to the data sequences described in Section 3.1 at different quantization parameters. It can be seen that the MVC scheme used provides RD performance comparable to the KS-IPP [4], MVC-HBP [22] and YANG [23] schemes, whereas it surpasses the Simulcast scheme [4] by an average improvement of 19 % in terms of RD performance.
Fig. 3: Rate-distortion performance of competing encoding schemes: MVC-HBP [22], YANG [23], KS-IPP [4], Simulcast [4], and Proposed [12], for different standard video sequences with different quantization parameters (QP).
Table 2 shows the cost of splitting views, $\mathrm{Cost}_E$, for the competing MVC schemes. The MVC scheme used outperforms the MVC-HBP [22], KS-IPP [4], and YANG [23] schemes by an average reduction of 44.6 %, 14.2 % and 3 %, respectively, in terms of the cost of splitting views.
Tab. 2: Cost of splitting views, $\mathrm{Cost}_E$, using competing MVC schemes with different groups of GoP. The lower, the better. Columns: group of GoP size; $\mathrm{Cost}_E$ (in frames) for MVC-HBP [22], KS-IPP [4], YANG [23], and Proposed.
Table 3 shows the cost of random access, $\mathrm{Cost}_R$, that can be determined by Eq. (11) using all competing MVC schemes. It can be seen that the MVC scheme used outperforms the MVC-HBP [22], KS-IPP [4], and YANG [23] schemes by an average reduction of 42.9 %, 43.2 % and 1.1 %, respectively, in terms of the cost of random access.

Results of the Proposed IMVS System
The proposed IMVS system, referred to as Proposed IMVS, is compared to:
• the multiview video coding system [4] (referred to as the MVC system),
• the real-time transmission system of high-resolution multiview stereo video over IP networks [9] (referred to as the Multiview over IP system),
• the client-driven selective streaming system for multiview video transmission [11] (referred to as the Client-driven system).
The performance of competing systems is evaluated by three metrics:
• the transmission bit-rate (in kb·s⁻¹),
• the pre-encoded data storage size (in kByte),
• the ratio between transmission bit-rate and storage size (in (kb·s⁻¹)·kByte⁻¹).
The transmission bit-rate metric is measured in kb·s⁻¹; the lower, the better. Table 4 shows that the proposed IMVS system outperforms the MVC [4], the Multiview over IP [9], and the Client-driven [11] systems by an average improvement of 81.8 %, 63.5 % and 42.4 %, respectively, in terms of transmission bit-rate. This improvement can be explained as follows. The proposed IMVS system and the Client-driven system transmit only the requested view(s) to the client, whereas the MVC system transmits the whole set of views to the client, and the Multiview over IP system [9] transmits the whole set of views to the client as two separate streams.
The storage size, in kByte, of the pre-encoded multiview video subset data is an important factor that impacts the IMVS system performance; the lower, the better. Table 4 shows that the proposed IMVS system outperforms the Multiview over IP system [9] by an average reduction of 18 %, and incurs a negligible increase in the storage size compared to the MVC [4] and Client-driven [11] systems.
In terms of the ratio between transmission bit-rate and storage size (in (kb·s⁻¹)·kByte⁻¹), Tab. 4 shows that the proposed IMVS system outperforms the MVC [4], the Multiview over IP [9] and the Client-driven [11] systems by an average improvement of 79 %, 68 % and 39 %, respectively.

Conclusions
In this paper, we first presented an inter-view prediction structure of the MVC scheme. The MVC scheme surpasses the MVC-HBP, KS-IPP and YANG MVC schemes by an average reduction of 44.6 %, 14.2 % and 3 %, respectively, in terms of the cost of splitting views, and by an average reduction of 42.9 %, 43.2 % and 1.1 %, respectively, in terms of the cost of random access. The presented MVC scheme provides rate-distortion performance comparable to the aforementioned MVC schemes and surpasses the Simulcast scheme by an average increase of 19 %.
The proposed IMVS system exploits the MVC scheme used in [12] to ultimately improve the viewer interactivity. The proposed IMVS system outperforms the MVC, Multiview over IP and Client-driven systems by an average improvement of 81.8 %, 63.5 % and 42.4 %, respectively, in terms of transmission bit-rate, and by an average improvement of 79 %, 68 % and 39 %, respectively, in terms of the ratio between transmission bit-rate and storage size. However, the proposed IMVS system has a slight increase in the storage size compared to the MVC and Client-driven systems, though it outperforms the Multiview over IP system by an average reduction of 18 % in the storage size.

Fig. 1: (a) The prediction structure of the proposed multiview video coding scheme, (b) The proposed temporal scaling.
Tab. 1: Description of the test video sequences [18], [19].
Tab. 3: Cost of random access, $\mathrm{Cost}_R$, using competing MVC schemes with different groups of GoP. The lower, the better.