A New Processing Approach for Reducing Computational Complexity in Cloud-RAN Mobile Networks

Cloud computing is considered one of the key drivers for the next generation of mobile networks (e.g., 5G). This is combined with the dramatic expansion in mobile networks, involving millions (or even billions) of subscribers with a greater number of current and future mobile applications (e.g., IoT). The Cloud Radio Access Network (C-RAN) architecture has been proposed as a novel concept to gain the benefits of cloud computing as an efficient computing resource and to meet the requirements of future cellular networks. However, the computational complexity of obtaining the channel state information in the fully centralized C-RAN increases as the network is scaled up, as a result of the enlargement of the channel information matrices. To tackle this problem of complexity and latency, the MapReduce framework and fast matrix algorithms are proposed. This paper presents two levels of complexity reduction in the process of estimating the channel information in cellular networks. The results illustrate that complexity can be reduced from O(N^3) to O((N/k)^3), where N is the total number of RRHs and k is the number of RRHs per group, by dividing the processing of RRHs into parallel groups and harnessing the MapReduce parallel algorithm to process them. The second approach reduces the computational complexity from O((N/k)^3) to O((N/k)^2.807) using fast matrix inversion algorithms. The reduction in complexity and latency leads to a significant improvement in both the estimation time and the scalability of C-RAN networks.


I. INTRODUCTION
Mobile networks have witnessed unprecedented growth in terms of the number of users and the amount of data traffic. The 5G network is expected to support 1 million user equipment (UE) devices per square kilometer with 1 ms end-to-end latency [1]. In addition, the data rate of 5G is expected to be 10 times higher than that of 4G networks [2]. This expansion requires novel technologies to be developed to meet the increased future demand of mobile users. Recently, C-RAN technology has been gaining increasing recognition from researchers and mobile network operators and has been nominated as the architecture of 5G [3]. Unlike current mobile networks, which have the baseband unit co-located within the cell site, baseband processing in C-RAN is moved to the cloud for central processing and management. C-RAN has three components (Fig. 1), viz. (a) a remote radio head (RRH) [4], which acts as a remote antenna, (b) low-latency, high-capacity optical communication links known as fronthaul links, which connect the RRHs to the baseband unit (BBU) pool, and (c) a virtual base station (VBS) pool or BBU pool, which is situated in the cloud for centralized signal processing. Whilst C-RAN has many positive attributes, it also has some challenges. One of these challenges is the increase in the computational complexity involved in acquiring the large channel information matrix H as the network expands, due to centralized coordination and processing [5]. This matrix includes the channel state information (CSI) between the user equipment (UE) and the VBS. A delay in estimating this information therefore delays link adaptation between the UEs and the VBS. Consequently, the acquisition of CSI affects the entire performance of cellular networks, particularly the system throughput, which ultimately limits C-RAN scalability. In this paper, two novel approaches are proposed for the C-RAN architecture to decrease the overhead of acquiring the CSI: the MapReduce framework [6] and fast matrix inversion with multiplication algorithms. Deploying these two approaches in C-RAN will support network scalability and help meet the ultra-low latency requirements of the next generation of networks.
The motivation and the contribution of this paper can be summarised as follows.
One of the most important challenges in C-RAN is the challenge of dimensionality [7]. This is due to the centralized coordination of all network elements in the cloud. In other words, the size of the channel matrix H in fully centralized C-RAN increases dramatically with the number of RRHs and UEs in the network. The burden of estimating this information leads to a high processing time, which increases the network latency and reduces the throughput accordingly. However, as shown in Fig. 2, the target latency in next-generation 5G networks is 1 ms, which is considered both a challenge and the key driver for implementing future 5G technologies (e.g., autonomous cars and the tactile internet). To overcome this challenge, two novel approaches are proposed.
• The first is to deploy MapReduce as a processing framework in C-RAN networks to maintain low computational complexity in the centralized pool of VBSs, to meet the future low latency and coherence time requirements, and thus to support network scalability (for large numbers of RRHs). To the best of our knowledge, there is no prior work that has used MapReduce in the channel estimation of communication systems.
• The second is to propose fast matrix inversion algorithms to reduce the processing time of channel estimation in the C-RAN architecture. This approach takes advantage of both Strassen's algorithm and block LU decomposition to reduce the execution time of the matrix inversion in the MMSE estimator.
The list of notations used in the paper is specified in Table 1, which aids in understanding the concepts discussed in the paper.
The rest of the paper is structured as follows: Section II discusses related work on the research problem outlined above. Section III provides the background on the main components of the research. Section IV defines the research problem. Section V presents MapReduce as a proposed solution along with the complexity analysis and simulation results. Section VI includes a mathematical model of the MapReduce framework using queuing theory. Section VII deals with the proposed fast matrix operation algorithms (Strassen and block LU decomposition). Section VIII presents simulation results and discussion. Finally, Section IX presents the conclusions and possible future avenues of exploration.

II. RELATED WORK
The increased computational complexity of obtaining the CSI has a negative effect on the scalability of C-RAN networks. A large number of research studies have focused on sparsification techniques [7]-[9], which make the channel information matrix sparser by excluding some entries and thereby reduce the computational complexity of acquiring the channel information. However, these approaches may limit the network performance or decrease the network capacity, since the number of users is reduced according to, e.g., their distances from the RRHs, without considering the inter-site distance for different types of base stations (macro, micro and pico).
Another research strategy focuses on antenna selection approaches [10]-[12], either selecting subsets of antennas or coordinating the number of active antennas. These approaches may reduce the overhead of CSI acquisition. However, they may also reduce the overall network capacity because of the reduction in the number of antennas. These studies may consider the best case of low UE density, which requires fewer antennas. However, these approaches may fail to provide any improvement in worst-case scenarios, such as high UE density in busy urban areas, when all antennas are required.
The authors in [13] and [14] have tried to minimize the complexity of the channel estimation algorithm itself. This involves either suggesting new estimators or modifying the current estimators. However, it has been observed that there is a trade-off between the performance (accuracy) and the complexity of the estimator. Many studies have tried to minimize the overhead of the most common estimator, the minimum mean square error (MMSE) estimator, by approximating the cubic complexity of matrix inversion by an L-degree matrix polynomial, such as those presented in [14]-[18].
One of the trends in research is to use time division duplex (TDD) systems, which utilize channel reciprocity to reduce the overhead of CSI acquisition. However, a TDD system has the following problems: i) "pilot contamination" is the biggest problem in TDD systems, and it occurs when the channel estimation at the base station in one cell becomes contaminated by users from other cells [19], [20]; ii) the same uplink/downlink timeslot arrangement must be used at all cell sites in adjacent service areas; iii) if the TDD spectrum is divided amongst multiple operators in the same area, then all operators must be strictly time synchronized and have the same uplink/downlink timeslot arrangements; iv) according to [21], TDD systems still have an essential need for a downlink reference signal (RS) and uplink CSI feedback, since the measurement at the transmitter may not capture the downlink interference of the neighboring cells, so the downlink RS is still essential to find the CQI in TDD mode [21]; v) currently, there are at most 40 LTE licenses worldwide for TDD systems, whereas frequency division duplexing (FDD) systems have almost 300 more LTE licenses than TDD [19]; vi) a TDD system requires a large guard period for the base station to switch from downlink transmission to uplink transmission and vice versa, which reduces both the efficiency and the cell throughput in comparison with an FDD system, which has two separate frequencies for the uplink and downlink [22]. Hence, it is worth investigating the problem of CSI acquisition complexity in C-RAN for FDD systems.
In [5], [23], and [24], the authors have suggested clustering methods to decrease the overhead of CSI acquisition in large networks, because in populated networks the aim of obtaining CSI can be achieved by controlling the cluster size rather than the whole network. Clustering approaches can be considered promising techniques to reduce computational complexity. However, choosing the size and the radius of the cluster is considered one of the main challenges. Several studies demonstrate that it is possible to implement a clustering technique for a large group of RRHs in the C-RAN architecture. In the literature, clustering has been deployed for different purposes, such as power minimization [25]-[28] and interference mitigation using coordinated multipoint (CoMP) among neighboring clusters [29]-[31]. Another purpose of using clustering in C-RAN is cost reduction [32], [33], by shortening the overall fiber cable length required in the fronthaul connection through the deployment of a ring topology. Clustering is also used for complexity reduction, as in [5], [34]-[36]. This is owing to the cooperative property of the C-RAN architecture, which enables full sharing of channel information by exchanging the CSI among the VBSs in the cloud. However, the focus of these studies was more on the implementation of the clustering technique, overlooking the method of processing a large number of clusters (formed from a great number of RRHs). With the deployment of next-generation 5G networks in particular, there is a need for a large number of access points. For instance, the distance between two access points is expected to be less than 150 meters. Hence, an efficient and powerful processing framework is required as a processing paradigm for providing scalable distribution of hundreds or thousands of RRH groups to cope with the requirements of the next generation of mobile networks.
In this research, two novel approaches are proposed to reduce processing complexity and latency. The first approach manages CSI acquisition in a well-distributed manner using the MapReduce framework. The algorithm lies in the clustering category, in which a group of RRHs is chosen to be assigned to a single VBS. All other VBSs cooperate and work in parallel to minimize the latency whilst maintaining the network performance. It is worth stating that MapReduce has been employed as a scalable processing paradigm to accelerate the processing of big datasets in the cloud for a large number of applications, such as scalable streaming systems and real-time prediction for explosive traffic flow data. MapReduce has also been used for indexing web content with the database system in the Google search engine.
Secondly, the concept of fast matrix inversion using Strassen's algorithm and block LU decomposition is proposed to minimize the complexity of computing the MMSE estimator.

III. RESEARCH ENABLERS
This section discusses the important components of this research: MapReduce, CSI, and common channel estimation algorithms.

A. BACKGROUND TO MapReduce
MapReduce is a programming framework that enables the implementation of jobs in a distributed and parallel manner, as shown in Fig. 3. MapReduce was introduced by Google [37] in order to process big data sets. Jobs are submitted initially to a division phase. In this phase, each job is divided into a number of tasks, which are mapped to a group of available mappers for processing. The mapper receives input key/value pairs and produces intermediate key/value pairs. The reducer takes an intermediate key and the set of values for that key and combines them to form a smaller set of values. The main features of MapReduce are easy programming, automatic parallelism, and fault tolerance. In general, MapReduce exploits the "divide and conquer" principle. Hence, in this research, instead of acquiring the channel information of all network elements (RRHs with their UEs) in one big channel matrix, the estimation algorithms are applied to a prespecified group of RRHs for each VBS using the MapReduce framework.
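The map, shuffle, and reduce steps described above can be illustrated with a minimal, single-process sketch in Python. It is not the authors' implementation or any real MapReduce runtime: the function names, the group identifiers, and the toy job (average sample power per RRH group) are all hypothetical and chosen only to show how key/value pairs flow through the phases.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each input (key, value) record and collect intermediate pairs."""
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    return intermediate


def shuffle(intermediate):
    """Group intermediate values by key, as done between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped, reducer):
    """Combine the list of values of each key into a smaller result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def run_mapreduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical toy job: average sample power per RRH group.
def power_mapper(group_id, samples):
    return [(group_id, abs(s) ** 2) for s in samples]


def mean_reducer(group_id, values):
    return sum(values) / len(values)


records = [
    ("group_0", [1 + 0j, 0 + 1j]),
    ("group_1", [2 + 0j]),
    ("group_0", [3 + 4j]),
]
result = run_mapreduce(records, power_mapper, mean_reducer)
print(result)
```

In a real deployment the map calls would run on separate workers; here they run sequentially so that the data flow stays easy to follow.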

B. CHANNEL STATE INFORMATION IN CELLULAR NETWORKS
The mobile UE has to send its channel conditions, represented by the channel state information, to the base station for link adaptation between the VBS and the UE. The CSI mainly includes three reports: the precoding matrix indicator (PMI), the rank indicator (RI), and the channel quality indicator (CQI) [38]. Adapting the modulation and coding scheme (MCS) depends on the CSI. For instance, when the value of the CQI, which indicates the quality of the communication link, is high, the value of the MCS is also high. The RI is the number of useful transmitter antennas that can be used in the spatial multiplexing mode. The PMI is a measure that provides a preferred precoding matrix to be used in a VBS for a given radio link in spatial multiplexing mode. Hence, the functions of the CSI reports described above show that inaccuracy in this information leads to performance degradation in the entire mobile network.

C. SYSTEM MODEL AND CHANNEL ESTIMATION ALGORITHMS
The channel model of the received signal is presented in Equation (1). In fully centralized C-RAN, the received signal for uplink transmission [5], [11], where N represents the number of single-antenna RRHs and K represents the number of antennas at the UEs, is as follows:

Y = HX + Z    (1)

where Y is the vector of the received signal; H is the channel state information matrix; X is the transmitted signal vector from K users; and Z is the received noise vector.
In wireless communication systems, there are two common estimation algorithms, which are the MMSE and the least square (LS) estimator.The function of the estimator is to estimate the channel information (Matrix H), which includes the channel state information (CSI).The following two subsections present a brief description for both, the LS and MMSE estimators.

1) LEAST SQUARE ESTIMATOR
The objective of the LS channel estimator is to minimize the squared error between the received signal Y and the pilot signal X. The least square estimate of the channel, denoted Ĥ_LS, can be obtained by dividing the received signal by the known pilot signal, as shown in Equation (2). The LS estimator has low computational complexity, since it is designed to work without any knowledge of the statistics of the channels. However, this estimator suffers from performance degradation due to its high mean square error (MSE) [14] in comparison with the MMSE estimator, as shown in Fig. 4.
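As a minimal sketch of the LS idea, the following Python snippet assumes a per-element (scalar) form of the signal model, y = h·x + z, with a known non-zero pilot x. The function names, the noise model, and the numerical values are illustrative assumptions and are not taken from the paper.

```python
import random


def simulate_received(h, x, noise_std, rng):
    """Per-element uplink model y = h*x + z (assumed scalar form of the signal model)."""
    # Circularly symmetric complex Gaussian noise with total standard deviation noise_std.
    sigma = noise_std / 2 ** 0.5
    z = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return h * x + z


def ls_estimate(y, x):
    """LS channel estimate: divide the received sample by the known pilot."""
    return y / x


rng = random.Random(0)
h_true = complex(0.8, -0.6)   # hypothetical channel coefficient
pilot = complex(1.0, 1.0)     # hypothetical known pilot symbol

# Without noise the LS estimate recovers the channel (up to rounding).
y_clean = simulate_received(h_true, pilot, 0.0, rng)
h_ls = ls_estimate(y_clean, pilot)

# With noise the estimate is perturbed by z / x, which is the source of the LS error.
y_noisy = simulate_received(h_true, pilot, 0.5, rng)
h_ls_noisy = ls_estimate(y_noisy, pilot)
print(h_ls, h_ls_noisy)
```

The estimator needs no channel statistics, which is why it is cheap, but the noise term passes through undamped.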

2) MMSE ESTIMATOR
The MMSE estimator uses second-order statistics of the channel to minimize the mean square error (MSE). The MMSE estimate of the channel response in Equation (1) can be obtained as described in [14], [39].
The statistics of the MMSE estimator are represented by three auto-covariance matrices, R_gg, R_YY, and R_HH, and the cross-covariance matrix R_gY. The estimate of the channel matrix in the MMSE estimator is then determined from these matrices, where:
g: is the channel energy.
Y: is the received signal.
F: is the discrete Fourier transform (DFT) matrix.
R_gY: is the cross-covariance matrix of g and Y.
R_gg: is the auto-covariance matrix of g.
R_YY: is the auto-covariance matrix of Y.
R_HH: is the auto-covariance matrix of H.
It is worth noticing that the MMSE estimator has the best performance in comparison with the LS and other estimators in terms of MSE [14], [39]. A simulation test, shown in Fig. 4, was conducted to quantify the performance of both estimators with the following key settings: one UE and one VBS, 4 × 4 antennas at the UE and the VBS, and a bandwidth of 1.4 MHz. The result in Fig. 4 shows superior performance for the MMSE estimator in comparison with the LS estimator. However, the main drawback of the MMSE estimator is its high computational complexity, as a matrix inversion is required every time the data changes [14]-[18]. Therefore, in C-RAN, unlike in current mobile networks (e.g., LTE-A), the computational cost of acquiring the channel information is extremely high and is multiplied by the fully centralized coordination in the cloud of hundreds of RRHs, which leads to huge channel information matrices. Hence, the focus of this study is to minimize the complexity of acquiring the CSI using the MMSE estimator in the future C-RAN architecture.
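The accuracy gap between the two estimators can be illustrated with a deliberately simplified scalar sketch. It is not the matrix-form estimator of the paper: it assumes a single channel coefficient h with known variance, a known pilot x, and complex Gaussian noise of known variance, in which case the linear MMSE estimate reduces to a scaled ("shrunk") LS estimate. All names and parameter values are hypothetical.

```python
import random


def complex_gaussian(variance, rng):
    """Draw one circularly symmetric complex Gaussian sample with the given variance."""
    sigma = (variance / 2.0) ** 0.5
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def ls_estimate(y, x):
    return y / x


def lmmse_estimate(y, x, channel_var, noise_var):
    """Scalar linear MMSE: weight the LS estimate by channel_var / (channel_var + noise_var/|x|^2)."""
    effective_noise = noise_var / abs(x) ** 2
    weight = channel_var / (channel_var + effective_noise)
    return weight * ls_estimate(y, x)


def compare_mse(trials=5000, channel_var=1.0, noise_var=1.0, seed=1):
    """Monte Carlo estimate of the MSE of both estimators under the assumed scalar model."""
    rng = random.Random(seed)
    x = complex(1.0, 0.0)
    err_ls = 0.0
    err_mmse = 0.0
    for _ in range(trials):
        h = complex_gaussian(channel_var, rng)
        y = h * x + complex_gaussian(noise_var, rng)
        err_ls += abs(ls_estimate(y, x) - h) ** 2
        err_mmse += abs(lmmse_estimate(y, x, channel_var, noise_var) - h) ** 2
    return err_ls / trials, err_mmse / trials


mse_ls, mse_mmse = compare_mse()
print(f"LS MSE ~ {mse_ls:.3f}, MMSE MSE ~ {mse_mmse:.3f}")
```

In the matrix case the scalar weight becomes a product involving the inverse of a covariance matrix, and that inversion is the cubic-cost step this study targets.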

IV. PROBLEM DEFINITION
In C-RAN, the extremely large channel matrices can be considered one of the main causes of the imperfection in the CSI.The detailed explanation for this problem is as follows.
After the introduction of MIMO technology, the importance of accurate CSI acquisition increased, since this affects how efficiently the MIMO system works [40]. In practice, however, as the number of antennas increases, the size of the matrix H increases, and this increases the overhead of acquiring the CSI [34]. The size of the matrix H can be expressed by the following equation [38], [41], [42]:

Size(H) = S_c × N_s × A_r × A_t

where S_c is the number of subcarriers; N_s is the number of OFDM symbols; A_r is the number of receive antennas; and A_t is the number of transmit antennas. The number of OFDM subcarriers in the 3GPP standard is given in Table 2, and the number of OFDM symbols is either 14 or 12 per subframe, depending on whether the normal or extended cyclic prefix is used [38].
To quantify the increase in the estimation time, a model test was performed with one base station (BS) and five user equipment (UE) devices using the Minimum Mean Square Error (MMSE) estimation algorithm. Table 3 and Fig. 5 show how the dimension of the channel matrix H increases, with the estimation time increasing in proportion. In contrast to the LTE-A network, which operates with a one-to-one mapping between the base station and the RRH, in the centralized RAN architecture the dimensions of the channel matrix rise with the number of RRHs and UEs, because the signal processing is aggregated in the cloud [6]. In addition, C-RAN operates (albeit ineffectively) with a one-to-many mapping. This means that each VBS in the cloud manages hundreds of RRHs to enlarge the network capacity [8]. As a result, the computational complexity increases at the centralized BBU pool, thus decreasing effectiveness. Table 3 and Fig. 6 also illustrate that there is a significant percentage increase in latency as the number of antennas at the VBS increases. For instance, with a 128 × 4 antenna system, the latency is almost 700 times that of a 1 × 1 antenna system. It is worth stating that the estimation time of the 1 × 1 system is taken as the baseline for calculating the percentage increase in latency for larger numbers of antennas. As per Equation (1) above, the overhead in C-RAN is due to the increase in the size of the matrix H, which corresponds to the growth in the number of RRHs and UEs. Hence, for an extremely large channel matrix H, the estimation and processing delay the CSI acquisition in the C-RAN architecture. That is, with the expansion of the network size, the burden of computational complexity per user expands as well [5]. As a result, the delay in the CSI leads to inaccurate decisions at the VBS. This is because the most recently obtained CSI is outdated, and consequently the CSI held at the VBS does not represent the current (true) state of the mobile user. Figure 7 clarifies the problem of overheads by presenting how the increase in the dimension of H is proportional to the number of RRHs and the antennas of the UEs for different channel bandwidths.
The Figures and Table above show that the magnitude of the channel matrix increases significantly with the increase in RRHs, which eventually causes the problem of increased computation overheads and increased time taken to acquire CSI.As a result, this limits the scalability of the network.
It is worth stating that the system bandwidth is also one of the important factors that cause the growth in the dimension of the matrix H, as shown in equation 1, although this aspect is not within the scope of this paper. To compound this problem, the bandwidth in future networks is anticipated to increase significantly through the use of white space and millimeter wave bands. Therefore, with centrally processed C-RAN, a considerable number of RRHs, each with its own large bandwidth, will create a very large matrix H, since the bandwidth contribution is multiplied across all RRHs. With the proposed MapReduce design, however, the aggregated bandwidth is multiplied only by the number of RRHs in a group, not by the total number of antennas. This off-loads the computational complexity in the C-RAN architecture.
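To make the growth of H concrete, the following Python sketch evaluates the product of subcarriers, OFDM symbols, receive antennas, and transmit antennas described in this section. The subcarrier counts are the standard LTE values per channel bandwidth and are assumed here to correspond to Table 2; the function and variable names are hypothetical.

```python
# Occupied subcarriers per LTE channel bandwidth in MHz (standard 3GPP LTE values,
# assumed to match Table 2 of the paper).
LTE_SUBCARRIERS = {1.4: 72, 3: 180, 5: 300, 10: 600, 15: 900, 20: 1200}


def channel_matrix_size(bandwidth_mhz, n_rx, n_tx, extended_cp=False):
    """Number of channel coefficients per subframe: S_c * N_s * A_r * A_t."""
    n_symbols = 12 if extended_cp else 14   # OFDM symbols per subframe
    return LTE_SUBCARRIERS[bandwidth_mhz] * n_symbols * n_rx * n_tx


# A 1 x 1 system at 1.4 MHz versus a 128 x 4 system at 20 MHz.
small = channel_matrix_size(1.4, n_rx=1, n_tx=1)
large = channel_matrix_size(20, n_rx=128, n_tx=4)
print(small, large, large // small)
```

Both the bandwidth and the antenna counts enter as multiplicative factors, which is why limiting the number of RRHs handled by each VBS directly bounds the size of the matrix that has to be estimated.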

V. MapReduce DESIGN AS A SOLUTION
In this work, the idea of distributed and parallel processing is implemented in the C-RAN architecture. To attain this objective, MapReduce is used. As highlighted earlier in Section III, MapReduce is a powerful framework for performing varied jobs in a distributed manner [6]. The advantage of adopting this framework is that it splits the task of obtaining the CSI into several parts, exploiting parallel processing to minimize the delay of CSI acquisition. Another benefit is its support of scalability, which is a key feature of next-generation cellular networks, such as 5G. Furthermore, the objective of using MapReduce is to utilize the processing capabilities of cloud computing by formulating C-RAN as a "group-to-one" mapping between the RRHs and the VBSs, as illustrated in Fig. 8. With this division, MapReduce breaks the relation between the scalability of the network and the channel estimation overhead, because the size of the matrix H is limited to a smaller number of RRHs and users. The operation of MapReduce requires two phases: first, it splits the input data, and then parallel processing is performed on the partitioned data. The mappers, or workers, contain the channel estimation algorithm (e.g., MMSE or LS) to acquire the CSI of the pre-specified group of RRHs. MapReduce can be deployed without the reducing phase [43]. Hence, this research amends the MapReduce framework in order to enhance its capabilities and gain the advantage of parallel processing.
The most important amendment is the use of the framework with the omission of the reducing functions, as reduction is not a prerequisite in this design. Therefore, in this work the 'reducing' phase can be renamed the 'scheduling' phase, since its main function is to pass the estimated CSI directly to the scheduler of the VBS. The detailed description of the proposed MapReduce design is given in Algorithms 1 and 2.

Algorithm 1 (fragment)
    ...
        for j = 1 to V do
            rrh(i, j) ← RRH(i, j)
        end for
    end for
    return (V, M, rrh)
end function
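The grouping and map-only processing described above can be sketched in Python as follows. This is an illustrative stand-in rather than the authors' Algorithms 1 and 2: the names `split_rrhs`, `csi_mapper`, and `run_map_only` are hypothetical, a thread pool plays the role of the mappers, and a per-element LS estimate stands in for the estimator that each mapper would run.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rrhs(rrh_ids, group_size):
    """Split the RRH identifiers into consecutive groups, one group per VBS."""
    return [rrh_ids[i:i + group_size] for i in range(0, len(rrh_ids), group_size)]


def csi_mapper(vbs_id, group, received, pilots):
    """Mapper: estimate the channel of every RRH in one group (toy per-element LS)."""
    estimates = {rrh: received[rrh] / pilots[rrh] for rrh in group}
    return vbs_id, estimates


def run_map_only(rrh_ids, group_size, received, pilots, max_workers=4):
    """Map-only job: no reduce step, each mapper output is keyed by its VBS id."""
    groups = split_rrhs(rrh_ids, group_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(csi_mapper, vbs_id, group, received, pilots)
            for vbs_id, group in enumerate(groups)
        ]
        return dict(future.result() for future in futures)


# Hypothetical toy data: 10 single-antenna RRHs, unit pilots, noiseless samples.
rrh_ids = list(range(10))
pilots = {r: complex(1.0, 0.0) for r in rrh_ids}
received = {r: complex(r, 0.0) for r in rrh_ids}
csi_by_vbs = run_map_only(rrh_ids, group_size=4, received=received, pilots=pilots)
print(csi_by_vbs)
```

Python threads do not give true CPU parallelism for this kind of workload, so the sketch only shows the structure: each group is processed independently and its result goes straight to the owning VBS.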

VI. ANALYTICAL DESCRIPTION TO THE OPERATION OF MapReduce USING QUEUEING THEORY
This section presents an analytical model to analyze the performance of the C-RAN architecture with MapReduce using queuing theory. Two main benefits can be achieved from grouping RRHs in C-RAN using MapReduce. Firstly, it can reduce the time of acquiring the CSI, since the size of the channel matrix is reduced according to the size of the group. Secondly, the network with this formulation can be scaled up without increasing or sharing the burden of computational complexity of the CSI acquisition among all elements of the network. Therefore, MapReduce relies on increasing the number of mappers (workers, servers, or processors) working in parallel to complete the processing of the input data. Hence, queuing theory is the most appropriate tool to describe the internal operation of the MapReduce framework. Generally, the objective of queuing analysis is to predict the performance of the system for the purpose of minimizing the total cost of waiting time and thus providing superior service. Likewise, using MapReduce reduces the time of CSI acquisition, and this increases the system utilization by increasing the data throughput. Hence, this section tries to answer two questions: i) why does the data throughput increase after using MapReduce? ii) why does the estimation time decrease after deploying the MapReduce framework?

Algorithm 2 (fragment)
    ...
    (V, M, rrh) = RRHs_splitting(N, n)
    - Select master node / controller
    - Make copies of the user program (e.g., MMSE estimator) for the M mappers
    for id = 1 to V    // V: number of VBSs
        (data) = Read(Y, X, Z)    // call the read function
        for i = 1 to n
            for j = 1 to K
                Y(i, j) = H(i, j) * X(i, j) + Z(i, j)
                Data(id) = Y(i, j)
            end for
        end for
        // call the MapReduce function
        result = mapreduce(read, @CSIAcquisitionMapper, @Scheduling_CSI);
        readall(result
    ...
        Add(outKVStore, 'VBS_id', H_estimate(i))
    While (i = M)
end function
From a queuing theory point of view, MapReduce can be represented as a multiple server queuing system (M/M/S).
For the purpose of accurate description for the proposed MapReduce framework in C-RAN, the following two assumptions have been considered in the calculations of the throughput and the waiting time.Firstly, both the arrival rate λ and the service rate µ are the same for all servers.This is for the purpose of fair comparison and to be compatible with the real simulation environment.Secondly, all servers are identical and have equal capabilities.

A. THROUGHPUT OF THE SYSTEM
In general, based on the principles of queueing theory, the throughput (TP) of the M/M/S system can be calculated by summing the throughput, i.e., the service rate µ, of all servers, as shown in the following formula [44]:

TP = \sum_{i=1}^{m} \mu_i

where m is the number of servers. The throughput of the pool of VBSs can be calculated by aggregating the throughput of all connected UEs in each VBS; the total TP of all VBSs then represents the total throughput of the pool of VBSs. The calculation of TP is expressed in three equations: the first gives the throughput of a single UE, the second (equation 11) aggregates the throughput of all UEs to give the throughput per cell or VBS, and the third, which follows from equation 11, gives the total throughput for the pool of VBSs. In these equations:
TP_UE: throughput of the user equipment
TP_VBS: throughput of the virtual base station
TP_pool of VBSs: throughput of the pool of virtual base stations
B: bandwidth
SINR: signal-to-interference-plus-noise ratio
z: number of UEs
Before adopting MapReduce, there was a problem in the scalability of the VBSs because of the extremely large channel matrices, which make the acquisition of the CSI a formidable task in C-RAN. After applying the grouping of the RRHs and the parallel processing of the MapReduce framework, the VBSs can be scaled up. It is worth noting that the throughput in C-RAN after adding MapReduce follows the characteristics of a multiple-server queuing system, in which the system throughput increases when the number of servers is scaled up.
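The three-level aggregation described above can be sketched as follows. The per-UE expression used here is the Shannon capacity, B·log2(1 + SINR), which is an assumption suggested by the symbols B and SINR listed in this section rather than a formula confirmed by the source; the sketch further assumes that every UE is allotted the full bandwidth B. All function names are hypothetical.

```python
import math


def ue_throughput(bandwidth_hz, sinr_linear):
    """Assumed per-UE throughput: Shannon capacity B * log2(1 + SINR)."""
    return bandwidth_hz * math.log2(1.0 + sinr_linear)


def vbs_throughput(bandwidth_hz, sinrs):
    """Throughput of one VBS: sum over the z UEs that it serves."""
    return sum(ue_throughput(bandwidth_hz, s) for s in sinrs)


def pool_throughput(bandwidth_hz, sinrs_per_vbs):
    """Throughput of the pool: sum over all VBSs."""
    return sum(vbs_throughput(bandwidth_hz, sinrs) for sinrs in sinrs_per_vbs)


# Hypothetical example: two VBSs with 1 MHz each; linear SINR values per UE.
sinrs_per_vbs = [[1.0, 3.0], [3.0]]
total = pool_throughput(1e6, sinrs_per_vbs)
print(total)
```

Adding a VBS to the pool adds its term to the outer sum, which mirrors how the throughput of a multiple-server queue grows with the number of servers.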

B. WAITING TIME
In queuing theory, the total waiting time (T_total), commonly known as the response time [45], includes two parts, as expressed in equation 13: the average waiting time in the queue (T_q) and the job service time (T_s), which is the time required for the job to be served in the server.

T_total = T_q + T_s    (13)

In the present design, MapReduce can be presented as a queuing system with multi-phase service. Therefore, the service time is divided into three phases: the classifier (T_c), which is the splitter of the incoming data, the mapping phase (T_M), and the reducing phase (T_R). Hence, the total service time for the MapReduce framework can be expressed as follows:

T_s = T_c + T_M + T_R    (14)

There is no need for the function of the reducing phase in the current design. The original reducing phase takes a set of intermediate key-value pairs produced by the mapper as input and runs a reducer function such as shuffling, sorting, filtering, aggregating, or combining. However, for link adaptation purposes, the VBS scheduler must directly receive the CSI acquired at each mapper without the extra reducer stage. Therefore, the reducing phase is excluded from the current design by applying MapReduce with zero reducers. This can be considered an advantage, since the reducing phase takes additional processing time and can become a bottleneck stage in MapReduce without careful design planning [46]. Hence, the total waiting time can be represented by:

T_total = T_q + T_c + T_M    (15)

It is worth stating that the total waiting time can be affected by the number of servers, the service rate per server, and the length of the queue. The rest of this section gives the mathematical representation of the components of the response time.

1) SERVICE TIME IN CLASSIFIER
The classifier is used to split the input data and send it to the servers, with a service rate β ≥ mµ to avoid a bottleneck at the classifier, where m is the number of servers. Therefore, the service time of the classifier can be expressed as T_c = 1/β.

2) SERVICE TIME IN MAPPING PHASE
The service time (T_M) of each mapper is assumed to be independent and exponentially distributed [47]. There are therefore two possible assumptions for determining the service time of the mapping phase: either all servers are identical and have the same capabilities, in which case the service time is α, where α = 1/µ, or the servers have different capabilities, in which case the service time of the mapping phase is taken as the time of the slowest server.
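The two service-time components above can be written as small helper functions. This is a sketch under the stated assumptions only (mean service times, a classifier rate of at least mµ, and the slowest mapper determining the mapping-phase time); the function names are hypothetical.

```python
def classifier_service_time(beta, num_servers, mu):
    """T_c = 1/beta, valid only when beta >= m * mu so the classifier is not a bottleneck."""
    if beta < num_servers * mu:
        raise ValueError("classifier would be a bottleneck: need beta >= m * mu")
    return 1.0 / beta


def mapping_service_time(service_rates):
    """T_M: 1/mu for identical servers, otherwise the time of the slowest server."""
    return max(1.0 / mu for mu in service_rates)


# Two identical mappers with mu = 4 jobs per unit time, classifier rate beta = 8.
t_c = classifier_service_time(beta=8.0, num_servers=2, mu=4.0)
t_m_identical = mapping_service_time([4.0, 4.0])
# Heterogeneous mappers: the slower one (mu = 2) determines T_M.
t_m_mixed = mapping_service_time([2.0, 4.0])
print(t_c, t_m_identical, t_m_mixed)
```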

3) WAITING TIME IN QUEUE
For the purpose of analyzing the average waiting time in the queue (T q ), a Poisson arrival λ is considered.Three different cases have been studied: (a) a queuing system with one server, (b) a queuing system with a number of servers equal to the number of chunks of input data, and (c) the queuing system with a number of servers less than the number of chunks of incoming data.

a: QUEUING SYSTEM WITH A SINGLE SERVER
In C-RAN, a VBS that manages hundreds of RRHs can be considered an M/M/1 queuing system. In the M/M/1 system, the incoming data experiences longer waiting times than in multi-server systems. The first of the N chunks of data enters the server without waiting, while the others wait in the queue, as shown in Fig. 9. Hence, the total waiting time ($TW_q$) in the queue can be described as follows:

$$TW_q = 0\cdot\alpha + 1\cdot\alpha + 2\cdot\alpha + \dots + (N-1)\,\alpha \quad (16)$$

From equation 16, the sum of positive numbers can be represented by the following:

$$\sum_{i=1}^{N-1} i = \frac{N(N-1)}{2} \quad (17)$$

The average waiting time $T_q$ can be calculated by dividing the total waiting time $TW_q$ by the total number of chunks of input data that received service:

$$T_q = \frac{TW_q}{N} = \frac{(N-1)\,\alpha}{2} \quad (18)$$

where $\alpha = 1/\mu$ is the service time. The response time then follows by adding the service time to $T_q$, as in equation 13.
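As a rough check of the single-server case, the sketch below adds up the per-chunk waits directly and compares them with a closed form. It assumes the paper's batch accounting, in which all N chunks are present at once and chunk i waits for the i chunks ahead of it; the function names are hypothetical.

```python
def single_server_wait(n_chunks, alpha):
    """Total and average queue wait when N chunks share one server.

    Chunk i (0-based) waits i * alpha, so the first chunk does not wait.
    """
    waits = [i * alpha for i in range(n_chunks)]
    total = sum(waits)
    return total, total / n_chunks


def single_server_wait_closed_form(n_chunks, alpha):
    """Same quantities from the arithmetic series 0 + 1 + ... + (N - 1)."""
    total = alpha * n_chunks * (n_chunks - 1) / 2
    return total, total / n_chunks


if __name__ == "__main__":
    print(single_server_wait(5, 1.0))
    print(single_server_wait_closed_form(5, 1.0))
```

Both functions agree, and the average grows linearly with N, which is why the single-server case scales poorly.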

b: QUEUING SYSTEM WITH NUMBER OF CHUNKS OF ARRIVAL DATA IS EQUAL TO NUMBER OF SERVERS
In this case, the number of chunks of data (N), which represent the data from the groups of RRHs in the queuing system, is equal to the number of servers (m) in the multiple-server system, as shown in Fig. 10. Because N = m, all N incoming chunks enter the m servers at the same time. Hence, there is no waiting time in the queue ($T_q = 0$). All the data is processed and finished concurrently, so the time needed by the mapping phase to finish processing all the data in the system equals the total service time $T_s$. Therefore, the total time reduces to $T_{total} = T_s$.

c: QUEUING SYSTEM WITH NUMBER OF SERVERS LESS THAN THE NUMBER OF CHUNKS OF ARRIVAL DATA (N)
The system shown in Fig. 11 is more realistic, since under normal conditions there are more chunks of incoming data than available servers. The average waiting time in the queue is calculated as follows. According to Fig. 11, the number of chunks of input data N can be expressed as:

$$N = aK + b$$

where N is the number of chunks or jobs (groups of tasks) of arrival data, K is the number of servers, b is the remainder of N/K, and a is the number of times the input data fills all K servers.

From Fig. 11, it is noticeable that the first K chunks of data enter service directly without waiting in the queue, $(0)\delta$, the second K chunks wait $(1)\delta$, the third K chunks wait $(2)\delta$, and this continues for (a-1) times, where δ is the sum of the service times of the classifier and the mapping phase ($\delta = T_c + T_M$). The total waiting time in the queue (equation 24) is the sum of these waits. Dividing equation 24 by N gives the average waiting time, from which the overall response time follows.

The former part of this work investigated the potential of reducing the overall network computational complexity. This has been achieved by processing the data of a group of RRHs instead of the entire number of RRHs in the network. To complement this, the next part of this paper analyzes the computational complexity per VBS using a modified MMSE estimator, minimizing the execution time of the matrix inversion through fast matrix inversion by Strassen's algorithm and Block LU decomposition.
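The multi-server queuing case (c) of the previous section can be sketched numerically. This is an illustration under stated assumptions, not the paper's equations 24 to 26: each batch of K chunks waits one more δ than the previous batch, the b leftover chunks are assumed to wait aδ, and the response time is taken as the average wait plus one service period δ. All function names are hypothetical.

```python
def multi_server_wait(n_chunks, n_servers, delta):
    """Total and average queue wait when chunks are served K at a time.

    Chunk i (0-based) belongs to batch i // K and waits that many service periods.
    """
    total = sum((i // n_servers) * delta for i in range(n_chunks))
    return total, total / n_chunks


def multi_server_wait_closed_form(n_chunks, n_servers, delta):
    """Same quantities using N = a*K + b (b leftover chunks wait a * delta, an assumption)."""
    a, b = divmod(n_chunks, n_servers)
    total = delta * (n_servers * a * (a - 1) / 2 + a * b)
    return total, total / n_chunks


def response_time(n_chunks, n_servers, delta):
    """Average wait plus one service period delta = T_c + T_M (assumed form)."""
    _, average = multi_server_wait(n_chunks, n_servers, delta)
    return average + delta


if __name__ == "__main__":
    print(multi_server_wait(10, 4, 2.0))   # batches of 4, 4 and 2 chunks
    print(multi_server_wait(4, 4, 2.0))    # N equal to K: nobody waits
    print(multi_server_wait(5, 1, 1.0))    # K equal to 1: the single-server case
```

Setting K = N reproduces case (b) with zero waiting, and K = 1 reproduces case (a), so the three cases are instances of one accounting rule.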

VII. REDUCING PROCESSING TIME OF MATRIX INVERSION USING FAST MATRIX INVERSION ALGORITHMS
In the MMSE estimator, matrix inversion is considered the main source of complexity [14], [18]. This section presents a novel idea of combining two algorithms: Block LU decomposition and Strassen's algorithm. Both algorithms share the principle of dividing the matrix into sub-blocks of small matrices to find the inverse or the product of matrices. Strassen's algorithm reduces the complexity of matrix inversion and multiplication from $O(n^3)$ to $O(n^{2.807})$ [48]. Block LU also improves computing efficiency by speeding up the execution time and utilizing memory hierarchies efficiently [49]. The details of both algorithms are given in the next sections. It is worth noting that although the reduction in complexity with Strassen might be small for a single matrix inversion, it applies to every inversion. In other words, the aggregated saving in the estimation time (ET) of the MMSE estimator will be large, since the MMSE includes more than one matrix inversion operation, as shown in Equation (7).
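To give a feel for the size of this reduction, the sketch below counts scalar multiplications only, assuming the recursion is carried all the way down to 1 × 1 blocks and ignoring the extra additions that Strassen's scheme introduces. The function names are hypothetical.

```python
import math


def classical_multiplications(n):
    """Scalar multiplications in the schoolbook product of two n x n matrices."""
    return n ** 3


def strassen_multiplications(n):
    """Scalar multiplications in Strassen's scheme for n a power of two.

    Each halving of the matrix replaces 8 block products with 7.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    levels = n.bit_length() - 1
    return 7 ** levels


if __name__ == "__main__":
    for n in (8, 64, 128):
        print(n, classical_multiplications(n), strassen_multiplications(n))
    print(math.log2(7))   # the exponent quoted as 2.807
```

The exponent 2.807 is log2(7), and the ratio between the two counts shrinks as n grows, so the saving per inversion becomes larger for larger matrices.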

A. STRASSEN's ALGORITHM
In this algorithm, the matrix inversion or multiplication is calculated by partitioning the matrix into smaller square matrices. Hence, in Strassen's algorithm these operations are applied to small sub-matrices instead of directly to one large matrix. The inversion of a large matrix is thereby converted into inversions and multiplications of smaller matrices. Strassen's algorithms for matrix multiplication and inversion are explained in the following subsections.
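The multiplication step can be illustrated with the textbook form of Strassen's seven-product recursion. This is a minimal pure-Python sketch for square matrices whose size is a power of two, not necessarily the exact formulation used in the paper; the helper names are hypothetical.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(A):
    """Split a square matrix into its four half-size quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def join(C11, C12, C21, C22):
    """Reassemble four quadrants into one matrix."""
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])


def strassen_multiply(A, B):
    """Multiply two n x n matrices (n a power of two) with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen_multiply(add(A11, A22), add(B11, B22))
    M2 = strassen_multiply(add(A21, A22), B11)
    M3 = strassen_multiply(A11, sub(B12, B22))
    M4 = strassen_multiply(A22, sub(B21, B11))
    M5 = strassen_multiply(add(A11, A12), B22)
    M6 = strassen_multiply(sub(A21, A11), add(B11, B12))
    M7 = strassen_multiply(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    return join(C11, C12, C21, C22)


if __name__ == "__main__":
    print(strassen_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In practice the recursion is usually stopped at a moderate block size and a conventional product is used below it, since the additions dominate for very small blocks.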

2) MATRIX INVERSION IN STRASSEN's ALGORITHM
As mentioned earlier, matrix inversion is achieved by breaking the inversion down into multiplications of several matrices. To find the inverse $Z = X^{-1}$ of a square matrix X, the matrix X is divided into half-size sub-matrices. The size of matrix X is $N = m 2^k$, where m and k are positive integers and $2^k$ is the size of the sub-matrices. Strassen's algorithm for matrix inversion has been studied broadly [53], [54]. Its steps are given by equations (28), (29) and (30): the matrix is partitioned into sub-matrices, the intermediate products and sub-inverses are calculated, and the final blocks of the inverse are assembled. The algorithm for equations (28), (29) and (30) is illustrated in Algorithm 4, which is reformulated from [54].
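The inversion recursion can be sketched with the textbook form of Strassen's block inversion, which needs two half-size inversions and a handful of products per level. This is an illustration, not the paper's Algorithm 4: it assumes the size is a power of two and that every leading block met during the recursion is nonsingular, and it uses an ordinary product (`matmul`) where a full implementation would use Strassen multiplication. All names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def neg(A):
    return [[-a for a in row] for row in A]


def split(A):
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def join(C11, C12, C21, C22):
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])


def strassen_inverse(X):
    """Invert an n x n matrix (n a power of two) by recursive 2 x 2 block inversion."""
    if len(X) == 1:
        return [[1.0 / X[0][0]]]
    A11, A12, A21, A22 = split(X)
    R1 = strassen_inverse(A11)          # inverse of the leading block
    R2 = matmul(A21, R1)
    R3 = matmul(R1, A12)
    R4 = matmul(A21, R3)
    R5 = sub(R4, A22)                   # negative Schur complement
    R6 = strassen_inverse(R5)
    C12 = matmul(R3, R6)
    C21 = matmul(R6, R2)
    C11 = sub(R1, matmul(R3, C21))
    C22 = neg(R6)
    return join(C11, C12, C21, C22)


if __name__ == "__main__":
    print(strassen_inverse([[4.0, 7.0], [2.0, 6.0]]))
```

Because no pivoting is performed, the sketch fails on matrices with a singular leading block even when the full matrix is invertible; this is one practical reason for pairing the inversion with a pivoted decomposition.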

4) CHALLENGE OF STRASSEN's ALGORITHM
The main challenge in Strassen's algorithm is that the dimension of the matrix X must be a power of 2, i.e. $2^k$, where k is an integer [17]. This might be the main reason why the algorithm is uncommon in practical matrix operations. However, the dimension of the channel information matrices whose complexity needs to be reduced is not always a power of 2. Hence, in this research, two methods have been studied to generalize Strassen's algorithm. These methods are illustrated as follows:

a: INCREASING THE DIMENSION OF MATRIX TO THE NEXT HIGHER POWER OF 2
As mentioned earlier, Strassen's algorithm requires the dimension of the matrix to be a power of 2. To overcome this limitation, the matrix can be scaled up to the next power of 2 by appending rows and columns of zeros, with ones on the main diagonal, at the end of the input matrix. The padded size can be obtained with the Matlab function ''nextpow2(n)'', which returns the exponent p of the next power of 2, so that the padded dimension is $2^p$. The following algorithm and numerical example clarify the operation of this technique.

1: Read matrix M
2: Check the dimension (n) of M
3: if dimension (n) of M is not a power of 2
4: // Check if M is a square matrix
5: if size(M,1) ~= size(M,2)
6: error('The matrix must be square.')
7: end if
8: Scale up the size of M to the next power of 2
9: Set 1 for the main diagonal
10: Set 0 for the additional rows and columns

For the 5 × 5 matrix M of the numerical example, the next power of 2 after 5 is 8; hence, the padded matrix $M_1$ has size 8 × 8:

$$M_1 = \begin{bmatrix}
1 & 2 & 5 & 7 & 8 & 0 & 0 & 0 \\
3 & 2 & 7 & 2 & 1 & 0 & 0 & 0 \\
5 & 2 & 7 & 2 & 1 & 0 & 0 & 0 \\
2 & 2 & 4 & 5 & 6 & 0 & 0 & 0 \\
9 & 7 & 1 & 5 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

from which $M_1^{-1}$ and then $M^{-1}$ are obtained.

It is worth stating that this method works with Strassen's algorithm. However, it contradicts the aim of this research, which is to reduce the complexity of acquiring the channel information of large channel matrices. For large channel matrices, the computational complexity increases when the size of matrix M is scaled up to the next power of 2, even though the additional rows and columns are zeros. Hence, it is an inefficient approach for this research. To verify the system performance with this technique, a simulation test is conducted with the following settings: 5 UEs, 1 VBS, 64 antennas at the VBS, 4 antennas at the UEs, and 1.4 MHz bandwidth. The results in Fig. 12 show the performance of the network before and after using Strassen's algorithm with the next-power-of-2 technique. The degradation in performance comes from enlarging the original matrix to a larger size that is a power of 2.
In the next section, the Block LU decomposition is proposed as a novel idea to tackle the problem of dimensions in Strassen's algorithm. It is worth stating that in 5G networks the target end-to-end latency is 1ms, as mentioned earlier. Hence, any reduction in the estimation time will contribute to the overall latency minimization.
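The padding technique of the first method can be sketched as follows. The helper names are hypothetical, and the sketch only covers the padding and cropping steps; any power-of-two inversion routine can be applied in between, because the padded matrix is block diagonal and its inverse contains the inverse of M in its top-left corner.

```python
def next_power_of_two(n):
    """Smallest power of two that is not less than n."""
    p = 1
    while p < n:
        p *= 2
    return p


def pad_to_power_of_two(M):
    """Embed a square matrix in the top-left corner of a power-of-two identity-padded matrix."""
    n = len(M)
    if any(len(row) != n for row in M):
        raise ValueError("The matrix must be square.")
    size = next_power_of_two(n)
    padded = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            padded[i][j] = M[i][j]
    for i in range(n, size):
        padded[i][i] = 1.0              # ones on the added part of the main diagonal
    return padded


def crop(M, n):
    """Return the top-left n x n block, undoing the padding."""
    return [row[:n] for row in M[:n]]


if __name__ == "__main__":
    M = [[1, 2, 5, 7, 8],
         [3, 2, 7, 2, 1],
         [5, 2, 7, 2, 1],
         [2, 2, 4, 5, 6],
         [9, 7, 1, 5, 2]]
    for row in pad_to_power_of_two(M):
        print(row)
```

The cost of this approach is visible in the sizes alone: a 65 × 65 matrix would be padded to 128 × 128, almost four times as many entries, which is consistent with the inefficiency reported above.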

B. GENERALIZATION OF STRASSEN's ALGORITHM USING BLOCK LU DECOMPOSITION
Block LU decomposition is an approach used to speed up matrix operations [56]. The idea is to break down the high computation overhead of a big matrix into a set of smaller blocks of sub-matrices. Block LU is used in this research to generalize Strassen's algorithm, which requires a dimension that is a power of 2. Through Block LU, it is possible to make the size of the sub-matrix blocks equal to a power of 2 ($n = 2^k$), to fit the requirement of Strassen's algorithm. The Block LU method is an improvement on the standard LU factorization approach. In Block LU decomposition, the matrix inversion or multiplication operations start after dividing the upper and lower triangular matrices into sub-blocks of small matrices. As a result of this division, the length of the operation vectors is reduced from the original matrix size N to n, the size of the sub-blocks. This approach has been studied extensively, for example in [57] and [58]. In general, the procedure can be classified into three main parts: partitioning the original matrix into L and U triangular matrices; dividing the lower and upper triangular matrices into sub-blocks of size n; and finding the inverse of each block and augmenting the results of the sub-matrices to determine the inverse of the original matrix. The details of Block LU decomposition and of the combined Block LU-Strassen algorithm are explained in the following sections.

1) PART 1 (MATRIX DECOMPOSITION INTO L AND U MATRICES)
Let Y be a square matrix of order N. The decomposition of Y into the product of a lower and an upper triangular matrix (Y = LU) is illustrated in equation (35), where the main-diagonal entries $l_{ii}$ of the lower triangular matrix are set to one and the unwritten entries of both L and U are zeros. The remaining values can be calculated using equations 38 and 39. A pivoting matrix P is also calculated by decomposing PY instead of Y (PY = LU) to maintain high numerical stability and accuracy, where P is a permutation matrix of the rows: its entries are 1s and 0s, and each row and each column contains exactly one 1. The P matrix has no effect on the final result of the matrix inverse.
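A pivoted decomposition PY = LU of the kind described in Part 1 can be sketched with standard Doolittle elimination and partial pivoting. This is a generic textbook routine offered as an illustration, not a transcription of equations 38 and 39; the function names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def lu_decompose(Y):
    """Return (P, L, U) with P*Y = L*U, L unit lower triangular, U upper triangular."""
    n = len(Y)
    U = [list(map(float, row)) for row in Y]
    L = [[0.0] * n for _ in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k onto the diagonal.
        pivot = max(range(k, n), key=lambda r: abs(U[r][k]))
        if U[pivot][k] == 0.0:
            raise ValueError("The matrix is singular.")
        if pivot != k:
            U[k], U[pivot] = U[pivot], U[k]
            P[k], P[pivot] = P[pivot], P[k]
            L[k], L[pivot] = L[pivot], L[k]   # keep earlier multipliers aligned with their rows
        L[k][k] = 1.0
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return P, L, U


if __name__ == "__main__":
    Y = [[0, 2, 1], [1, 1, 0], [2, 1, 1]]   # needs a row swap: the first pivot is zero
    P, L, U = lu_decompose(Y)
    print(P)
    print(L)
    print(U)
```

The example matrix has a zero in its top-left corner, so an unpivoted factorization would fail on it; the row permutation P is what keeps the elimination well defined.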
2) PART 2 (PARTITIONING L AND U INTO SUB-BLOCKS)
In this stage, the original LU factorization is written in the form of sub-matrices (equation 40), from which the sub-matrices used below are obtained. The following steps are then followed to find them. First, if the size of the sub-matrix $Y_1$ reaches the desired dimension (n), $Y_1$ is decomposed to find $L_1$, $U_1$ and $P_1$, and $L_1^{-1}$ and $U_1^{-1}$ are computed; otherwise, partitioning continues into smaller matrices. After obtaining $\hat{L}_2$ and $U_2$, $\hat{Y} = Y_4 - \hat{L}_2 U_2$ can be found; if the dimension of $\hat{Y}$ is equal to the block size n, $\hat{Y}$ is decomposed to find $L_3$ and $U_3$; otherwise, partitioning continues into smaller matrices. Finally, $L_2$ is obtained from $P_2$ and $\hat{L}_2$.

It is worth mentioning that if $Y_1$ and $\hat{Y}$ are not small enough (size larger than n) at the stage where they are calculated, the partitioning and the calculation of the intermediate sub-matrices continue recursively.
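The recursive partitioning can be sketched in a simplified form. The sketch assumes the block structure Y = [[Y1, Y2], [Y3, Y4]] = [[L1, 0], [L2, L3]] · [[U1, U2], [0, U3]] and, to stay short, omits pivoting, so L2 here plays the role of the unpermuted factor and no P1 or P2 appears. It is an illustration of the idea rather than the paper's procedure, and all names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def transpose(A):
    return [list(r) for r in zip(*A)]


def lower_inverse(L):
    """Invert a lower triangular matrix column by column with forward substitution."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            rhs = (1.0 if i == j else 0.0) - sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = rhs / L[i][i]
    return X


def upper_inverse(U):
    return transpose(lower_inverse(transpose(U)))


def doolittle(Y):
    """Unpivoted LU of a small block: Y = L*U with a unit diagonal in L."""
    n = len(Y)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [list(map(float, row)) for row in Y]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def block_lu(Y, block_size):
    """Recursive block LU: partition until a block is no larger than block_size."""
    n = len(Y)
    if n <= block_size:
        return doolittle(Y)
    h = n // 2
    Y1 = [r[:h] for r in Y[:h]]
    Y2 = [r[h:] for r in Y[:h]]
    Y3 = [r[:h] for r in Y[h:]]
    Y4 = [r[h:] for r in Y[h:]]
    L1, U1 = block_lu(Y1, block_size)
    U2 = matmul(lower_inverse(L1), Y2)        # from Y2 = L1 * U2
    L2 = matmul(Y3, upper_inverse(U1))        # from Y3 = L2 * U1
    Y_hat = sub(Y4, matmul(L2, U2))           # Y_hat = Y4 - L2 * U2 = L3 * U3
    L3, U3 = block_lu(Y_hat, block_size)
    L = [a + [0.0] * (n - h) for a in L1] + [a + b for a, b in zip(L2, L3)]
    U = [a + b for a, b in zip(U1, U2)] + [[0.0] * h + b for b in U3]
    return L, U


if __name__ == "__main__":
    Y = [[4, 1, 0, 1], [1, 5, 1, 0], [0, 1, 6, 1], [1, 0, 1, 7]]
    L, U = block_lu(Y, 2)
    print(L)
    print(U)
```

The example uses a diagonally dominant matrix so that the unpivoted base case never meets a zero pivot; a general implementation would carry the permutation matrices through each level as the text describes.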

3) PART 3 (AUGMENTING THE RESULTS OF SUB-MATRICES AND FINDING THE FINAL INVERSE)
The inverses of the original lower triangular matrix L and upper triangular matrix U can be calculated in parallel, since they are independent, by augmenting all the sub-matrices of L ($L_1$, $L_2$, $L_3$) and U ($U_1$, $U_2$, $U_3$); the permutation matrix P is obtained by augmenting $P_1$ and $P_2$. Algorithm 6 is taken from the Block LU algorithm in [57].

Finally, since PY = LU, the inverse of matrix Y is $Y^{-1} = U^{-1} L^{-1} P$. Strassen's inversion is faster than the traditional inverse algorithms used in the traditional MMSE, since the complexity is reduced to $O(N^{2.807})$. In this work, to generalize Strassen's algorithm, the Block LU decomposition is proposed with a modification: Strassen's inversion is used as the core of the Block LU decomposition instead of traditional inversion, as shown in Algorithm 7. The proposed algorithm ensures lower complexity, and hence lower latency, compared with Algorithm 6.
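One way to picture the combination is sketched below: the triangular factors are inverted block by block, a Strassen-style inversion is applied to each diagonal block of power-of-two size, and the result is assembled from PY = LU. This is an interpretation under stated assumptions, not the paper's Algorithm 7: the matrix size must be a multiple of the block size, the block size must be a power of two, ordinary products stand in for Strassen products, and all names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def neg(A):
    return [[-a for a in row] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def split(A, h):
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def join(C11, C12, C21, C22):
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])


def strassen_inverse(X):
    """Recursive 2 x 2 block inversion for a power-of-two sized matrix."""
    if len(X) == 1:
        return [[1.0 / X[0][0]]]
    A11, A12, A21, A22 = split(X, len(X) // 2)
    R1 = strassen_inverse(A11)
    R2 = matmul(A21, R1)
    R3 = matmul(R1, A12)
    R6 = strassen_inverse(sub(matmul(A21, R3), A22))
    C21 = matmul(R6, R2)
    return join(sub(R1, matmul(R3, C21)), matmul(R3, R6), C21, neg(R6))


def block_lower_inverse(L, block_size):
    """Invert a lower triangular matrix one diagonal block at a time."""
    n = len(L)
    if block_size & (block_size - 1) or n % block_size:
        raise ValueError("size must be a multiple of a power-of-two block size")
    if n == block_size:
        return strassen_inverse(L)                 # Strassen inversion as the core
    L1, _, L2, L3 = split(L, block_size)
    L1_inv = strassen_inverse(L1)
    L3_inv = block_lower_inverse(L3, block_size)
    lower_left = neg(matmul(L3_inv, matmul(L2, L1_inv)))
    zeros = [[0.0] * (n - block_size) for _ in range(block_size)]
    return join(L1_inv, zeros, lower_left, L3_inv)


def inverse_from_lu(P, L, U, block_size):
    """Assemble the inverse of Y as inv(U) * inv(L) * P, given P*Y = L*U."""
    L_inv = block_lower_inverse(L, block_size)
    U_inv = transpose(block_lower_inverse(transpose(U), block_size))
    return matmul(U_inv, matmul(L_inv, P))


if __name__ == "__main__":
    L = [[1, 0, 0, 0], [2, 1, 0, 0], [0, 3, 1, 0], [1, 0, 2, 1]]
    U = [[2, 1, 0, 1], [0, 1, 2, 0], [0, 0, 4, 1], [0, 0, 0, 2]]
    P = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(inverse_from_lu(P, L, U, 2))
```

The two triangular inversions are independent of each other, which is what allows them to run in parallel, and only the diagonal blocks need the power-of-two constraint.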

VIII. SIMULATION RESULTS AND DISCUSSION
This section is divided into two parts. The first part presents the results of deploying the MapReduce framework in the C-RAN network. The second part contains the results of implementing fast matrix algorithms in the MMSE estimator. MATLAB R2016a is used to run the simulation tests. The simulation parameter settings are listed in Table 4.

At the start, to quantify the impact of the high computational complexity of large channel matrices on the performance of the network, the first simulation test is conducted with 64 and 128 antennas at one VBS using the MMSE estimator with 1.4 MHz bandwidth. Note that in this work each RRH is deployed with a single antenna; therefore, the words antenna and RRH are used interchangeably. For the purpose of comparison, the test has been repeated using perfect channel estimation. As shown in Fig. 13, the results reveal that the MMSE and the perfect channel estimation differ significantly in terms of data throughput. Perfect (theoretical) estimation means that no prior acquisition processing is necessary, because the VBS has perfect knowledge of the CSI. Consequently, unlike the real estimator (MMSE), the perfect estimator eliminates the issue of high computation overhead in obtaining the CSI with a large number of antennas at the VBS. The subsequent test uses distributed processing in the C-RAN architecture to reduce the computational complexity faced with a large number of antennas.
The aim of deploying MapReduce is to switch the processing technique in C-RAN from a centralized to a distributed one. This reduces the size of H and minimizes the acquisition time of the CSI.
Figure 14 shows the possibility of expanding the number of antennas without raising the computation overhead, while keeping the data throughput of the network the same. Fig. 15 illustrates the increase in spectral efficiency, and Fig. 16 shows this increase as a percentage. In other words, the scalability of C-RAN with MapReduce increases the spectral efficiency of the network as the number of distributed RRHs rises. Simultaneously, the estimation time and the overall response time (RT) remain constant at the values of the applied group size. This is due to processing small, manageable channel matrices instead of one large matrix with high computational complexity.
The response time can increase dramatically in the central processing approach, as shown in Fig. 17. Such increases can be controlled through clustering with the MapReduce approach. Here, the response time has been maintained for groups of 8, 16 and 32 antennas. The response time is a crucial factor when considering next generation networks that require low end-to-end latency.
Figure 18 also demonstrates the gain in the CSI acquisition time for different groups of RRHs. The figures demonstrate that the total estimation time can be minimized when grouping is performed. In addition, the estimation time can be limited by allocating a group of RRHs to each VBS, which reduces the computational overhead of big channel matrices. The advantage is that this meets the low coherence time requirements of high carrier frequencies and UE speeds, and improves the accuracy of the CSI, which conveys the state of the communication link between the UEs and the VBSs. The proposed distribution approach is beneficial for the next generation of 5G networks, since the latency will be minimized to a significant level, as shown in Fig. 2, to meet future time-critical technologies. Simultaneously, a large number of antennas can be used in a scalable manner without raising the problem of acquisition overhead in the whole network. The advantage is that in the cloud, the data and the CSI can be completely distributed between VBSs [27]. Therefore, instead of employing a large number of antennas/RRHs per VBS (which causes a high overhead), with MapReduce a set of VBSs, each with a pre-specified group of RRHs, is used to obtain the CSI.
The results in Fig. 19 demonstrate the possibility of reducing the estimation time for the CSI based on the chosen group size. When the group size increases, the size of the channel matrix increases as well. Hence, the highest percentage reduction in RT is observed with the smallest group (8 antennas), and this percentage decreases as the size of the group increases. In Fig. 20, both the simulation and the theoretical results are drawn to compare them in terms of data throughput. The results show almost the same trend for the analytical and simulation results. Fig. 21 presents a comparison between the overall response time of the simulation results and the theoretical calculations against the number of deployed servers. The result indicates that as the number of servers increases, the estimation time and the overall response time decrease proportionally. Therefore, adding more VBSs to a MapReduce-based C-RAN is recommended to reduce the response time. There is a noticeable difference between the analytical and the simulation results in Fig. 21. The simulation takes several parameters into account, and the simulation tool facilitates the calculations, whereas the analytical method considers fewer parameters for the sake of simplicity.
Part 2 (Simulation Results of C-RAN With Fast Matrix Algorithms): Several tests are conducted to examine the proposed fast matrix algorithms. At the start, to clarify the speed of Strassen's algorithm, a comparison of the processing time is made between Strassen's inverse algorithm and the traditional inverse function. Two main points are illustrated in Fig. 22: Strassen's algorithm requires less processing time, and it gives more gain in time as the size of the matrix is scaled up. However, this test is limited to matrices whose dimension is a power of 2.
Applying the combination of Strassen's algorithm and Block LU enables a considerable reduction in the estimation time of acquiring the CSI, due to the decrease in the processing time of matrix inversion. The result in Fig. 23 illustrates the percentage gain in the estimation time of channel information when using the Block LU-Strassen algorithms over traditional inversion. As mentioned earlier, Strassen's algorithm can support scalability. As the number of antennas (or equivalently the size of the channel matrix) increases, the gain of execution time decreases.
The results in Figures 24 and 25 show a noticeable improvement in the data throughput of the network when the Block LU-Strassen algorithms are used to calculate the matrix inverse in the MMSE estimator for VBSs with 64 and 128 RRHs. This is due to the reduction in the processing time of the matrix inversion, which leads to a reduction in the estimation time of the CSI acquisition. Therefore, the accuracy of the CSI is improved by adapting the communication channel between the UE and the VBS more quickly.
Figure 26 compares the network performance, in terms of data throughput, of the two methods used in this research for generalizing Strassen's algorithm. The results show that Strassen-Block LU is more efficient, since it minimizes the initial system delay and speeds up the system response time.
The advantage of reducing the estimation time with Strassen-Block LU, particularly in the case of 64 antennas at the VBS, is that the size of the group of RRHs in MapReduce can be increased to 64 RRHs with acceptable system performance. Hence, a C-RAN network with 128, 256, 512 or 1024 RRHs can be represented by groups of 2, 4, 8 or 16 VBSs respectively, with 64 RRHs per VBS. For instance, Fig. 27 illustrates that 128 antennas can be represented by two VBSs with a group size of 64 RRHs. This scenario cannot be implemented with traditional matrix inversion because of its high execution time.
It is worth mentioning that both of the proposed techniques (MapReduce and fast matrix algorithms) provide a considerable reduction in the computational complexity of acquiring the CSI. The overall reduction is summarized in Table 5.

IX. CONCLUSION
The architecture of C-RAN provides central network management and reduces the expenses of network deployment. However, it has challenging computational complexity, which limits network scalability. Two novel approaches have been developed for the C-RAN architecture to reduce the computational complexity of acquiring channel state information while maintaining network scalability to meet the demands of future 5G networks. The proposed approaches are supported by results that show an improvement in system performance (particularly data throughput) due to a significant reduction in the acquisition time. MapReduce, as a distributed processing framework, can minimize the per-network estimation time, based on the size of the group of antennas. Additionally, fast matrix algorithms reduce the per-VBS estimation time by decreasing the execution time of the matrix inversion in the MMSE estimator. Further work will involve using a dynamic group size based on the optimal available capacity; in addition, planning and optimizing the capacity of the VBSs of C-RAN in cloud computing requires more investigation.

FIGURE 2. Theoretical latency requirement in 4G versus the expected 1ms in 5G.

FIGURE 3. The general diagram of MapReduce framework.

FIGURE 4. Comparison between MMSE and LS estimators in terms of MSE.

FIGURE 5. The rise of estimation time in relation to the number of antennas.

FIGURE 6. Percentage of latency increase versus number of antennas at VBS.

FIGURE 7. The increase in the dimension of H with the growth in RRH for different bandwidths.

Algorithm 5 (remaining steps):
11: $M^{-1}$ = Strassen_inv(B, Block_size);
12: Return the original size of M
13: Return ($M^{-1}$)
14: End if

b: NUMERICAL EXAMPLE
M is a matrix of dimension 5 × 5, as shown below. To find the inverse of M using Strassen's algorithm, the steps in Algorithm 5 above are applied as follows.

FIGURE 12. Network performance using Strassen with next power of 2.

FIGURE 13. Effect of expansion on the cell throughput with 64 and 128 antennas per VBS, using MMSE and perfect estimation algorithms.

FIGURE 15. Spectral efficiency with the rise in the number of antennas.

FIGURE 16. Percentage of increase in the spectral efficiency compared to the legacy 8-antenna system.

FIGURE 17. Total response time for three groups of 8, 16 and 32 antennas at each VBS.

FIGURE 19. Reduction gain (%) in response time for different RRH groups.

FIGURE 24. Throughput per cell for 64 antennas at the VBS with fast algorithms (Block LU + Strassen).

FIGURE 25. Throughput per cell for 128 antennas at the VBS with fast algorithms (Block LU + Strassen).

FIGURE 26. System performance comparison between the proposed methods for generalizing Strassen's algorithm.

FIGURE 27. Throughput per pool of VBSs for scalable number of antennas (group of 2 VBSs with 64 antennas) using MapReduce.

TABLE 1. List of notations.

TABLE 3. The dimension of the channel matrix H against the estimation time overhead increase.