GPU-Based Sparse Power Flow Studies with Modified Newton’s Method

Power systems are becoming larger and more complicated due to the development of multiple energy supplies. Solving large-scale power flow equations efficiently plays an essential role in analyzing power systems and optimizing their performance during normal and contingency operation. The traditional Newton-Raphson (NR) algorithm used for power flow calculation is computationally expensive because the Jacobian matrix must be updated in each iteration. As an alternative to updating the Jacobian matrix repeatedly, this paper presents a GPU-based sparse modified Newton's method that introduces a fixed Jacobian matrix and integrates vectorization and parallelization techniques to accelerate power flow calculations. Moreover, this paper also investigates the performance of the corresponding CPU versions and a MATLAB-based library package, MATPOWER. The comparison of results on several transmission and distribution systems demonstrates that the GPU variant is more reliable and faster for power flow calculation in large-scale power systems.


I. INTRODUCTION
Power flow studies are one of the most important aspects of power system planning and operation [1]. Nowadays, power system modeling and analysis have become challenging for power engineers due to the introduction of new energy supplies and heavier loading, which puts great pressure on power flow calculation [2]. In practice, the conventional NR solver typically uses a dense matrix format, which consumes substantial computational resources and storage space. Because of these shortcomings, the traditional NR solver often crashes or fails to converge for power systems with over a thousand buses.
The introduction of Graphics Processing Units (GPUs) has brought a revolution in the parallel computing arena [3]. With the benefits of high floating-point processing performance, huge memory bandwidth, and low cost [4], GPUs are widely applied not only in many innovative areas such as artificial intelligence [5], scientific simulation [6]-[9], cryptography [10], [11], integrated circuit analysis [12]-[14], and medical imaging [15], but also in many power system applications including optimal power flow [16], power flow [17]-[26], transient stability simulation [27], [28], and contingency analysis [29]. The Compute Unified Device Architecture (CUDA), as a general-purpose computing platform, supports many languages including C++, Python, Fortran, OpenCL, OpenMP, and more [30], which makes it easier to exploit the salient features of heterogeneous computing systems at a low cost of learning a new programming language, eventually achieving dramatic speedups in computing performance [3]. Furthermore, for the benefit of developers working in other languages, NVIDIA CUDA provides an opportunity to use the CuPy library to solve the power flow problem in a vectorized, parallelized manner. With the help of these features, one can exploit the Compressed Row Storage (CRS) matrix format to save memory and use a GPU-based approach to accelerate the power flow calculation for higher efficiency.
GPU-based parallel power flow calculation, along with methods to improve its performance, has been addressed in the literature, including [2], [3], [18]-[20], [23], [29], [31], and [32]. Reference [3] suggested that a GPU-based parallel Newton-Raphson method can achieve a speedup ratio of 3.27 times for a system with 300 buses. Reference [32] presented a speedup of 2.86 times for a large power system with a GPU-based fast decoupled power flow (FDPF) method. Reference [23] shows that a GPU-based parallel power flow implementation of the conjugate gradient method with a Chebyshev preconditioner gives a speedup of 10.8 times. Performance comparisons of GPU-based Gauss-Seidel (GS), Newton-Raphson (NR), and Fast-Decoupled (FD) methods are discussed in [20], with speedup ratios of 0.05 times, 1.74 times, and 1.30 times, respectively. The GPU-based biconjugate gradient method in [18] presented a speedup of 2.1 times. The GPU-based parallel power flow calculation based on the forward-backward method in [19] gives a speedup ratio of about 1.58 times. These findings show that GPU-based parallel methods can achieve a noticeable acceleration effect in power flow calculations compared with sequential execution on a single-core CPU.
Although several papers have been published on GPU-based parallel power flow methods, their main focus on dense matrices to leverage the parallel features of GPUs means they can fail to maximize GPU capability. Specifically, based on the CUDA platform, Wang et al. proposed a GPU-based power flow solver with the continuous Newton's method [2]. The method in [2] not only needs a great amount of space to store zero elements in the admittance and Jacobian matrices, but also consumes considerable computational resources calculating many potential nonzero fill-ins in a non-coalescing way while inverting the Jacobian matrix. This can easily degrade performance and lead to out-of-memory errors, since the size of the Jacobian matrix grows rapidly with the number of buses. In addition, there are a few GPU-based sparse power flow methods, such as [33]-[36] and [37]. References [33]-[36] and [37] propose GPU-based parallel sparse lower-upper (LU) factorization methods that concentrate on accelerating the solution of the linear equations of the power flow, while other operations that could be optimized, such as calculating the derivatives with respect to the magnitudes and angles of bus voltages and creating a fixed Jacobian matrix in a vectorized, parallelized manner, are ignored in [33]-[36]. In a word, these approaches concentrate only on optimizing sparse matrix factorization to solve the power flow equations and neglect accelerating the remaining parts of the power flow calculation. Although the method in [37] considers the overall performance of the power flow calculation, its dense column-major storage can fail to fully utilize the sparsity of the Jacobian matrix in power systems.
To address these challenges, this paper proposes a GPU-based method with the CRS format to save memory, relieving the burden on GPU and CPU and enhancing scalability to large-scale power systems. Instead of updating the Jacobian matrix repeatedly as in the NR method, a fixed Jacobian matrix is introduced to calculate the unknown state vector, i.e., the magnitudes and angles of the bus voltages, which also reduces the cost of communication between GPU and CPU. Furthermore, the fixed Jacobian matrix is created in a vectorized, parallelized manner that exploits the coalescing feature of the GPU platform. It can be clearly seen that the time to build the fixed Jacobian matrix stays approximately constant even for large-scale power systems, improving the performance of the whole power flow calculation; this will be explained in Section IV. The previous work in [2] needs to explicitly compute a sparse Jacobian matrix inverse; in this paper, this time-consuming process is converted into solving linear equations to avoid the shortcomings of directly inverting a large-scale sparse matrix. To further reduce unnecessary memory consumption and computational overhead, some submatrices of the fixed Jacobian matrix are omitted following the fast-decoupled power flow (FDPF) method, since large-scale power systems always have relatively few connections. The GPU-based method in this paper is realized with GPU libraries such as CuPy. In addition, to validate the GPU-based method, corresponding CPU programs are designed. Instead of redesigning new linear equation solvers, the Eigen library's QR and LU solvers are used to solve the linear equations and compare with the GPU-based solver. This will be further elaborated in Section V.
The rest of this paper is organized as follows: Section II briefly reviews the GPU architecture and the CRS format to introduce the basic concepts of GPUs and sparse matrix storage. In Section III, the GPU-based sparse modified Newton's method is proposed based on the Runge-Kutta algorithm. Detailed implementation on the GPU is explained in Section IV. Results and discussion are given in Section V, and conclusions are presented in the last section.

II. GPU ARCHITECTURE AND CRS FORMAT
In this section, the GPU architecture, the CUDA programming model, and the CRS format are briefly reviewed.

A. GPU ARCHITECTURE
The design philosophy of the GPU architecture is based on a throughput-oriented concept, meaning a GPU can invoke thousands of threads to execute simultaneously. CUDA is a suite of technologies that enables programming on NVIDIA GPUs [34]. An array of streaming multiprocessors (SMs) is the critical hardware in a CUDA-capable GPU. Each SM schedules threads and executes all threads in a warp following the Single Instruction, Multiple Data (SIMD) model. Fig. 1 depicts a typical schematic of an SM consisting of many streaming processors (SPs), each containing an arithmetic logic unit (ALU) that supports integer and floating-point operations [2]. The register file in each SM has very short access latency and drastically higher access bandwidth compared with global memory. Threads lay the foundation of a parallel program, and they are organized into blocks. Furthermore, blocks of threads are grouped into grids, which lets programmers exploit the capabilities of GPU parallelism. CUDA also assumes that the host (CPU) and the device (GPU) maintain their own separate memory spaces, referred to as host memory and device memory, respectively [33]. The program flow is controlled by the CPU, and computationally intensive tasks are offloaded to the GPU by invoking kernels [37]. Specifically, each time a kernel is invoked, thousands of threads in the grid are divided into multiple thread blocks and mapped onto SMs as warps. All threads of a warp execute instructions in a lock-step manner in accordance with the Single Program, Multiple Data (SPMD) concept [38]. Based on these parallel features of NVIDIA GPUs, the power flow calculation can be performed in a vectorized, parallelized manner with CuPy.

B. CRS FORMAT
In practice, each bus in a power system has only a few links [39]. Thus, it is a good choice to store the admittance and Jacobian matrices in CRS format due to their high sparsity. Specifically, the CRS format stores the nonzero elements of a matrix in contiguous locations [40]. Unlike dense storage, the CRS format stores only three vectors: the nonzero values, the column indices, and the row start locations. Assume A is a general real square sparse matrix of size n × n with nnz nonzero entries. The first vector, val, holds the nonzero values of A as a floating-point array; the second vector, col_ind, contains the column indices of those entries as integers; and the last vector, row_ptr, contains the location in val where each row of A starts. Hence the matrix A is represented in triplet form:

A = (val, col_ind, row_ptr)   (1)

From (1), it is clear that the storage requirement is lower than with the dense matrix format: instead of storing n^2 elements, only 2·nnz + n + 1 values need to be stored in sparse format [41].
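As a concrete illustration of the triplet form in (1), the sketch below builds the three CRS vectors for a small matrix with SciPy (used here as a CPU stand-in for the CuPy sparse module; the matrix values are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small matrix stored densely, then in CRS triplet form (SciPy calls it CSR).
A = np.array([[4.0, 0.0, 9.0, 0.0],
              [0.0, 7.0, 0.0, 0.0],
              [0.0, 0.0, 5.0, 2.0],
              [3.0, 0.0, 0.0, 1.0]])
S = csr_matrix(A)

# The three CRS vectors of (1):
vals = S.data        # nonzero values, row by row
cols = S.indices     # column index of each nonzero
ptrs = S.indptr      # where each row starts in vals (length n + 1)

n, nnz = A.shape[0], S.nnz
dense_count = n * n              # n^2 stored values in dense format
sparse_count = 2 * nnz + n + 1   # 2*nnz + n + 1 stored values in CRS
# For this tiny matrix CRS is not yet smaller; the savings appear once
# nnz grows only linearly while n^2 grows quadratically.
```

For a power system matrix with a handful of nonzeros per row, 2·nnz + n + 1 grows linearly in n, which is where the storage advantage of (1) comes from.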

III. GPU-BASED METHOD INTRODUCTION
In this section, the NR method is first reviewed. Then, the GPU-based modification of the NR method is introduced.

A. REVIEW OF THE NR METHOD
This subsection briefly presents an overview of the NR method and derives it from the mathematical power flow model. For brevity, all notation in this paper is summarized in Table I. The NR method is an iterative algorithm based on linearizing the power flow equations to find the unknown state vector of voltage angles and magnitudes. First, the network model of the power system must be created. Supposing a power system with n buses, the injected complex power S_i at bus i can be defined as [31]:

S_i = V_i ∑_{j=1}^{n} Y_ij^* V_j^*   (2)

The admittance Y_ij between bus i and bus j can be further expressed by the conductance G_ij and the susceptance B_ij:

Y_ij = G_ij + jB_ij   (3)

From (2) and (3), the injected complex power can be converted to rectangular form, giving the active and reactive power injections:

P_i = V_i ∑_{j=1}^{n} V_j (G_ij cos θ_ij + B_ij sin θ_ij)   (4)
Q_i = V_i ∑_{j=1}^{n} V_j (G_ij sin θ_ij − B_ij cos θ_ij)   (5)

where θ_ij = θ_i − θ_j is the difference of the voltage angles between buses i and j. Since (4) and (5) are a set of nonlinear equations, a Taylor series expansion is applied to convert them into a linear system, which can be expressed in matrix form as:

[ΔP; ΔQ] = J [Δθ; ΔV]   (6)

where the Jacobian matrix J consists of the partial derivatives of P and Q with respect to the voltage angles θ and magnitudes V [1].
For the linear system in (6), the NR method evaluates the injected power at each bus of the power system and updates the state vector in each iteration. Equation (6) is solved iteratively until the mismatch satisfies the stopping criterion.
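As a minimal sketch of the iteration just described, the toy system below (a hypothetical two-equation F(x) = 0 standing in for the mismatch equations (4)-(5), not a real power network) is solved by rebuilding and factorizing the Jacobian at every step, which is exactly the per-iteration cost the modified method of the next subsection avoids:

```python
import numpy as np

# Classic NR on a small nonlinear system F(x) = 0: the Jacobian J(x)
# is rebuilt and solved against at every iteration.
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0,   # hypothetical mismatches
                     x[0] - x[1]])

def J(x):
    return np.array([[2*x[0], 2*x[1]],
                     [1.0, -1.0]])

x = np.array([1.0, 0.5])                  # flat start
for _ in range(20):
    dx = np.linalg.solve(J(x), -F(x))     # J(x_k) Δx = -F(x_k)
    x = x + dx
    if np.linalg.norm(F(x)) < 1e-8:       # stopping criterion on the mismatch
        break
# converges to x0 = x1 = sqrt(2)
```

The quadratic convergence comes precisely from re-evaluating J(x) at every step; freezing it, as done later, trades convergence rate for a much cheaper iteration.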

B. MODIFICATION OF NR METHOD
Following the above derivation, the nonlinear power flow equations can also be written compactly as:

F(x) = 0   (8)

where x is the state vector of voltage angles and magnitudes. Expanding (8) in a first-order Taylor series around a point, as in the NR method above, equation (8) can be turned into a set of iterative formulas:

x_{k+1} = x_k − J(x_k)^{-1} F(x_k)   (10)

Now suppose a set of autonomous ordinary differential equations:

ẋ = f(x)   (11)

where ẋ = dx/dh. Equation (11) can be solved for x by integration:

x(h_{k+1}) = x(h_k) + ∫_{h_k}^{h_{k+1}} f(x) dh   (13)

According to Euler's method, the integral in (13) can be approximated as:

∫_{h_k}^{h_{k+1}} f(x) dh ≈ Δh·f(x_k)   (14)

where Δh = h_{k+1} − h_k. Substituting (14) into (13) yields:

x_{k+1} = x_k + Δh·f(x_k)   (16)

Comparing (10) with (16), the NR iteration corresponds to the Euler step with

f(x_k) = −J(x_k)^{-1} F(x_k)   (17)

if the step size Δh equals one.
Because ẋ = f(x), the nonlinear equations in (8) can eventually be converted to:

ẋ = −J(x)^{-1} F(x)   (18)

Equation (18) is usually considered the continuous Newton's method for power flow [2]. In fact, when the derivative of the state vector approaches zero, F(x) = 0 has a solution x as long as J(x) is not singular. Thus, the Jacobian matrix in the modified GPU-based method does not have to be updated at each iteration. To reduce the computational overhead, this work uses a fixed Jacobian matrix J0, calculated once when h equals 0 at the beginning of the program. Thus, equation (18) can be further written as:

ẋ = −J0^{-1} F(x)   (19)

Since transmission lines have very small resistance, the angle of the admittance between buses is approximately ±90°, and adjacent buses tend to have a small phase angle difference. Thus, the submatrices J12 and J21 can be treated as zero matrices to reduce computational overhead and save memory, since zero elements are neither stored nor involved in calculation in the sparse format. J0 in (19) can then be simplified as:

J0 = [J11  0; 0  J22]   (20)

Although the above modifications improve performance, explicitly inverting the Jacobian matrix J0 in the process of (19) is still computationally expensive, because inverting J0 requires calculating many potential fill-ins and allocating much extra memory to store them. Inverting the Jacobian matrix J0 mainly consists of three steps:
1) Allocate the extra memory for the inverse matrix.
2) Perform the LU decomposition and fill the nonzero values into their positions.
3) Calculate the inverse matrix from the decomposition in 2).
Steps 1) to 3) traverse each entry of the irregular decomposition matrices term by term, which usually introduces many conditional statements, such as if-statements searching for the exact location of each entry.

Despite the great computational capability of the GPU, this approach not only causes divergence within a warp but also imposes a huge scheduling burden on the GPU, especially when the Jacobian matrix is large. To mitigate the problem, the Jacobian matrix inversion is converted into solving a set of linear equations. Equation (19) can be viewed as:

J0·ẋ = −F(x)   (21)

Weighing precision against speed, the second-order Runge-Kutta formula is applied to (21); the iterative formulas are defined as:

k1 = −J0^{-1} F(x_k)
k2 = −J0^{-1} F(x_k + Δh·k1)
x_{k+1} = x_k + (Δh/2)(k1 + k2)   (22)

References [43], [44] suggest that the local truncation error and the global accumulated error of a p-th order method are O(Δh^{p+1}) and O(Δh^p), respectively. From the viewpoint of speed and precision, the step size Δh is chosen between zero and one, because the error becomes smaller as the order increases. Additionally, the step size Δh is also called the fixed time step [2]; its value not only determines the iteration count and calculation time, but also influences the convergence of the power flow calculation. In general, the iteration count and calculation time decrease as the fixed time step increases. However, the optimal Δh for one power system may cause divergence on another. Weighing calculation time against convergence, a compromise value of Δh is chosen to satisfy all the datasets in this paper; the time step is chosen in a way similar to the factor in reference [2]. Besides, |ΔS| < ε serves as the stopping criterion in the GPU-based method. The method in this paper may need a few more iterations, because a fixed Jacobian matrix does not accelerate convergence during the power flow calculation, and it is a second-order numerical method, which increases the computational burden per iteration. But this higher-order method is a good fit for the GPU, since the fixed Jacobian matrix reduces the cost of communication between host and device.
In particular, when solving (21), steps 1) to 3) can be omitted, and J0 only needs to be transmitted to the device once at the beginning of the program. Therefore, the method can fully utilize the high computing and parallel capability of the GPU to compensate for the overhead of the higher-order computation. Because J0 is constant, the complexity of the higher-order method is reduced: unlike the traditional NR method, J0 does not have to be updated at each iteration. Furthermore, although J0 is involved in calculating the state vector at each iteration, the cost of that computation is less than that of inverting J0 directly as in reference [2].
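A minimal sketch of the fixed-Jacobian iteration of this section, assuming the update takes the common second-order Runge-Kutta (Heun) form with fixed time step Δh; the two-equation system and the step size are illustrative, not from the paper:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Fixed-Jacobian modified Newton iteration: J0 is built and factorized
# once, then reused in the RK2 step x_{k+1} = x_k + (Δh/2)(k1 + k2) with
# k1 = -J0^{-1}F(x_k), k2 = -J0^{-1}F(x_k + Δh·k1).
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0,   # hypothetical mismatches
                     x[0] - x[1]])

x = np.array([1.0, 0.5])
J0 = np.array([[2*x[0], 2*x[1]],
               [1.0, -1.0]])        # fixed Jacobian, evaluated once at x0
lu = lu_factor(J0)                  # factorize once, reuse every iteration
dh = 0.8                            # fixed time step, chosen in (0, 1]
for _ in range(200):
    k1 = lu_solve(lu, -F(x))
    k2 = lu_solve(lu, -F(x + dh * k1))
    x = x + 0.5 * dh * (k1 + k2)
    if np.linalg.norm(F(x)) < 1e-8:  # stopping criterion on the mismatch
        break
```

Convergence is linear rather than quadratic, so more iterations are needed than with classic NR, but each iteration is only two triangular solves against a factorization computed once.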

IV. IMPLEMENTATION ON THE GPU
In this section, the heterogeneous architecture and the flow chart of the GPU-based method are given. Then, the creation of a fixed Jacobian matrix with the vectorization and parallelization technique is discussed in detail.

A. PROGRAMMING ARCHITECTURE
The architecture is heterogeneous, consisting of CPU and GPU parts. The computationally intensive tasks are offloaded onto the GPU, while the remaining parts execute sequentially on the CPU. Communication between GPU and CPU goes over the PCI Express (PCIe) bus, so the speed of data transmission between them depends heavily on the PCIe bandwidth. Although this transmission has some negative influence on performance, the extremely high floating-point processing capability and huge memory bandwidth of the GPU compensate for it. In addition, time-consuming data transmission occurs only twice in this architecture: first when loading the original datasets, and second when receiving the results of the power flow calculation at the end of the program. This architecture not only reduces the computational burden on the CPU, but also avoids the cost of repeatedly invoking GPU kernel functions to transmit data back and forth between host and device. With these features, the architecture maintains the balance between CPU and GPU and maximizes the acceleration effect, as depicted in Fig. 3. The solver process for (21) consists of several steps, as follows:
1) Load the datasets and initialize them.
2) Calculate the partial derivatives of the power injections.
3) Calculate a fixed Jacobian matrix in a vectorization manner.
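The linear-solve part of the steps above can be sketched as follows, with SciPy's sparse LU as a CPU stand-in for the GPU solver: the fixed Jacobian is factorized once and the factorization is reused for every right-hand side, so no explicit inverse is ever formed (matrix and mismatch values are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import splu

# A small stand-in for the fixed sparse Jacobian J0 of (19)-(20).
J0 = csr_matrix(np.array([[10.0, -2.0, 0.0, 0.0],
                          [-2.0, 8.0, -3.0, 0.0],
                          [0.0, -3.0, 9.0, -1.0],
                          [0.0, 0.0, -1.0, 6.0]]))
solver = splu(J0.tocsc())             # factorize once, at h = 0

rhs = np.array([1.0, 0.0, 0.0, 2.0])  # a sample mismatch vector -F(x_k)
dx = solver.solve(rhs)                # reused every iteration; no inverse
assert np.allclose(J0 @ dx, rhs)      # dx solves J0·dx = rhs
```

In the GPU version this role would be played by the CuPy/cuSOLVER sparse routines; the structural point, factorize once and solve repeatedly, is the same.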

B. THE CALCULATION OF THE JACOBIAN MATRIX
This subsection explains the process of forming a fixed Jacobian matrix in a vectorized manner. A vectorized program can execute multiple operations concurrently via a single instruction, whereas a scalar one can only operate on pairs of operands [50]. Forming a fixed Jacobian matrix consists of two parts: calculating the partial derivatives with respect to the magnitudes and angles of the bus voltages, and assembling the Jacobian matrix. Specifically, the partial derivatives needed for J11 and J22 in (20) can be obtained from a group of vectorized formulas:

I_bus = Y·V   (27)
∂S/∂θ = j·diag(V)·(diag(I_bus) − Y·diag(V))^*   (28)
∂S/∂V_m = diag(V)·(Y·diag(V/|V|))^* + diag(I_bus)^*·diag(V/|V|)   (29)

where Y is the sparse admittance matrix and j is the unit imaginary number, √−1. In the process of (27), (28), and (29), instead of accessing data in a random manner, all threads in a warp execute the same instruction at any point in time, because the entries of the sparse matrices are stored in consecutive locations.
This technique is usually referred to as coalescing, which is leveraged by the GPU and integrated into the CuPy library. The bold terms in Algorithm 1 denote arithmetic operations performed on the GPU in a vectorized, parallel manner.
Specifically, Fig. 4 depicts the mechanism of implementing the programs in a vectorized manner. Supposing a 4×4 sparse admittance matrix and four threads in a warp, the entries of the matrix occupy consecutive locations in memory and are stored in row-major flat mode. During the process, dynamic random-access memory (DRAM) divides the sixteen sequential locations into four burst sections. All threads in a warp cover all entries while only one DRAM request is executed per burst. Thus, the access is fully coalesced, and no threads are left idle. In practice, data from the power system can be organized in this coalesced way, and the arithmetic operations in Algorithm 1 can be implemented with the benefit of this technique.
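A sketch of the vectorized derivative calculation in (27)-(29), written with SciPy sparse as a CPU stand-in for CuPy and using the standard formulas popularized by MATPOWER's dSbus_dV; the two-bus admittance matrix and voltages are illustrative values:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Vectorized partial derivatives of the bus injections S = diag(V)·(Y V)^*
# with respect to voltage angle and magnitude, per (27)-(29).
Y = csr_matrix(np.array([[ 10 - 30j, -10 + 30j],
                         [-10 + 30j,  10 - 30j]]))   # illustrative 2-bus Y
V = np.array([1.0 + 0.0j, 0.98 - 0.02j])             # illustrative voltages

Ibus = Y @ V                       # (27): bus current injections
diagV = diags(V)
diagI = diags(Ibus)
diagVnorm = diags(V / np.abs(V))   # unit-magnitude voltage phasors

# (28): dS/dθ, and (29): dS/dVm — every step is a sparse matrix product,
# so no per-element loops and no warp divergence.
dS_dVa = 1j * diagV @ (diagI - Y @ diagV).conj()
dS_dVm = diagV @ (Y @ diagVnorm).conj() + diagI.conj() @ diagVnorm
```

Each line is a whole-matrix operation, which is exactly what maps onto the coalesced, lock-step execution described above.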
The creation of the fixed Jacobian is based on the voltage partial derivative matrices ∂S/∂θ and ∂S/∂V_m. The computational formulas can be written as:

J11 = Re( ∂S/∂θ [pvpq, pvpq] )   (30)
J22 = Im( ∂S/∂V_m [pq, pq] )   (31)

where pvpq and pq are the index vectors of the PV/PQ buses and the PQ buses, respectively. The same vectorized idea also applies to (30) and (31). However, the bus indices contained in pvpq and pq are not consecutive, which causes divergence in a warp and decreases performance because of the irregular storage. To relieve this, extra memory is allocated to store the indexed entries, organized as a consecutive array. Thus, instead of traversing all elements term by term with for-loops, the elements in memory are managed in a coalescing manner as a set of arrays. Although this approach costs some extra space, the cost is well worth the increase in speed. Compared with reference [41], the computational complexity can be reduced from O(n^2) to O(n), for Algorithm 1 and Algorithm 2 in this paper replace the traversal with vectorized operations. Although the approach in [41] uses the CRS format to reduce the number of iterations, it still needs to traverse all nonzero elements with two for-loops while forming the Jacobian matrix, which hurts performance as power systems grow rapidly.
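A sketch of the index-vector assembly in (30)-(31), assuming a hypothetical 3-bus layout (bus 0 slack, bus 1 PV, bus 2 PQ) and random derivative matrices; sparse fancy indexing plays the role of the consecutive index arrays described above:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack

n = 3
pvpq = np.array([1, 2])   # PV and PQ buses (bus 0 is the slack)
pq = np.array([2])        # PQ buses only

# Random complex stand-ins for the derivative matrices of (28)-(29).
rng = np.random.default_rng(0)
dS_dVa = csr_matrix(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
dS_dVm = csr_matrix(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

# Submatrix extraction via index arrays instead of nested for-loops.
J11 = dS_dVa[pvpq, :][:, pvpq].real   # dP/dθ
J12 = dS_dVm[pvpq, :][:, pq].real     # dP/dVm
J21 = dS_dVa[pq, :][:, pvpq].imag     # dQ/dθ
J22 = dS_dVm[pq, :][:, pq].imag       # dQ/dVm
J = vstack([hstack([J11, J12]),
            hstack([J21, J22])]).tocsr()
```

In the decoupled variant of (20), only J11 and J22 would actually be formed; J12 and J21 are kept here to show the full block layout.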

V. PERFORMANCE RESULTS
In this section, the computing experiment setup and the test cases are introduced first. Then the storage comparison between the sparse and dense formats is discussed. Next, the characteristic curve of forming a fixed Jacobian matrix is presented, followed by the comparison of overall performance and speedup. The stability of the CPU-based QR and CPU-based LU methods is also analyzed. Finally, distribution networks are tested with the GPU-based method in this paper.

A. EXPERIMENT SETUP
The computing experiments in this paper are carried out on a workstation equipped with an NVIDIA RTX 4000 GPU with 8 GB of DRAM. The workstation has an Intel(R) Core i9 CPU and 16 GB of RAM. The version of the CuPy library is 10.2, and the CUDA driver version is 10.2 accordingly. The versions of MATPOWER and Eigen are 7.1 and 3.4, respectively. Parts of the MATPOWER package have been changed to generate a fixed Jacobian matrix instead of a dynamically updated one. To balance precision and speed, the stopping criterion is set to 1e-4 for all power flow cases. The specification of requirements is shown in Table II. The test datasets are all from MATPOWER and can be categorized into two groups, as in Tables III and IV.

B. ANALYSIS OF STORAGE
For a power system, most of the memory is used to store the entries of the admittance matrix and the fixed Jacobian matrix. In this subsection, the memory usage of the dense and sparse matrix formats is compared. Since all data in the program are single-precision floating-point values, each element in a matrix occupies four bytes. Memory is regarded as a flat model, so for the analysis the admittance matrix and the Jacobian matrix are combined into one matrix for simplicity, and their memory consumption is accumulated at each power system size. Table V lists the specific memory consumption of the dense and sparse formats; consumption grows very quickly in the dense format. Besides, Fig. 5 explicitly shows the sparsity of the combined matrix as the number of buses increases, suggesting that sparsity is high for large-scale power systems. In addition, Fig. 6 depicts the trend of memory usage: memory consumption grows almost quadratically in the dense format, and most of the stored values are zeros due to the structure of power systems; these zeros contribute nothing to the computation while wasting a large amount of memory.
Compared with the dense format, memory consumption in the sparse format grows almost linearly, and the speed of calculation increases as the storage burden decreases. The space complexity is thus reduced from O(n^2) to O(n), which provides the opportunity to handle the large amounts of data in larger power systems.
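The linear-versus-quadratic growth can be checked with a short calculation; the assumed average of five nonzeros per row is an illustrative figure, not a number from the paper:

```python
# Dense storage grows as n^2 while CRS grows linearly in n when the
# average number of nonzeros per row stays roughly constant.
bytes_per_value = 4          # single-precision float
nnz_per_row = 5              # assumed average connectivity (illustrative)

usage = {}
for n in (118, 1354, 9241, 13659):
    dense = n * n * bytes_per_value                            # O(n^2) bytes
    sparse = (2 * nnz_per_row * n + n + 1) * bytes_per_value   # O(n) bytes
    usage[n] = (dense, sparse)

# e.g. at 13659 buses: ~712 MiB dense vs ~0.6 MiB in CRS
```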

C. COMPARISON BETWEEN THE CLASSIC AND MODIFIED METHODS
This subsection presents the difference between the classic method and the GPU-based method in this paper. For brevity, the IEEE 4-bus system is taken as an example. The classic method on the CPU is faster than the modified method on the GPU because the power system is so small that the GPU spends most of its time on data transmission instead of exploiting its high computing capability. The calculation times are 0.6 ms and 246.8 ms on the CPU and the GPU, respectively. Table VI and Table VII list the calculation results and the norm of the power mismatch on the different platforms at each iteration. Moreover, Fig. 7, Fig. 8, and Fig. 9 present the voltage magnitude and phase angle profiles produced by the classic NR, the FD, and the GPU-based methods on the IEEE 14-bus system, respectively. All the Jacobian matrices are stored in CRS format during the power flow calculation. These profiles also support the point that the classic NR method converges faster than the other methods when the dataset is small enough. From Table VI and Table VII, it can be seen that the GPU-based method works as precisely as the classic method and, according to the norm values, yields more exact results.
FIGURE 12. An illustration of the GPU-based method.
Specifically, the local error of the NR method is O(h^2) because the true solution of (8) is approximated by a finite number of terms of the Taylor series:

x(h_{k+1}) = x(h_k) + h·ẋ(h_k) + O(h^2)   (32)

where h = h_{k+1} − h_k and ẋ is the derivative of the state trajectory. For the GPU-based method, the local error is O(h^3) because (8) is evaluated with the second-order Runge-Kutta method:

x(h_{k+1}) = x(h_k) + (h/2)(k1 + k2) + O(h^3)   (33)

Fig. 11 and Fig. 12 illustrate why the GPU-based method in this paper can obtain more exact values than the classic NR method. Since the NR method uses the slope at the point x_k to evaluate x_{k+1}, the value of x_{k+1} will overshoot the true value if f(x) is concave, as depicted in Fig. 11. The Runge-Kutta method instead uses the average of the slopes at x_k and x_{k+1} to evaluate x_{k+1}, as described in Fig. 12. Furthermore, (32) and (33) show that the per-iteration computational cost of the classic NR method is lower than that of the GPU-based method, so the classic method can perform better when the dataset is small. Although the classic NR method has an advantage over the GPU-based method when the power system is small, the GPU-based method provides both a performance improvement and exact results once the power system is large enough to fully exploit the parallel potential of the GPU.
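The O(h^2) versus O(h^3) local errors in (32) and (33) can be observed numerically on the classical test equation ẋ = x (an illustrative ODE, not a power flow case): halving the step should cut the one-step Euler error by about 4x and the one-step RK2 error by about 8x:

```python
import numpy as np

# One-step (local) error of Euler vs. second-order Runge-Kutta on
# x' = x, x(0) = 1, whose exact one-step solution is exp(h).
def local_errors(h):
    exact = np.exp(h)
    euler = 1.0 + h                  # Euler: x1 = x0 + h·f(x0)
    k1 = 1.0                         # slope at the start point
    k2 = 1.0 + h * k1                # slope at the Euler-predicted endpoint
    rk2 = 1.0 + 0.5 * h * (k1 + k2)  # Heun average of the two slopes
    return abs(exact - euler), abs(exact - rk2)

e1, r1 = local_errors(0.1)
e2, r2 = local_errors(0.05)
# e1/e2 is close to 2^2 = 4 (Euler), r1/r2 close to 2^3 = 8 (RK2)
```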

D. COMPARISON OF GENERATING THE JACOBIAN MATRIX
This subsection presents the results of the GPU-based method compared with MATPOWER and the CPU variants. Fig. 13 and Fig. 14 show the time required to create a fixed Jacobian matrix. For case 9241 and case 13659, the GPU-based method performs better than the other methods. Most importantly, the time of the method in this paper is dominated by data transmission rather than calculation, so the GPU-based method stays almost constant even when the power system is sufficiently large. Forming the fixed Jacobian matrix quickly lays the foundation for accelerating the whole power flow calculation.
On the contrary, MATPOWER and the CPU-based methods perform better than the GPU on small-scale power systems, since they involve no communication between host and device. Once the power system becomes large enough, however, the computational overhead degrades their performance; the CPU-based method in Fig. 14 shows a continuously and rapidly increasing trend as the power system grows. These facts demonstrate that the GPU-based method enhances performance when power systems are large enough to fully exploit the parallel potential of the GPU. Furthermore, as power systems become larger and more complicated, the GPU-based method can not only maintain linear computational complexity, but also provide acceleration and scalability for the large-scale systems of the future.

E. POWER FLOW EQUATIONS SOLVER
In this subsection, the comparison results of the different solvers on different hardware platforms are presented. Two CPU programs are provided in this work, because the CPU-based LU solver is not stable.
Weighing speed against stability, the CPU-based LU solver is replaced by the CPU-based QR solver, and the stability is analyzed below. Table VIII and Table IX list all implementation results, including the comparison of calculation time and speedup on the GPU and CPU platforms. From the results, the GPU-based method starts to outperform for large systems at around 7000 buses and has an obvious advantage over the other methods when the system becomes sufficiently large. The GPU-based method also performs better than the two other CPU-based methods. However, Table VIII and Table IX show that the stability of the CPU-based LU method is the worst among them: for case 2848 and case 6515, the CPU-based LU method fails to converge. Besides, the Cholesky factorization requires the matrix to be positive definite, which is stricter than the LU factorization. That is why the CPU-based QR solver is introduced to replace the CPU-based LU solver. In the LU factorization, elimination steps are usually about twice as cheap, in terms of operations, as QR steps [45], and the LU factorization update is based on matrix-matrix multiplications with lower triangular matrices [46]. Specifically, the LU factorization can be depicted as:

A = L·U   (34)

where L is a unit lower triangular matrix (its diagonal entries equal 1) and U is an upper triangular matrix. The solution of A·x = b then proceeds by forward substitution on L·y = b followed by backward substitution on U·x = y. The pivots of the elimination are the diagonal elements produced during the LU decomposition, which must remain nonzero. Therefore, the CPU-based LU solver may fail to converge when a pivot element approaches zero, because the computer effectively treats such values as unknowns. On the contrary, the QR factorization is always stable, but it requires almost twice as many operations and a more complicated update step that is not as parallel as a matrix-matrix product [46], [47].
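The pivot problem discussed above can be reproduced in a few lines: a near-zero leading pivot makes unpivoted elimination blow up, while partial pivoting (SciPy's lu) keeps all multipliers bounded (the matrix is illustrative):

```python
import numpy as np
from scipy.linalg import lu

# A near-zero leading pivot: the classic instability example.
A = np.array([[1e-20, 1.0],
              [1.0, 1.0]])

# Unpivoted elimination: the multiplier 1/1e-20 = 1e20 destroys accuracy,
# since U[1,1] becomes 1 - 1e20, swamping the true value.
m = A[1, 0] / A[0, 0]
U_bad = A.copy()
U_bad[1] -= m * U_bad[0]

# Partial pivoting row-swaps the large element into the pivot position
# (the P in P·A-style factorizations), bounding every multiplier by 1.
P, L, U = lu(A)                  # SciPy returns A = P @ L @ U
assert np.allclose(P @ L @ U, A)
assert abs(L).max() <= 1.0       # multipliers bounded thanks to pivoting
```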
Specifically, the QR factorization decomposes a matrix with linearly independent columns into the product of an orthogonal matrix and an upper triangular matrix [48]. It can be written as:

A = Q \times R   (38)

where A is an m × n (m ≥ n) full column rank matrix, Q is an m × m orthogonal matrix, and R is an m × n upper triangular matrix [48]. For brevity, the Householder transformation is employed to obtain Q and R. Reference [49] suggests that Q and R can be obtained from Householder reflections. The formulas can be written as:

Q = H_1 H_2 \cdots H_n   (39)
R = H_n \cdots H_2 H_1 A   (40)

where the H_i are Householder reflection matrices. The QR factorization avoids the partial pivoting strategy during the transformation process. Although the Householder transformation enhances stability, a set of transformation matrices is generated in the QR steps, which causes a large computational overhead. Fig. 17 shows the performance decrease between case 3120 and case 9241 due to the time-consuming matrix decomposition and matrix-matrix multiplications of high-order transformation matrices in (39) and (40). Considering stability in power systems, the tradeoff strategy is to introduce the CPU-based QR solver as an alternative to the GPU-based LU solver.
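A minimal numpy sketch of the Householder construction in (38)–(40) (an illustration of the textbook algorithm, not the paper's solver): each reflection H_k zeroes the subdiagonal of one column, R accumulates as H_n⋯H_1 A, and Q as H_1⋯H_n.

```python
import numpy as np

def householder_qr(A):
    """QR via a sequence of Householder reflections, as in (38)-(40)."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0] if x[0] != 0 else 1.0)
        if np.linalg.norm(v) < 1e-15:
            continue                          # column already eliminated
        v /= np.linalg.norm(v)
        H_k = np.eye(m)
        H_k[k:, k:] -= 2.0 * np.outer(v, v)   # reflection I - 2vv^T
        R = H_k @ R                           # R = H_n ... H_2 H_1 A
        Q = Q @ H_k                           # Q = H_1 H_2 ... H_n (H_k symmetric)
    return Q, R

A = np.array([[2.0, -1.0],
              [1.0,  3.0],
              [0.0,  1.0]])
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
```

Note how each step forms and multiplies a full m × m transformation matrix; this is exactly the overhead attributed to the QR path above, and why LAPACK instead applies the reflections implicitly.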
Considering the results and analysis above, the GPU-based method maintains high speed and stability even when the power system becomes sufficiently large. It not only delivers an overall performance improvement but also provides an opportunity to improve scalability and compatibility for increasingly complicated power systems in the future.

F. EXTENSION AND TEST OF THE MODIFIED METHOD
This subsection presents the comparison of execution results between the GPU-based method in this paper and the GPU-FDPF method in reference [32]. Furthermore, the GPU-based method is also tested on distribution networks, including the IEEE 33bw, IEEE 85, and IEEE 141 bus systems. Table Ⅹ lists the power flow calculation of these systems with the GPU-based method. Taking the fastest CPU-based power flow calculation package, MATPOWER, as a reference, the speedup of the GPU-based method is greater than that of the GPU-FDPF method in reference [32]. Aside from case 1354, the GPU-based method performs better than GPU-FDPF even though the stop criterion in the GPU-based method is 1e−4. The proposed GPU-based method already gives an advantage of almost 1.5 times over the GPU-FDPF method for a system with over 9,000 buses, and it can be expected to provide an even better performance improvement as power systems continue to grow.
As the above Table Ⅺ shows, the distribution networks perform better than the non-distribution networks. Because distribution networks lack PV nodes, the cost of communication between the CPU and the GPU is reduced. Therefore, the GPU-based method can fully exploit the high computational capability of the GPU and utilize multiple threads to optimize the parallelism of the device.

VI. CONCLUSION
This paper presents a GPU-based sparse modified Newton method and demonstrates its acceleration effect for large-scale power systems, which lays the foundation for large-scale power flow applications. Specifically, the contributions of this work can be summarized as follows: 1) Instead of updating the Jacobian matrix on each iteration, a fixed Jacobian matrix is introduced in a vectorized, parallelized manner to relieve the computational burden. Although the fixed Jacobian matrix has a negative influence on the convergence of the power flow calculation, the parameter ∆h of the Runge-Kutta formula can be adjusted to fix the issue. Besides, although the fixed Jacobian matrix might increase the number of iterations, the cost of communication between host and device is greatly decreased, so the extra iteration overhead can be compensated by the great floating-point processing capability of the GPU.
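The fixed-Jacobian scheme summarized in 1) can be sketched as follows. This is a toy CPU-side illustration with scipy (the paper's implementation runs on the GPU via CuPy); the two-equation system and the step parameter `dh` are hypothetical stand-ins for the mismatch equations and the Runge-Kutta parameter ∆h:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import splu

# Toy nonlinear system standing in for the mismatch equations F(x) = 0;
# its solution is x = (1, 2).
def F(x):
    return np.array([x[0]**2 + x[1] - 3.0,
                     x[0] + x[1]**2 - 5.0])

def jacobian(x):
    return csr_matrix([[2.0 * x[0], 1.0],
                       [1.0, 2.0 * x[1]]])

def modified_newton(x0, dh=1.0, tol=1e-8, max_iter=100):
    """Fixed-Jacobian Newton: factorize J(x0) once and reuse the factors.
    dh stands in for the tunable step parameter from the paper."""
    x = x0.copy()
    lu = splu(jacobian(x0).tocsc())      # one-time sparse LU factorization
    for _ in range(max_iter):
        f = F(x)
        if np.linalg.norm(f, np.inf) < tol:
            break
        x = x - dh * lu.solve(f)         # same factors reused every iteration
    return x

x = modified_newton(np.array([1.0, 1.8]))
print(x, F(x))
```

The factorization cost is paid once; every subsequent iteration is only a pair of triangular solves, which is the trade (more iterations, far cheaper iterations) described above.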
In practice, experiments show that the time for creating a fixed Jacobian in a vectorized, parallelized manner remains almost constant even when the power system is sufficiently large, which improves the efficiency of the whole power flow calculation.
2) Instead of the dense matrix format in [2], the CRS format is used to store the elements of the nodal admittance matrix and the Jacobian matrix, which greatly reduces memory consumption by excluding the storage and computation of zero elements during the calculation; consequently, the computational complexity is reduced from O(n³) to O(n²).
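The CRS (CSR) layout can be seen concretely with scipy (a small hypothetical admittance-style matrix, for illustration only): only the nonzero values are kept, together with their column indices and one row-pointer array.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small admittance-style matrix: most off-diagonal entries are zero.
dense = np.array([[ 4.0, -1.0,  0.0,  0.0],
                  [-1.0,  4.0, -2.0,  0.0],
                  [ 0.0, -2.0,  5.0, -1.0],
                  [ 0.0,  0.0, -1.0,  3.0]])
Y = csr_matrix(dense)

# CRS stores only the nonzeros plus two index arrays, instead of n*n entries.
print("values :", Y.data)     # nonzero entries, row by row
print("columns:", Y.indices)  # column index of each nonzero
print("row ptr:", Y.indptr)   # where each row starts inside `data`
print("stored :", Y.nnz, "of", dense.size, "entries")
```

For a power system Y-bus, whose density falls roughly as 1/n, this is the difference between O(n) and O(n²) storage.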
3) Compared with [2], inverting the sparse Jacobian matrix to update the power mismatch is converted into solving linear equations, since the inversion of a large-scale sparse matrix not only needs extra memory allocated to store submatrices but also computes many potential fill-ins during the inverting process.
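The fill-in argument in 3) can be demonstrated on a tiny example (an illustrative tridiagonal stand-in, not the paper's Jacobian): the inverse of a sparse matrix is generally dense, while a direct sparse solve never materializes it.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

# A tridiagonal stand-in for a sparse Jacobian: its inverse is fully dense,
# so explicit inversion materializes fill-ins that a direct solve never forms.
n = 6
J = csc_matrix(np.diag([4.0] * n)
               + np.diag([-1.0] * (n - 1), 1)
               + np.diag([-1.0] * (n - 1), -1))
b = np.ones(n)

dx = spsolve(J, b)                   # solve J dx = b without forming J^-1
J_inv = np.linalg.inv(J.toarray())   # the approach avoided in this work
print(np.allclose(dx, J_inv @ b))    # same answer either way
print("nonzeros in J:", J.nnz, "vs in J^-1:",
      np.count_nonzero(J_inv.round(10)))
```

Here J holds 16 nonzeros while its inverse holds all 36; at power-system scale the same effect turns a sparse O(nnz) object into a dense O(n²) one.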
The results and analysis in this work demonstrate that the GPU-based method has a great advantage over the CPU-based methods and MATPOWER, particularly when the power system is sufficiently large. Specifically, the GPU-based modified Newton's method achieves speedups of 234.8 times, 2.68 times, and 1.39 times compared with the CPU-based QR solver, the CPU-based LU solver, and MATPOWER, respectively, when the power system exceeds 13,000 buses. It also demonstrates better performance than the GPU-FDPF method in reference [32]. Furthermore, the GPU-based method is also tested on distribution networks, including the IEEE 33bw, IEEE 85, and IEEE 141 bus systems.
In addition, future improvement can focus on the GPU-based linear equation solver, since the creation of the Jacobian already takes near-constant time. Furthermore, based on the stability analysis, a novel GPU-based LU-QR hybrid solver can be designed in future research to improve both performance and stability, since the hybrid method can combine the speed advantage of the LU method with the stability of the QR method.
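One possible shape for such a hybrid, sketched here on the CPU with scipy and numpy purely as a design illustration (a production version would pair a GPU sparse LU with a sparse QR such as SuiteSparseQR, not a dense fallback): attempt the cheap LU path first and fall back to QR only when the factorization breaks down.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def hybrid_solve(A_sparse, b):
    """Try the fast sparse LU first; fall back to QR if LU breaks down.
    Sketch only: the fallback here is a dense QR for simplicity."""
    try:
        return splu(A_sparse).solve(b), "lu"
    except RuntimeError:                 # SuperLU raises when factors fail
        Q, R = np.linalg.qr(A_sparse.toarray())
        return np.linalg.solve(R, Q.T @ b), "qr"

A = csc_matrix(np.array([[2.0, 1.0],
                         [1.0, 3.0]]))
b = np.array([3.0, 4.0])
x, path = hybrid_solve(A, b)
print(path, x)
```

On well-conditioned systems the LU path is taken and the QR cost is never paid, which is exactly the speed/stability combination the hybrid aims for.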