Multi-Core DSP Based Parallel Architecture for FMCW SAR Real-Time Imaging

This paper presents an efficient parallel processing architecture based on a multi-core Digital Signal Processor (DSP) to improve the real-time imaging capability of Frequency Modulated Continuous Wave Synthetic Aperture Radar (FMCW SAR). Under the proposed architecture, the imaging algorithm is modularized, and every module is realized with the same processing structure. Within each module, the cores process data in parallel, and each core's data transmission runs concurrently with its data processing, so the processing time for SAR imaging is reduced significantly. In particular, the time of the corner turning operation, which is normally very time-consuming, is hidden behind computation in computationally intensive cases. The proposed parallel architecture is applied to a compact Ku-band FMCW SAR prototype to produce real-time imagery with a resolution of 34 cm × 51 cm (range × azimuth).


Introduction
Frequency Modulated Continuous Wave Synthetic Aperture Radar (FMCW SAR) has advanced significantly in the last decade [1]-[2]. Benefiting from low mass, low power consumption and low cost, it is widely applied in different remote sensing fields; however, one of its main challenges is the realization of real-time imaging. In the past decades, many SAR imaging algorithms aiming at high-quality imagery have been proposed. These algorithms are classified as frequency-domain algorithms and time-domain algorithms. Because of their heavy computational load, the latter are not a good choice for real-time processing. The popular frequency-domain SAR algorithms include the Range-Doppler Algorithm (RDA), Chirp-Scaling Algorithm (CSA), Nonlinear Chirp-Scaling Algorithm (NCSA), Range Migration Algorithm (RMA) and Frequency-Scaling Algorithm (FSA) [1]-[4]. In this paper, the extended FSA [2]-[3], which also contains second-order space-variant motion compensation and autofocus processing, is selected to verify the feasibility of the proposed parallel processing strategy.
Up to now, simplified algorithms and multi-processor boards have been used to realize real-time processing in SAR systems. However, a simplified algorithm drastically reduces the image quality, and multiple processors occupy too much of the valuable payload and space of an FMCW SAR. For resolution comparison, the work in [5]-[6] demonstrates a recently developed real-time imaging FMCW SAR system (named MIRANDA35) from the Fraunhofer Institute for High Frequency Physics and Radar Techniques (FHR) in 2014. To reduce computation, MIRANDA35 uses the RD algorithm with a lookup table, leading to a real-time resolution of 2 m. For time-consumption comparison, the work in [7] evaluates the capability of a multi-core Digital Signal Processor (DSP) for SAR signal processing, but the corner turning there is still an independent process that cannot be fully optimized.
To realize a miniature, lightweight, low-power and high-resolution real-time FMCW SAR system, only a single eight-core DSP, the TMS320C6678, is used in this paper. Owing to their compact structure and high performance, multi-core DSPs are expected to be widely used in lightweight real-time SAR systems in the near future. In this paper, we propose an efficient multi-core DSP based parallel processing structure that is well suited to high-resolution real-time imaging of FMCW SAR. The proposed architecture can be applied to the realization of all kinds of frequency-domain SAR imaging algorithms. With this processing structure, the extended FSA is realized in a modular manner, and each module of the algorithm uses the same architecture. In each module, the cores process data in parallel, and each core's data transmission runs concurrently with its data processing, so the processing time and the corner turning time are considerably reduced. In computationally intensive cases, not only is the data processed in parallel, but the corner turning time is also completely hidden behind computation. Furthermore, to validate the real-time imaging performance, several airborne experiments have been successfully performed. At the end of this paper, a real-time processed SAR image with 34 cm × 51 cm (range × azimuth) resolution is presented and analyzed.
The remainder of this paper is organized as follows. Section 2 reviews the imaging process of FMCW SAR and addresses the modularization of the extended FSA. In Section 3, an efficient architecture for real-time processing is proposed, and the realization of the extended FSA is introduced in detail. Section 4 shows the results of real-time processing and analyzes the quality of the imagery. Finally, Section 5 concludes the paper.

Imaging Algorithm and its Modularizing

General FMCW SAR Imaging Model
As previously mentioned, FMCW SAR is compact and lightweight. Consequently, airborne SAR imaging, which serves to evaluate the performance of the SAR system, can be performed easily [11]-[12]. Without loss of generality, the imaging geometry of the airborne test is shown in Fig. 1. In Fig. 1, the solid line and the dashed curve stand for the ideal and actual trajectories, respectively. The aircraft moves along the track, with along-track position $V t_m$, where $V$ is the velocity along the motion trail and $t_m$ stands for the slow time. Furthermore, $\beta$ and $H$ denote the pitch angle and the height of the platform, respectively. $R_B$ and $R(t_m, R_B)$ stand for the closest range and the instantaneous range of the point scatterer $P_n$. The transmitted linear frequency modulated signal is assumed to take the form of (1) [1]-[2], where $\hat{t}$ is the fast time (also called the range time variable) and $\mathrm{rect}(\hat{t}/T_p)$ is the rectangular function with duration $T_p$, the signal time width. $f_c$ and $\gamma$ stand for the carrier frequency and the chirp rate of the transmitted signal, respectively. The variable $t$ is the sum of $\hat{t}$ and $t_m$. The deramped signal is given by (2) [2]-[4], in which $R$ is $R(t_m, R_B)$ in Fig. 1, $R_{ref}$ is the reference range, $c$ is the velocity of light, and $A$ is the amplitude of the deramped signal. $s_r$ and $s_{ref}$ stand for the received signal and the reference signal, respectively. $R_\Delta$ denotes the difference between the instantaneous range and the reference range.
To reduce the required sampling rate, the transmitted and received signals are generally mixed, which is also called dechirp-on-receive [1]. Consequently, the intermediate-frequency signal in (2) is simplified as (3). Using the Taylor approximation, the instantaneous range is simplified as (4), where $\lambda$ is the wavelength, m is the azimuth-dependent distance to the point target, and $f_d$ denotes the Doppler frequency in azimuth, which is also expressed as (5) [2]-[3], in which $\theta_m$ stands for the instantaneous aspect angle [2]. Substituting (4) and (5) into (3) yields (6), where the first exponential term represents the Doppler modulation in azimuth. The second exponential term, which brings a new range migration, is introduced by the antenna movement within a sweep. The third term represents the range signal, and the last exponential term is the Residual Video Phase (RVP), which is essential for the FSA [1]-[3].
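As an illustration of dechirp-on-receive, the following sketch mixes a delayed baseband chirp with a reference chirp and checks that the resulting beat frequency is proportional to the range difference. All numerical values (chirp rate, number of samples, sampling rate, ranges) are illustrative assumptions rather than parameters of the system in this paper, and the carrier term and the platform motion within a sweep are omitted for brevity.

```python
import numpy as np

C = 3.0e8            # speed of light (m/s)
GAMMA = 1.0e12       # assumed chirp rate (Hz/s)
FS = 2.0e6           # assumed sampling rate of the deramped signal (Hz)
N_SAMPLES = 2000     # assumed samples per sweep (1 ms sweep at FS)

def chirp_phase(t_hat):
    """Quadratic phase of a baseband linear FM sweep."""
    return np.pi * GAMMA * t_hat ** 2

def deramp(r, r_ref):
    """Mix the echo from range r with a reference chirp delayed to r_ref."""
    t_hat = np.arange(N_SAMPLES) / FS
    s_r = np.exp(1j * chirp_phase(t_hat - 2.0 * r / C))
    s_ref = np.exp(1j * chirp_phase(t_hat - 2.0 * r_ref / C))
    return s_r * np.conj(s_ref)

def beat_frequency(signal):
    """Frequency (Hz) of the spectral peak of the deramped signal."""
    spectrum = np.fft.fft(signal)
    freqs = np.fft.fftfreq(signal.size, d=1.0 / FS)
    return freqs[np.argmax(np.abs(spectrum))]

r_ref, r = 1000.0, 1030.0
f_measured = beat_frequency(deramp(r, r_ref))
# With this mixing convention the beat frequency is -2 * gamma * R_delta / c.
f_expected = -2.0 * GAMMA * (r - r_ref) / C
print(f_measured, f_expected)
```

Because the beat frequency depends only on the range difference, the deramped signal can be sampled at a rate far below the swept bandwidth, which is the motivation given in the text.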

Modularization of the Extended FSA
Among the frequency-domain algorithms, RDA is not suitable for high-resolution, wide-swath SAR imaging, and CSA assumes a linear frequency modulated signal, which makes it unsuitable for the deramped signal of FMCW SAR. RMA requires time-consuming interpolation, which is not suitable for real-time processing. Consequently, in this paper, the extended FSA is selected as the real-time processing algorithm.
It is generally known that SAR imaging is a two-dimensional process, i.e., the received data is processed in both the range and azimuth directions. However, the SAR raw data is recorded range line by range line, which means that samples adjacent in azimuth (cross-range) are separated in memory. Moreover, the processor favors data in contiguous memory, which can be accessed in burst mode [8]. Therefore, the data should be realigned into adjacent addresses before the azimuth processing. Furthermore, because some range-dimension operations use compensation factors that are space-variant in azimuth, the whole data matrix must be processed in range before the azimuth processing starts. Consequently, from the perspective of real-time processing, consecutive operations in the same dimension should be grouped together. Based on the analysis above, the imaging algorithm is modularized by processing dimension, and a corner turning operation is performed after the processing of each dimension. The modularized processing flow of the extended FSA is shown in Fig. 2 [2]. As shown in Fig. 2, the extended FSA is divided into five serial modules, and each module comprises three steps: Data-getting, Data-processing and Data-storing. Module 1 and module 3 process the range data, and the azimuth data is processed by module 2 and module 5. Module 4 carries out the time-consuming autofocus. The modules are executed serially, each on the same parallel processing architecture described below.
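The dimension-wise modularization can be sketched as follows: each module reads complete contiguous lines, processes them, and stores the result transposed, so that the next module again reads contiguous lines. The per-line operation here is a placeholder (a plain FFT), not the actual operations of the extended FSA, and the matrix is scaled down for illustration; `run_module` is a hypothetical name.

```python
import numpy as np

def run_module(buffer_a, process_line):
    """Data-getting, Data-processing and Data-storing for one module.

    Each row of buffer_a is one contiguous line. The processed line is
    written as a column of the output, so the corner turning is part of
    the storing step.
    """
    n_lines, n_samples = buffer_a.shape
    buffer_b = np.empty((n_samples, n_lines), dtype=buffer_a.dtype)
    for i in range(n_lines):
        line = buffer_a[i, :]            # Data-getting (contiguous read)
        result = process_line(line)      # Data-processing
        buffer_b[:, i] = result          # Data-storing (transposed write)
    return buffer_b

rng = np.random.default_rng(0)
raw = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))
raw = raw.astype(np.complex64)           # 8 range lines of 16 samples each

after_range = run_module(raw, np.fft.fft)            # rows were range lines
after_azimuth = run_module(after_range, np.fft.fft)  # rows are now azimuth lines
```

After two modules the data is back in its original orientation, and in this toy case the result equals a two-dimensional FFT of the raw matrix.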

Parallel Processing of Imaging Module
Considering the limited board payload and the strict timing requirement, only one eight-core DSP chip, the TMS320C6678, is mounted on the FMCW SAR processing board as the real-time processing kernel. It offers a peak performance of 320 GMACS for fixed-point and 160 GFLOPS for floating-point operations [8]. To process the SAR data within the specified period, the data is processed in parallel on the compact processing board shown in Fig. 3.
As shown in Fig. 3, the FMCW SAR processing board measures about 14 cm × 14 cm and carries an eight-core DSP chip (TMS320C6678) together with four 256 MB Double Data Rate 3 Synchronous Dynamic Random Access Memory (DDR3 SDRAM) chips, which serve as buffer memory during real-time processing. Corner turning (matrix transposition) is a key step in SAR data processing and also a bottleneck of real-time SAR imaging. Improving the efficiency of corner turning is therefore of great value for real-time processing. The TMS320C6678 uses the Enhanced Direct Memory Access 3 (EDMA3) module to transfer data efficiently. It supports two addressing modes, constant addressing and increment addressing [8]-[9]. In the following parallel processing framework, EDMA3 constant addressing is used to fetch the data to be processed, and EDMA3 increment addressing is used to realize the corner turning. Based on the characteristics of the serial modularized algorithm and the EDMA3 corner turning operation, we divide the pending data into eight parts to map each module in Fig. 2 onto the eight cores. The resulting parallel processing architecture is presented in Fig. 4. In this two-part architecture, the eight-core DSP is the processing unit, and the DDR3 SDRAM buffers the high-volume data. Buffer A in DDR3 SDRAM holds the pending data, and Buffer B stores the processed and transposed data.
In Fig. 4, Buffer A holds 4096 × 4096 (row × column) samples in complex single-precision floating-point format (8 bytes per sample), giving a data volume of 128 MB. The pending data in Buffer A is divided equally into eight parts, one for each core of the DSP chip. Each part is 0.5 K × 4 K in size and is processed by its corresponding core. As mentioned in [10], the speed of EDMA3 corner turning increases as the number of columns of Buffer B decreases. However, fewer columns mean that more sub-matrices are needed to store the processed data, which increases the processing complexity of the next module. As a compromise, Buffer B is divided into two 4 K × 2 K sub-matrices, which store the processed data via EDMA3 corner turning at a measured speed of 478 MBps on the processor board in Fig. 3 [10]. In Buffer B, the first sub-matrix stores the processed and transposed data of the first four cores, and the second sub-matrix stores the data of the last four cores.
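The partitioning of Buffer A and the layout of Buffer B can be checked with a short sketch. It assumes complex single-precision samples (8 bytes each, which is what the stated 128 MB implies) and uses a scaled-down matrix for the transposition check; the helper names `core_rows`, `buffer_b_location` and `corner_turn` are hypothetical.

```python
import numpy as np

N_CORES = 8
N_SUB = 2   # Buffer B is split into two sub-matrices

def core_rows(core, n_rows):
    """Rows of Buffer A assigned to one core (equal split)."""
    per_core = n_rows // N_CORES
    return range(core * per_core, (core + 1) * per_core)

def buffer_b_location(row, n_rows):
    """Sub-matrix index and column where a processed row of Buffer A lands."""
    cols_per_sub = n_rows // N_SUB
    return row // cols_per_sub, row % cols_per_sub

def corner_turn(buffer_a):
    """Store every row of Buffer A as a column of the matching sub-matrix."""
    n_rows, n_cols = buffer_a.shape
    subs = [np.empty((n_cols, n_rows // N_SUB), dtype=buffer_a.dtype)
            for _ in range(N_SUB)]
    for core in range(N_CORES):
        for row in core_rows(core, n_rows):
            sub, col = buffer_b_location(row, n_rows)
            subs[sub][:, col] = buffer_a[row, :]
    return subs

# Full-size bookkeeping from the text: 4096 x 4096 samples.
BYTES_PER_SAMPLE = np.dtype(np.complex64).itemsize       # assumed sample format
volume_mb = 4096 * 4096 * BYTES_PER_SAMPLE / 2**20       # data volume of Buffer A
line_kb = 4096 * BYTES_PER_SAMPLE / 2**10                # size of one line

# Scaled-down check that the two sub-matrices together hold the transpose.
small = np.arange(16 * 16).reshape(16, 16).astype(np.complex64)
sub_matrices = corner_turn(small)
```

Under these assumptions one line of 4096 samples occupies 32 KB, which matches the size of each PingPong buffer, and rows 0 to 2047 (cores 1 to 4) fall into the first sub-matrix while rows 2048 to 4095 fall into the second.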
As shown in Fig. 4, to realize efficient PingPong processing, each core is equipped with four 32 KB buffers (Ping IN, Ping OUT, Pong IN and Pong OUT) in the Multicore Shared Memory (MSM) SRAM. The data in the PingPong IN buffers is fetched from Buffer A by EDMA3 burst access with a throughput of 5 GBps [10], and the processed data in the PingPong OUT buffers is stored into Buffer B by EDMA3 corner turning. The PingPong processing flow of a single core is shown in Fig. 5. In Fig. 5, the left half shows the processing of Pong data, and the right half shows the processing of Ping data. Before the Pong data processing, the DSP core triggers the corner turning of the processed line (n − 1) in the Ping OUT buffer and the burst access of line (n + 1) in Buffer A. Once these transfers are triggered, the EDMA3 module transposes the processed line (n − 1) from the Ping OUT buffer into Buffer B and transfers line (n + 1) from Buffer A into the Ping IN buffer. In the meantime, the DSP core processes line n in the Pong IN buffer and caches the result in the Pong OUT buffer. The next cycle processes the Ping data: while the EDMA3 module performs the corner turning of the processed line n and the burst access of line (n + 2), the DSP core processes line (n + 1). Each core in Fig. 4 executes the PingPong operation shown in Fig. 5, and with this overlap the share of data transmission in the total time consumption is greatly reduced.
To avoid bus conflicts between the cores, the done signal of one core's corner turning is used to trigger the next core's corner turning. This done signal is labeled as Inter-Processor Communication (IPC) in Fig. 4. On the whole, with the proposed parallel processing architecture, not only do the cores process data in parallel, but each core's data transmission also runs concurrently with its data processing, which maximizes the time reduction.
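The PingPong flow of one core can be traced with a small scheduling sketch. It only models which line is stored, processed and fetched in each cycle; real EDMA3 transfers, interrupts and the IPC chaining between cores are not modeled, the assignment of even lines to the Ping buffers is arbitrary, and the function name is hypothetical.

```python
def pingpong_schedule(n_lines):
    """Yield (compute buffers, line stored, line processed, line fetched).

    In the cycle that processes line n, the EDMA3 stores the already
    processed line n - 1 (corner turning) and fetches line n + 1 (burst
    access) through the other buffer pair. None marks an activity that
    does not take place in that cycle.
    """
    for n in range(-1, n_lines + 1):
        processed = n if 0 <= n < n_lines else None
        stored = n - 1 if 1 <= n <= n_lines else None
        fetched = n + 1 if n + 1 < n_lines else None
        buffers = None if processed is None else ("Ping" if n % 2 == 0 else "Pong")
        yield buffers, stored, processed, fetched

schedule = list(pingpong_schedule(4))
for buffers, stored, processed, fetched in schedule:
    print(f"core uses {buffers}: store {stored}, process {processed}, fetch {fetched}")
```

Apart from the first cycle (fetch only) and the last cycle (store only), every cycle overlaps one store, one computation and one fetch, which is why the transfer time disappears from the total whenever the computation is the longest of the three.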

Time Sequence of Parallel Processing
To give better insight into the parallel processing architecture, the time-sequence diagrams of single-core processing and N-core parallel processing are shown in Fig. 6 and Fig. 7, respectively. In these diagrams, $T_{Pro}$ is the processing time of a single line of data, $T_G$ is the time to get a line of pending data, and $T_S$ is the time to store a line of processed data by corner turning. The time axes $t_{core}$ and $t_{edma}$ run synchronously. As depicted in Fig. 6, the Ping data transmission and the Pong data processing proceed simultaneously after Data-storing and Data-getting are triggered. The longer of the transmission time and the processing time determines the total time consumption. The time sequence of N-core parallel processing is presented in Fig. 7. In theory, if $T_{Pro}$ is larger than $N$ times the per-line transmission time, i.e., $N (T_G + T_S)$, the module time consumption is $1/N$ of the total single-core processing time. Otherwise, the module time consumption is close to the total transmission time:

$$T_N = \begin{cases} T_{Pro} \cdot N_a / N, & T_{Pro} > N (T_G + T_S) \\ (T_G + T_S) \cdot N_a, & \text{otherwise} \end{cases} \qquad (7)$$

where $N$ and $N_a$ denote the number of processing cores and the number of samples in azimuth, respectively, and $T_N$ denotes the module time under N-core parallel processing. This characteristic favors sophisticated SAR algorithms, which pursue higher-resolution imagery.
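The two regimes described above can be written as a small function. This is a sketch of the idealized timing model only: it ignores synchronization overhead, and the function name, argument names and numerical values (arbitrary time units) are hypothetical.

```python
def module_time(t_pro, t_g, t_s, n_cores, n_lines):
    """Idealized module time for n_cores working in parallel.

    t_pro: processing time of one line on one core
    t_g, t_s: time to get / store (corner turn) one line
    n_lines: number of lines in the module
    """
    transfer = t_g + t_s
    if t_pro > n_cores * transfer:
        return t_pro * n_lines / n_cores   # compute bound: transfers are hidden
    return transfer * n_lines              # transfer bound: computation is hidden

# Hypothetical values in arbitrary time units.
light = module_time(t_pro=20, t_g=1, t_s=9, n_cores=8, n_lines=1000)
heavy = module_time(t_pro=200, t_g=1, t_s=9, n_cores=8, n_lines=1000)
print(light, heavy)
```

The two branches meet at `t_pro == n_cores * (t_g + t_s)`, so the model is continuous at the boundary between the regimes.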

Time Consumption
To validate the proposed parallel architecture, the time consumption under different conditions is tested on the FMCW SAR processing board shown in Fig. 3. The tested data size and data volume are 4096 × 4096 (range × azimuth) and 128 MB, respectively. As mentioned in [10], $T_G$ and $T_S$ for 32 KB of contiguous data are 6.5 µs and 65.7 µs, respectively. Thus the total transmission time is (6.5 µs + 65.7 µs) × 4096 ≈ 0.296 s, which is the baseline in Fig. 8. In Fig. 8, $T_1$, $T_2$, $T_4$ and $T_8$ denote the time consumption of single-core, two-core, four-core and eight-core processing, respectively. As shown in Fig. 8, the module time grows with the processing complexity. For N-core parallel processing, if $T_{Pro}$ is larger than $N (T_G + T_S)$, the overall time consumption is compressed to $(T_{Pro} \cdot N_a) / N$. Otherwise, the total time becomes $(T_G + T_S) \cdot N_a$. The test result in Fig. 8 agrees well with (7). For comparison, the corner turning operation in [7] is a separate process, and each operation costs 100 ms for an image size of 4 K × 4 K. The four corner turning operations in the extended FSA would therefore cost 400 ms in real-time processing. The proposed processing architecture makes full use of the characteristics of EDMA3 [9]. In computationally intensive cases, the time consumption of the corner turning operation is completely hidden. Moreover, even if $T_{Pro}$ is less than $N (T_G + T_S)$, other operations (e.g., preprocessing or data compression) can still be performed during the corner turning to increase the efficiency of the proposed architecture. In short, with the proposed parallel processing strategy, the data processing time decreases linearly with the number of cores. Most importantly, the Data-getting time and the corner turning time become negligible in computationally intensive cases.
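The figures quoted in this section can be reproduced with a few lines of arithmetic. The values are taken from the text; the variable names are illustrative.

```python
T_G = 6.5e-6       # s, burst fetch of one 32 KB line [10]
T_S = 65.7e-6      # s, corner-turning store of one 32 KB line [10]
N_A = 4096         # lines per module
N_CORES = 8

# Total transmission time of one module (the baseline in Fig. 8).
baseline = (T_G + T_S) * N_A

# Per-line processing time above which eight cores hide all transfers.
threshold = N_CORES * (T_G + T_S)

# Cost of four separate corner turns at 100 ms each, as reported for [7].
separate_corner_turning = 4 * 100e-3

print(f"baseline  = {baseline:.3f} s")
print(f"threshold = {threshold * 1e6:.1f} us per line")
print(f"separate corner turning = {separate_corner_turning:.1f} s")
```

A module whose per-line processing time exceeds the threshold is compute bound on eight cores, while a lighter module takes roughly the baseline time regardless of the number of cores.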

Experimental Validation

Real-Time Imaging for Airborne SAR
The airborne test has been performed with a compact FMCW SAR system. As shown in Fig. 9, this compact FMCW SAR system is mounted on a light airplane. The main parameters of the airborne real-time imaging are listed in Tab. 1. The data matrix contains 4096 samples in azimuth at a Pulse Repetition Frequency (PRF) of 1000 Hz, so the recording time is 4.096 s. To guarantee an entirely focused image, consecutive data blocks overlap by 25% in azimuth. Hence, the time limit for real-time processing is 0.75 × 4.096 s = 3.072 s.
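The real-time limit follows from the block length, the PRF and the overlap: with a 25% overlap, each block contributes only 75% new azimuth lines, so a block must be finished before that many new lines have been recorded. A minimal sketch, with a hypothetical function name:

```python
def realtime_budget(n_azimuth, prf_hz, overlap):
    """Return (recording time, processing time limit) of one block in seconds."""
    recording_time = n_azimuth / prf_hz
    return recording_time, recording_time * (1.0 - overlap)

recording, budget = realtime_budget(n_azimuth=4096, prf_hz=1000.0, overlap=0.25)
print(f"recording time = {recording:.3f} s, processing limit = {budget:.3f} s")
```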
As shown in Fig. 2, the entire real-time processing flow is divided into five serial modules. Module 1 contains only first-order motion compensation, and module 2 contains only the azimuth FFT and the Frequency-Scaling operation. Module 3 and module 5 require multiple FFT/IFFT operations and compensation of space-variant factors. Module 4 is the time-consuming autofocus processing. To ensure processing precision, no lookup table is used. To illustrate the efficiency of the parallel strategy, the time consumption of each module in the single-core and eight-core cases is listed in Tab. 2. As shown in Tab. 2, the times of modules 1 and 2 have been compressed to the baseline in Fig. 8. Modules 3, 4 and 5 represent the computationally intensive tasks, and their eight-core processing times ($T_8$) are much smaller than the single-core times ($T_1$); for example, module 5 achieves a speed-up ratio of 6.67. Considering inter-core synchronization and communication, this value indicates a high parallel efficiency. The processing time of the extended FS algorithm is reduced from 11.399 s to 2.178 s, which fully meets the real-time requirement.
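The reported totals can be put in perspective with the usual definitions of speed-up and parallel efficiency. The numbers are those quoted in the text; the helper names are illustrative.

```python
def speedup(t_single, t_parallel):
    """Ratio of single-core time to parallel time."""
    return t_single / t_parallel

def parallel_efficiency(speedup_ratio, n_cores):
    """Fraction of the ideal n_cores-fold speed-up that is achieved."""
    return speedup_ratio / n_cores

total_speedup = speedup(11.399, 2.178)              # whole extended FSA
module5_efficiency = parallel_efficiency(6.67, 8)   # module 5 on eight cores
margin = 3.072 - 2.178                              # s left within the real-time limit

print(f"total speed-up = {total_speedup:.2f}")
print(f"module 5 efficiency = {module5_efficiency:.2f}")
print(f"margin = {margin:.3f} s")
```

The overall speed-up is lower than that of module 5 because modules 1 and 2 are transfer bound and cannot drop below the baseline time.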

SAR Imagery of Real-time Processing
Using the proposed parallel processing strategy, a real-time image of an island area, overlaid onto an optical photograph, is shown in Fig. 10. The real-time processing yields good results, and the SAR image aligns well with the optical scene. Moreover, characteristic targets such as roads and buildings are well focused. This imagery demonstrates the effectiveness of the proposed approach for real-time processing.

Performance Analysis of Corner Reflectors
To evaluate the quality of the real-time processed image, several corner reflectors were placed on an airport. The real-time imagery with the corner reflectors is shown in Fig. 11. As shown in Fig. 11, the imaging scene is 350 m × 420 m (range × azimuth). Corner reflectors #1 to #5 in Fig. 11 are five point targets. To demonstrate the imaging performance, the two-dimensional Impulse Response Width (IRW), Peak Side Lobe Ratio (PSLR) and Integrated Side Lobe Ratio (ISLR) of corner reflector #2 (P2) in the scene are listed in Tab. 3. Because the implemented algorithms vary, it is difficult to compare architectures for SAR applications fairly [7]; nevertheless, Tab. 4 compares the performance of our system with that of MIRANDA35 from FHR [5]-[6]. With Hamming weighting, the measured range and azimuth resolutions are 34 cm and 51 cm, respectively, which are very close to the theoretical values. The PSLR and ISLR results are good in both directions. As shown in Tab. 3 and Tab. 4, the analysis results demonstrate a high focusing quality, and the efficiency of the proposed parallel processing architecture is also validated.
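For readers unfamiliar with the three quality measures, the following sketch estimates IRW, PSLR and ISLR from a one-dimensional profile through a point target. It is not the evaluation code used for Tab. 3: the function name is hypothetical, the main lobe is taken as the region between the first nulls, and the test profile is an ideal unweighted sinc rather than the Hamming-weighted response of the real data.

```python
import numpy as np

def point_target_metrics(profile, spacing):
    """Estimate IRW, PSLR (dB) and ISLR (dB) from a 1-D profile.

    profile: samples through the target peak (assumed finely sampled)
    spacing: sample spacing; the IRW is returned in the same unit
    """
    power = np.abs(profile) ** 2
    peak = int(np.argmax(power))

    # -3 dB width: contiguous samples around the peak above half power.
    left = peak
    while left > 0 and power[left - 1] >= 0.5 * power[peak]:
        left -= 1
    right = peak
    while right < power.size - 1 and power[right + 1] >= 0.5 * power[peak]:
        right += 1
    irw = (right - left + 1) * spacing

    # First nulls: walk downhill from the peak until the power rises again.
    null_l = peak
    while null_l > 0 and power[null_l - 1] < power[null_l]:
        null_l -= 1
    null_r = peak
    while null_r < power.size - 1 and power[null_r + 1] < power[null_r]:
        null_r += 1

    mainlobe = power[null_l:null_r + 1]
    sidelobes = np.concatenate((power[:null_l], power[null_r + 1:]))
    pslr_db = 10.0 * np.log10(sidelobes.max() / power[peak])
    islr_db = 10.0 * np.log10(sidelobes.sum() / mainlobe.sum())
    return irw, pslr_db, islr_db

# Ideal unweighted response, sampled at 1/100 of a resolution cell.
x = np.linspace(-20.0, 20.0, 4001)
irw, pslr_db, islr_db = point_target_metrics(np.sinc(x), spacing=0.01)
print(f"IRW = {irw:.2f} cells, PSLR = {pslr_db:.1f} dB, ISLR = {islr_db:.1f} dB")
```

For the unweighted sinc the PSLR is close to the well-known value of about −13.3 dB; a Hamming weighting, as used in the paper, lowers the sidelobes at the cost of a wider main lobe.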

Conclusion
In this paper, an efficient parallel processing strategy for real-time SAR imaging is proposed. With the proposed method, a single eight-core DSP is sufficient for the high-resolution real-time processing of FMCW SAR. Based on the proposed parallel processing strategy, the imaging algorithm is readily realized in a modular design. More importantly, the time consumption of SAR imaging is considerably reduced. The performance of the proposed parallel processing strategy has been tested on a compact processing board. Moreover, the efficiency of real-time processing has been verified through airborne tests, with good results.

Fig. 10. The real-time image overlaid onto an optical photo.

Tab. 4. The performance comparison with MIRANDA35.

The azimuth and range profiles of corner reflector 2 in Fig. 11 are shown in Fig. 12(a) and Fig. 12(b), respectively.
Tab. 2. Time consumption summary of FS imaging modules.