Parallel Computing

Volume 29, Issue 7, July 2003, Pages 851-878

A high-performance communication service for parallel computing on distributed DSP systems

https://doi.org/10.1016/S0167-8191(03)00061-9

Abstract

Rapid increases in the complexity of algorithms for real-time signal processing applications have led to performance requirements exceeding the capabilities of conventional digital signal processor (DSP) architectures. Many applications, such as autonomous sonar arrays, are distributed in nature and amenable to parallel computing on embedded systems constructed from multiple DSPs networked together. However, to realize the full potential of such applications, a lightweight service for message-passing communication and parallel process coordination is needed that is able to provide high throughput and low latency while minimizing processor and memory utilization. This paper presents the design and analysis of such a service, based on the Message Passing Interface (MPI) specification, for unicast and collective communications.

Introduction

With their emphasis on low power consumption and potent computational capability for signal processing applications, it is not surprising that digital signal processors (DSPs) have been employed in a multitude of applications. As in the general-purpose processor arena, the computational power of these special-purpose processors has continued to increase, providing the designer even more flexibility. However, many advanced signal processing applications continue to increase in complexity and require more computational power than a single processor can provide. To cope with these extreme demands, parallel processing techniques must often be employed.

Many applications targeted for DSPs, such as sonar array signal processing, are distributed in nature. Sonar system researchers have proposed using smart sensor nodes (i.e. each sensor node with its own processor) networked together in a distributed system to disperse the large computational burden imposed by the sensing algorithm [9], [10], [11]. Many of these remote-sensing applications require long distances between the sensing elements for accurate operation. Additionally, the sensing algorithms typically generate large amounts of inter-processor communication to distribute the computational burden among the smart nodes. Therefore, the network communication between these elements is just as critical to the performance of the application as the processing performed at the sensing locality. Although several systems have been proposed, the lack of efficient communication services for embedded, distributed DSP systems has limited such research to general-purpose clusters of workstations.

Distributed computing has been extensively researched and proven a viable option for parallel applications. Several techniques have been explored to provide efficient communications between processing nodes; however, the standardization of the message passing interface (MPI) specification [22] has made it the dominant choice for communication services in distributed computing [15]. Consequently, MPI has been explored on nearly every network architecture available [3], [7], [8], [13], [14], [19], [24], [25] and has the proven performance and functionality required for most parallel applications. Most of this research involved general-purpose processors on a standardized network protocol. McMahon and Skjellum [21] did investigate the importance of reducing the full implementation to fit the limited memory space of an embedded system; however, their work did not take advantage of the hardware to provide the most efficient unicast and collective communications.

Analytical modeling of network performance has also been investigated extensively for distributed systems. Again, several techniques have been introduced, but the LogP [4] and LogGP [1] models have become de facto standards. These models have provided the basis for research in many areas of distributed computing, from assessment of network interface performance [5] to optimal broadcast and summation algorithms [17]. There are numerous examples available in a wide range of studies [6], [18], [20] that have used the LogP and LogGP models as a framework for performance and tradeoff analysis.
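To make the modeling framework concrete, the sketch below evaluates the standard LogP and LogGP cost expressions for a point-to-point message. The parameter values in the example are purely hypothetical illustrations (microseconds), not measurements from any system discussed in this paper.

```python
def logp_unicast(L: float, o: float) -> float:
    """Small-message point-to-point time under LogP:
    send overhead + wire latency + receive overhead = 2o + L."""
    return o + L + o

def loggp_unicast(L: float, o: float, G: float, k: int) -> float:
    """k-byte point-to-point time under LogGP, which extends LogP
    with a per-byte Gap G for long messages: 2o + L + (k - 1)G."""
    return 2 * o + L + (k - 1) * G

# Hypothetical parameters: L = 5 us, o = 1 us, G = 0.01 us/byte.
t_small = logp_unicast(L=5.0, o=1.0)
t_large = loggp_unicast(L=5.0, o=1.0, G=0.01, k=1001)
```

Under these assumed parameters, a minimal message costs 2·1 + 5 = 7 µs, and a 1001-byte transfer costs 2·1 + 5 + 1000·0.01 ≈ 17 µs, showing how the Gap term dominates as messages grow.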

While such services have been widely examined for general-purpose systems, little research has investigated communication and synchronization services for special-purpose DSPs in arrays for distributed processing applications. This paper builds on proven techniques for general-purpose systems by providing a lightweight, MPI-compatible communication and coordination service for distributed DSP arrays assuming no network hardware support. The design leverages the architectural features of the DSP to provide low latency and high throughput on both unicast and collective communications. The system is compared to high-speed networks typically associated with distributed computing to evaluate its strengths and weaknesses. In addition, the LogP and LogGP framework is used to model the network communications to provide an accurate assessment of the design’s performance and scalability. In doing so, the effects of improved processor clock rate and network bandwidth on several communication functions are assessed.

Section 2 describes the distributed DSP system architecture used as a basis for this study, while Section 3 describes the design of a communication service suitable for a wide range of DSP arrays. Section 4 compares the performance of the system and service against other network architectures and topologies generally employed in distributed computing systems. Next, the network performance is modeled and validated using the LogP and LogGP concepts as a framework in Section 5. Section 6 then explores varying model parameters for an enhanced system using the previous modeling techniques to examine the tradeoffs between clock rate and network throughput. Finally, Section 7 provides conclusions and suggested directions for future research.

Section snippets

Distributed DSP system architecture

The similar application environments and design criteria for DSPs have caused many to converge to a reasonably common framework. They are not directly interchangeable, but the basic hardware primitives provided by DSP architectures for elementary communications (e.g., external access ports, integrated direct memory access (DMA) controllers, and internal SRAM) are common to devices available from multiple vendors. The similarity of these features allows a lightweight communication service to be …

MPI communication service design

MPI, like most other network-oriented middleware services, communicates data from one process to another across a network. TCP/IP sockets and other similar protocols deliver the same general functionality. However, MPI’s higher level of abstraction provides an easy-to-use interface more appropriate for distributed, parallel computing applications. The MPI paradigm assumes a distributed-memory processing model, where each node has its own local address space and computes its data independently.
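The distributed-memory model described above pairs naturally with the ring topology the design leverages (noted in the conclusion). Purely as an illustrative sketch, and not the paper's actual implementation, a broadcast on a unidirectional store-and-forward ring can be simulated as follows:

```python
def ring_broadcast_hops(num_nodes: int, root: int = 0) -> list:
    """Simulate a broadcast on a unidirectional ring: each node
    forwards the message to its downstream neighbour, so reaching
    all nodes takes num_nodes - 1 sequential hops."""
    hops = []
    current = root
    for _ in range(num_nodes - 1):
        downstream = (current + 1) % num_nodes
        hops.append((current, downstream))
        current = downstream
    return hops
```

For four nodes rooted at node 0 this yields the hops (0, 1), (1, 2), (2, 3); under the LogP model, such a broadcast completes in roughly (P − 1)(L + 2o) time, which is the kind of collective-communication cost the service design must minimize.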

Performance analysis

This section explores the performance of the MPI-SHARC design in comparison with several network architectures commonly associated with distributed, parallel systems. Included is a range of topologies, protocols, and MPI service implementations to provide a brief cross-sectional study of distributed network architectures. Additionally, every function included in the MPI-SHARC design is evaluated through direct testing or implied by an understanding of the underlying communication pattern.

System modeling and validation

To explore the effects of system parameter tradeoffs on the MPI-SHARC communication service without implementing costly physical hardware requires the use of a model. While many interconnect performance models have been proposed, the LogGP model, a superset of the LogP model, most accurately conforms to the characteristics of the MPI-SHARC communication service and was adopted and extended. LogGP models have been employed in the literature on a wide range of algorithms …

System tradeoff analysis

With a model for the MPI-SHARC communication service, it is possible to ascertain performance with various tradeoffs in the system architecture. Consistent with the trend in general-purpose processors, the SHARC and other DSP families are constantly being improved and superseded by faster designs. The next generation of the SHARC, the TigerSHARC, increases the processor clock rate to 150 MHz and the link port bandwidth to 150 MB/s. Although the clock rate increases by almost a factor …
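One way to sketch this kind of tradeoff, purely as a back-of-the-envelope exercise under the LogGP model: assume the software overhead o scales inversely with clock rate and the per-byte Gap G inversely with link bandwidth, while the wire latency L stays fixed. The baseline numbers below are illustrative assumptions, not the validated MPI-SHARC parameters from Section 5.

```python
def loggp_time(L: float, o: float, G: float, k: int) -> float:
    # k-byte unicast under LogGP: overheads + latency + per-byte Gap
    return 2 * o + L + (k - 1) * G

def us_per_byte(mb_per_s: float) -> float:
    # 1 byte at X MB/s takes 1/X microseconds
    return 1.0 / mb_per_s

# Hypothetical baseline: 40 MHz clock, 40 MB/s link ports (times in us).
base = dict(L=2.0, o=1.5, G=us_per_byte(40))
# TigerSHARC-like scaling: 150 MHz clock, 150 MB/s link ports;
# overhead assumed to shrink with the clock-rate ratio.
fast = dict(L=2.0, o=1.5 * (40 / 150), G=us_per_byte(150))

k = 4096
speedup = loggp_time(**base, k=k) / loggp_time(**fast, k=k)
```

Under these assumptions, the predicted speedup for a 4 KB message falls noticeably short of the 3.75× bandwidth ratio because the fixed latency term does not scale, which mirrors the style of tradeoff analysis this section performs with validated parameters.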

Conclusion and future research

This paper has presented, analyzed, and modeled a lightweight communication service targeted for distributed, embedded DSP systems. By providing a reduced version of the widely accepted MPI protocol, the communication service can be readily employed for systems specifically targeted for an embedded, distributed environment such as sonar beamforming systems. Additionally, by leveraging the architectural features common to DSPs, as well as the ring topology, the design has proven to produce …

Acknowledgements

This work was supported in part by grant N00014-99-1-0278 from the Office of Naval Research, and by equipment and software tools provided by vendor sponsors including Nortel Networks, Intel, Dell, and MPI Software Technology.

References (25)

  • Bittware Research Systems, User’s Guide: blacktip-EX, Bittware Research Systems, Concord, NH, Revision 0,...
  • D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: towards a...