Parallel Computing

Volume 29, Issue 7, July 2003, Pages 851-878

A high-performance communication service for parallel computing on distributed DSP systems

https://doi.org/10.1016/S0167-8191(03)00061-9

Abstract

Rapid increases in the complexity of algorithms for real-time signal processing applications have led to performance requirements exceeding the capabilities of conventional digital signal processor (DSP) architectures. Many applications, such as autonomous sonar arrays, are distributed in nature and amenable to parallel computing on embedded systems constructed from multiple DSPs networked together. However, to realize the full potential of such applications, a lightweight service for message-passing communication and parallel process coordination is needed that is able to provide high throughput and low latency while minimizing processor and memory utilization. This paper presents the design and analysis of such a service, based on the Message Passing Interface (MPI) specification, for unicast and collective communications.

Introduction

With their emphasis on low power consumption and potent computational capability for signal processing applications, it is not surprising that digital signal processors (DSPs) have been employed in a multitude of applications. As in the general-purpose processor arena, the computational power of these special-purpose processors has continued to increase, providing the designer even more flexibility. However, many advanced signal processing applications continue to increase in complexity and require more computational power than a single processor can provide. To cope with these extreme demands, parallel processing techniques must often be employed.

Many applications targeted for DSPs, such as sonar array signal processing, are distributed in nature. Sonar system researchers have proposed using smart sensor nodes (i.e. each sensor node with its own processor) networked together in a distributed system to disperse the large computational burden imposed by the sensing algorithm [9], [10], [11]. Many of these remote-sensing applications require long distances between the sensing elements for accurate operation. Additionally, the sensing algorithms typically generate large amounts of inter-processor communication to distribute the computational burden among the smart nodes. Therefore, the network communication between these elements is just as critical to the performance of the application as the processing performed at the sensing locality. Although several systems have been proposed, the lack of efficient communication services for embedded, distributed DSP systems has limited such research to general-purpose clusters of workstations.

Distributed computing has been extensively researched and proven a viable option for parallel applications. Several techniques have been explored to provide efficient communications between processing nodes; however, the standardization of the message passing interface (MPI) specification [22] has made it the dominant choice for communication services in distributed computing [15]. Consequently, MPI has been explored on nearly every network architecture available [3], [7], [8], [13], [14], [19], [24], [25] and has the proven performance and functionality required for most parallel applications. Most of this research involved general-purpose processors on a standardized network protocol. McMahon and Skjellum [21] did investigate the importance of reducing the full implementation to fit the limited memory space of an embedded system; however, their work did not take advantage of the hardware to provide the most efficient unicast and collective communications.

Analytical modeling of network performance has also been investigated extensively for distributed systems. Again, several techniques have been introduced, but the LogP [4] and LogGP [1] models have become de facto standards. These models have provided the basis for research in many areas of distributed computing, from assessment of network interface performance [5] to optimal broadcast and summation algorithms [17]. There are numerous examples available in a wide range of studies [6], [18], [20] that have used the LogP and LogGP models as a framework for performance and tradeoff analysis.
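To make the modeling framework concrete, the sketch below evaluates the standard LogP and LogGP cost expressions for a point-to-point message. The parameter values in the example are purely hypothetical illustrations (microseconds), not measurements from any system discussed in this paper.

```python
def logp_unicast(L: float, o: float) -> float:
    """Small-message point-to-point time under LogP:
    send overhead + wire latency + receive overhead = 2o + L."""
    return o + L + o

def loggp_unicast(L: float, o: float, G: float, k: int) -> float:
    """k-byte point-to-point time under LogGP, which extends LogP
    with a per-byte Gap G for long messages: 2o + L + (k - 1)G."""
    return 2 * o + L + (k - 1) * G

# Hypothetical parameters: L = 5 us, o = 1 us, G = 0.01 us/byte.
t_small = logp_unicast(L=5.0, o=1.0)
t_large = loggp_unicast(L=5.0, o=1.0, G=0.01, k=1001)
```

Under these assumed parameters, a minimal message costs 2·1 + 5 = 7 µs, and a 1001-byte transfer costs 2·1 + 5 + 1000·0.01 ≈ 17 µs, showing how the Gap term dominates as messages grow.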

While such services have been widely examined for general-purpose systems, little research has investigated communication and synchronization services for special-purpose DSPs in arrays for distributed processing applications. This paper builds on proven techniques for general-purpose systems by providing a lightweight, MPI-compatible communication and coordination service for distributed DSP arrays assuming no network hardware support. The design leverages the architectural features of the DSP to provide low latency and high throughput on both unicast and collective communications. The system is compared to high-speed networks typically associated with distributed computing to evaluate its strengths and weaknesses. In addition, the LogP and LogGP framework is used to model the network communications to provide an accurate assessment of the design’s performance and scalability. In doing so, the effects of improved processor clock rate and network bandwidth on several communication functions are assessed.

Section 2 describes the distributed DSP system architecture used as a basis for this study, while Section 3 describes the design of a communication service suitable for a wide range of DSP arrays. Section 4 compares the performance of the system and service against other network architectures and topologies generally employed in distributed computing systems. Next, the network performance is modeled and validated using the LogP and LogGP concepts as a framework in Section 5. Section 6 then explores varying model parameters for an enhanced system using the previous modeling techniques to examine the tradeoffs between clock rate and network throughput. Finally, Section 7 provides conclusions and suggested directions for future research.

Section snippets

Distributed DSP system architecture

The similar application environments and design criteria for DSPs have caused many to converge to a reasonably common framework. They are not directly interchangeable, but the basic hardware primitives provided by DSP architectures for elementary communications (e.g., external access ports, integrated direct memory access (DMA) controllers, and internal SRAM) are common to devices available from multiple vendors. The similarity of these features allows a lightweight communication service to be …

MPI communication service design

MPI, like most other network-oriented middleware services, communicates data from one process to another across a network. TCP/IP sockets and other similar protocols deliver the same general functionality. However, MPI’s higher level of abstraction provides an easy-to-use interface more appropriate for distributed, parallel computing applications. The MPI paradigm assumes a distributed-memory processing model, where each node has its own local address space and computes its data independently.
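The distributed-memory model described above pairs naturally with the ring topology the design leverages (noted in the conclusion). Purely as an illustrative sketch, and not the paper's actual implementation, a broadcast on a unidirectional store-and-forward ring can be simulated as follows:

```python
def ring_broadcast_hops(num_nodes: int, root: int = 0) -> list:
    """Simulate a broadcast on a unidirectional ring: each node
    forwards the message to its downstream neighbour, so reaching
    all nodes takes num_nodes - 1 sequential hops."""
    hops = []
    current = root
    for _ in range(num_nodes - 1):
        downstream = (current + 1) % num_nodes
        hops.append((current, downstream))
        current = downstream
    return hops
```

For four nodes rooted at node 0 this yields the hops (0, 1), (1, 2), (2, 3); under the LogP model, such a broadcast completes in roughly (P − 1)(L + 2o) time, which is the kind of collective-communication cost the service design must minimize.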

Performance analysis

This section explores the performance of the MPI-SHARC design in comparison with several network architectures commonly associated with distributed, parallel systems. Included is a range of topologies, protocols, and MPI service implementations to provide a brief cross-sectional study of distributed network architectures. Additionally, every function included in the MPI-SHARC design is evaluated through direct testing or implied by an understanding of the underlying communication pattern.

System modeling and validation

To explore the effects of system parameter tradeoffs on the MPI-SHARC communication service without implementing costly physical hardware requires the use of a model. While many interconnect performance models have been proposed, the LogGP model, a superset of the LogP model, most accurately conforms to the characteristics of the MPI-SHARC communication service and was adopted and extended. LogGP models have been employed in the literature on a wide range of algorithms …

System tradeoff analysis

With a model for the MPI-SHARC communication service, it is possible to ascertain performance with various tradeoffs in the system architecture. Consistent with the trend in general-purpose processors, the SHARC and other DSP families are constantly being improved and superseded by faster designs. The next generation of the SHARC, the TigerSHARC, increases the processor clock rate to 150 MHz and the link port bandwidth to 150 MB/s. Although the clock rate increases by almost a factor …
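One way to sketch this kind of tradeoff, purely as a back-of-the-envelope exercise under the LogGP model: assume the software overhead o scales inversely with clock rate and the per-byte Gap G inversely with link bandwidth, while the wire latency L stays fixed. The baseline numbers below are illustrative assumptions, not the validated MPI-SHARC parameters from Section 5.

```python
def loggp_time(L: float, o: float, G: float, k: int) -> float:
    # k-byte unicast under LogGP: overheads + latency + per-byte Gap
    return 2 * o + L + (k - 1) * G

def us_per_byte(mb_per_s: float) -> float:
    # 1 byte at X MB/s takes 1/X microseconds
    return 1.0 / mb_per_s

# Hypothetical baseline: 40 MHz clock, 40 MB/s link ports (times in us).
base = dict(L=2.0, o=1.5, G=us_per_byte(40))
# TigerSHARC-like scaling: 150 MHz clock, 150 MB/s link ports;
# overhead assumed to shrink with the clock-rate ratio.
fast = dict(L=2.0, o=1.5 * (40 / 150), G=us_per_byte(150))

k = 4096
speedup = loggp_time(**base, k=k) / loggp_time(**fast, k=k)
```

Under these assumptions, the predicted speedup for a 4 KB message falls noticeably short of the 3.75× bandwidth ratio because the fixed latency term does not scale, which mirrors the style of tradeoff analysis this section performs with validated parameters.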

Conclusion and future research

This paper has presented, analyzed, and modeled a lightweight communication service targeted for distributed, embedded DSP systems. By providing a reduced version of the widely accepted MPI protocol, the communication service can be readily employed for systems specifically targeted for an embedded, distributed environment such as sonar beamforming systems. Additionally, by leveraging the architectural features common to DSPs, as well as the ring topology, the design has proven to produce …

Acknowledgements

This work was supported in part by grant N00014-99-1-0278 from the Office of Naval Research, and by equipment and software tools provided by vendor sponsors including Nortel Networks, Intel, Dell, and MPI Software Technology.

References (25)

  • Bittware Research Systems, User’s Guide: blacktip-EX, Bittware Research Systems, Concord, NH, Revision 0,...
  • D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: towards a...