Fault Tolerance and High Availability in Data Stream Management Systems

  • Living reference work entry
Encyclopedia of Database Systems

Definition

Like any other software system, a data stream management system (DSMS) can experience failures in any of its components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local-area network (LAN) or a wide-area network (WAN). Failures of processing nodes or of the underlying communication network can cause continuous queries (CQs) in a DSMS to stall or produce erroneous results. These failures can adversely affect critical client applications that rely on these queries.

Traditionally, availability has been defined as the fraction of time that a system remains operational and properly services requests. In DSMSs, however, availability often also incorporates end-to-end latencies, because applications need to react quickly to real-time events and can therefore tolerate only small delays. A DSMS can handle failures using a...
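
The two notions of availability described above can be illustrated with a minimal sketch. The names `Interval`, `classic_availability`, `latency_aware_availability`, and the `max_latency_s` bound are hypothetical and not taken from the entry. In particular, counting a result as "available" only when its end-to-end latency stays within a fixed bound is one possible reading of latency-aware availability, assumed here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Interval:
    """One observation window of a hypothetical monitored DSMS."""
    duration_s: float   # length of the window in seconds
    operational: bool   # True if requests were serviced properly


def classic_availability(intervals: Sequence[Interval]) -> float:
    """Fraction of total observed time during which the system was operational."""
    total = sum(i.duration_s for i in intervals)
    if total == 0:
        raise ValueError("no observation time")
    up = sum(i.duration_s for i in intervals if i.operational)
    return up / total


def latency_aware_availability(
    latencies_s: Sequence[Optional[float]], max_latency_s: float
) -> float:
    """Fraction of results delivered within max_latency_s (an assumed bound).

    A latency of None marks a result that was never delivered.
    """
    if not latencies_s:
        raise ValueError("no results observed")
    on_time = sum(
        1 for lat in latencies_s if lat is not None and lat <= max_latency_s
    )
    return on_time / len(latencies_s)


if __name__ == "__main__":
    windows = [Interval(90.0, True), Interval(10.0, False)]
    print(classic_availability(windows))
    print(latency_aware_availability([0.1, 0.2, 3.0, None], max_latency_s=1.0))
```

Under the second measure, a system that never stops but delivers results late would still count as partly unavailable, which is the distinction the definition draws for stream processing.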

Recommended Reading

  1. Balazinska M. Fault-tolerance and load management in a distributed stream processing system. Ph.D. thesis, Massachusetts Institute of Technology; 2006.

  2. Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Fault-tolerance in the Borealis distributed stream processing system. In: Proceedings of ACM SIGMOD international conference on management of data; 2005, p. 13–24.

  3. Brewer EA. Lessons from giant-scale services. IEEE Internet Comput. 2001;5(4):46–55.

  4. Elnozahy ENM, Alvisi L, Wang YM, Johnson DB. A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv. 2002;34(3):375–408.

  5. Gray J. Why do computers stop and what can be done about it? Technical Report 85.7, Tandem Computers; 1985.

  6. Gray J, Helland P, O’Neil P, Shasha D. The dangers of replication and a solution. In: Proceedings of ACM SIGMOD international conference on management of data; 1996, p. 173–82.

  7. Hwang JH, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S. High-availability algorithms for distributed stream processing. In: Proceedings of 21st international conference on data engineering; 2005, p. 779–90.

  8. Hwang JH, Xing Y, Çetintemel U, Zdonik S. A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of 23rd international conference on data engineering; 2007, p. 176–85.

  9. Kawell L, Beckhardt S, Halvorsen T, Ozzie R, Greif I. Replicated document management in a group communication system. In: Proceedings of ACM conference on computer-supported cooperative work; 1988.

  10. Schiper A, Toueg S. From set membership to group membership: a separation of concerns. IEEE Trans Dependable Secure Comput. 2006;3(1):2–12.

  11. Schneider FB. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput Surv. 1990;22(4):299–319.

  12. Schneider FB. What good are models and what models are good? In: Distributed systems. 2nd ed. ACM/Addison-Wesley Publishing Co.; 1993, p. 17–26.

  13. Shah MA. Flux: a mechanism for building robust, scalable dataflows. Ph.D. thesis, University of California, Berkeley; 2004.

  14. Shah M, Hellerstein J, Brewer E. Highly-available, fault-tolerant, parallel dataflows. In: Proceedings of ACM SIGMOD international conference on management of data; 2004, p. 827–38.

  15. Terry DB, Theimer M, Petersen K, Demers AJ, Spreitzer M, Hauser C. Managing update conflicts in Bayou, a weakly connected replicated storage system. In: Proceedings of 15th ACM symposium on operating system principles; 1995, p. 172–83.

Correspondence to Magdalena Balazinska.

© 2017 Springer Science+Business Media LLC

Cite this entry

Balazinska, M., Hwang, JH., Shah, M.A. (2017). Fault Tolerance and High Availability in Data Stream Management Systems. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_160-2

  • DOI: https://doi.org/10.1007/978-1-4899-7993-3_160-2

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4899-7993-3

  • Online ISBN: 978-1-4899-7993-3

  • eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering
