The Art of Side Effects

Pro Spark Streaming

Abstract

Spark Streaming applications are, by design, stateless and free of side effects: running the same application any number of times on the same input produces the same behavior and output. As in functional programming, this simplifies debugging and reasoning about program state, because input and output paths are deterministic. Side-effect-free applications have many advantages, but in distributed systems side effects cannot be avoided entirely, especially when interfacing with external systems. For this reason, Spark Streaming provides a primitive called foreachRDD, the Swiss Army knife of side effects for micro-batch processing. This chapter introduces design patterns for enabling side effects in Spark Streaming applications.
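
To make the role of foreachRDD concrete, here is a minimal PySpark sketch of one common pattern: pushing each micro-batch to an external system with one connection per partition. It is an illustration under stated assumptions, not the chapter's own code. `ConsoleConnection`, `write_partition`, `save_rdd`, and the socket source on `localhost:9999` are hypothetical names and settings chosen for this example; a real application would substitute a client for its actual sink.

```python
class ConsoleConnection:
    """Hypothetical stand-in for a client of an external system."""

    def send(self, record):
        print(record)

    def close(self):
        pass


def write_partition(records, connect=ConsoleConnection):
    """Write one partition through a single connection; return the record count."""
    # The connection is created inside the function, so it is opened on the
    # executor once per partition rather than serialized from the driver.
    connection = connect()
    count = 0
    try:
        for record in records:
            connection.send(record)
            count += 1
    finally:
        connection.close()
    return count


def save_rdd(rdd):
    """Side effect for one micro-batch: push every partition to the sink."""
    rdd.foreachPartition(write_partition)


def main():
    # Imported here so the helpers above can be exercised without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "ForeachRDDSketch")
    ssc = StreamingContext(sc, 1)  # 1-second micro-batches
    lines = ssc.socketTextStream("localhost", 9999)  # assumed text source
    lines.foreachRDD(save_rdd)  # invoked once per micro-batch
    ssc.start()
    ssc.awaitTermination()
```

Calling `main()` starts the stream. Opening the connection per partition, rather than per record or on the driver, keeps connection overhead low and avoids having to serialize the client object.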

Notes

  1. http://finance.yahoo.com/
  2. www.msn.com/en-us/money
  3. www.bloomberg.com/professional/
  4. http://financial.thomsonreuters.com/en/products/tools-applications/trading-investment-tools/eikon-trading-software.html
  5. https://github.com/brymck/finansu
  6. https://developer.yahoo.com/yql/
  7. www.datatables.org/
  8. YQL queries can also be executed via its console: https://developer.yahoo.com/yql/console/
  9. www.datatables.org/yahoo/finance/yahoo.finance.quotes.xml
  10. Typically used in Scala to return a concrete instance of a class.
  11. Marcin Kuthan, “Spark and Kafka Integration Patterns,” Allegro Tech, August 6, 2015, http://allegro.tech/2015/08/spark-kafka-integration.html
  12. https://gist.github.com/koen-dejonghe/39c10357607c698c0b04
  13. https://commons.apache.org/proper/commons-pool/
  14. Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Proceedings of OSDI ’06, 7 (USENIX Association, 2006).
  15. Download HBase (ver 1.1.2) from https://hbase.apache.org/ and run it via $HBASE_HOME/bin/start-hbase.sh. The shell can be accessed via $HBASE_HOME/bin/hbase shell. Note that the default settings constitute a test setup and should not be used in production. For details of a multinode production-grade installation, please consult the HBase documentation.
  16. org.apache.hadoop.hbase.mapreduce.TableOutputFormat.
  17. Ted Malaska, “New in Cloudera Labs: SparkOnHBase,” Cloudera, December 18, 2014, http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
  18. Ted Malaska, “Apache Spark Comes to Apache HBase with HBase-Spark Module,” Cloudera, August 13, 2015, http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
  19. Cassandra (ver 2.1.11) can be downloaded from http://cassandra.apache.org/. To start using the default single-node configuration, use $CASSANDRA_HOME/bin/cassandra start. The Cassandra CLI can be started via $CASSANDRA_HOME/bin/cassandra-cli.
  20. Data types for column values are called validators, and data types for column names are called comparators.
  21. https://github.com/datastax/spark-cassandra-connector
  22. In the CQL world, Cassandra column families are now called tables.
  23. Don’t forget to set a checkpoint directory.
  24. Download Redis (ver 3.0.5) from http://redis.io/download and build the project (make). After the build, run it with $REDIS_HOME/src/redis-server. To access the console, use $REDIS_HOME/src/redis-cli.

Copyright information

© 2016 Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). The Art of Side Effects. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_6
