Abstract
By design, Spark Streaming applications are stateless and side-effect free: running the same application on the same input any number of times produces the same behavior and output. As in functional programming, this simplifies debugging and reasoning about program state, because input and output paths are deterministic. Side-effect-free applications have many advantages, but in distributed systems side effects cannot be avoided entirely, especially when interfacing with external systems. For this reason, Spark Streaming provides a primitive called foreachRDD, the Swiss Army knife of side effects for micro-batch processing. This chapter introduces design patterns for enabling side effects in Spark Streaming applications.
Notes
- 8. YQL queries can also be executed via its console: https://developer.yahoo.com/yql/console/ .
- 10. Typically used in Scala to return a concrete instance of a class.
- 11. Marcin Kuthan, “Spark and Kafka Integration Patterns,” Allegro Tech, August 6, 2015, http://allegro.tech/2015/08/spark-kafka-integration.html .
- 14. Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Proceedings of OSDI ’06, 7 (USENIX Association, 2006).
- 15. Download HBase (ver 1.1.2) from https://hbase.apache.org/ and run it via $HBASE_HOME/bin/start-hbase.sh. The shell can be accessed via $HBASE_HOME/bin/hbase shell. Note that the default settings constitute a test setup and should not be used in production. For details of a multinode production-grade installation, please consult the HBase documentation.
- 16. org.apache.hadoop.hbase.mapreduce.TableOutputFormat.
- 17. Ted Malaska, “New in Cloudera Labs: SparkOnHBase,” Cloudera, December 18, 2014, http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ .
- 18. Ted Malaska, “Apache Spark Comes to Apache HBase with HBase-Spark Module,” Cloudera, August 13, 2015, http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/ .
- 19. Cassandra (ver 2.1.11) can be downloaded from http://cassandra.apache.org/ . To start using the default single-node configuration, use $CASSANDRA_HOME/bin/cassandra start. The Cassandra CLI can be started via $CASSANDRA_HOME/bin/cassandra-cli.
- 20. Data types for column values are called validators, and data types for column names are called comparators.
- 22. In the CQL world, Cassandra column families are now called tables.
- 23. Don’t forget to set a checkpoint directory.
- 24. Download Redis (ver 3.0.5) from http://redis.io/download and build the project (make). After the build, run it with $REDIS_HOME/src/redis-server. To access the console, use $REDIS_HOME/src/redis-cli.
Copyright information
© 2016 Zubair Nabi
About this chapter
Cite this chapter
Nabi, Z. (2016). The Art of Side Effects. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_6
DOI: https://doi.org/10.1007/978-1-4842-1479-4_6
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-1480-0
Online ISBN: 978-1-4842-1479-4
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)