The Art of Side Effects

Nabi, Zubair

doi:10.1007/978-1-4842-1479-4_6

Abstract

Spark Streaming applications are, by design, stateless and side-effect free: running the same application any number of times on the same input results in the same behavior and output. As in functional programming, this simplifies debugging and reasoning about the state of a program, because input and output paths are deterministic. Although side-effect-free applications have many advantages, side effects cannot be completely avoided in distributed systems, especially when interfacing with external systems. For this reason, Spark Streaming provides a primitive called foreachRDD, which is the Swiss Army knife of side effects for micro-batch processing. This chapter introduces design patterns for enabling side effects in Spark Streaming applications.
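As a rough illustration of the kind of pattern the abstract refers to, the sketch below simulates micro-batch side effects in plain Python, with no Spark dependency. It is not taken from the chapter. The names FakeConnection, SINK, write_partition, foreach_rdd, and foreach_partition are hypothetical stand-ins; in Spark Streaming the corresponding calls would be foreachRDD on a DStream and foreachPartition on an RDD. The point it tries to show is a commonly recommended arrangement: open one connection per partition rather than one per record.

```python
from typing import Callable, Iterable, List

SINK: List[str] = []  # hypothetical stand-in for an external system


class FakeConnection:
    """Hypothetical connection to the external system (an assumption)."""

    opened = 0  # total number of connections created so far

    def __init__(self) -> None:
        FakeConnection.opened += 1
        self.closed = False

    def send(self, record: str) -> None:
        # The side effect: a write that leaves the application.
        SINK.append(record)

    def close(self) -> None:
        self.closed = True


def write_partition(records: Iterable[str]) -> None:
    """Open one connection per partition and reuse it for every record."""
    conn = FakeConnection()
    try:
        for record in records:
            conn.send(record)
    finally:
        conn.close()  # release the connection even if a write fails


def foreach_partition(rdd: List[List[str]],
                      action: Callable[[Iterable[str]], None]) -> None:
    """Simulated RDD.foreachPartition: apply the action to each partition."""
    for partition in rdd:
        action(partition)


def foreach_rdd(batches: List[List[List[str]]],
                action: Callable[[List[List[str]]], None]) -> None:
    """Simulated DStream.foreachRDD: apply the action to each micro-batch."""
    for rdd in batches:
        action(rdd)


# Two micro-batches, each modeled as a list of partitions.
batches = [
    [["a", "b"], ["c"]],
    [["d"], ["e", "f"]],
]

foreach_rdd(batches, lambda rdd: foreach_partition(rdd, write_partition))
```

In this simulation four connections are opened for six records, one per partition. A per-record connection would cost six, and the gap widens as partitions grow.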

Notes

1. http://finance.yahoo.com/
2. www.msn.com/en-us/money
3. www.bloomberg.com/professional/
4. http://financial.thomsonreuters.com/en/products/tools-applications/trading-investment-tools/eikon-trading-software.html
5. https://github.com/brymck/finansu
6. https://developer.yahoo.com/yql/
7. www.datatables.org/
8. YQL queries can also be executed via its console: https://developer.yahoo.com/yql/console/
9. www.datatables.org/yahoo/finance/yahoo.finance.quotes.xml
10. Typically used in Scala to return a concrete instance of a class.
11. Marcin Kuthan, “Spark and Kafka Integration Patterns,” Allegro Tech, August 6, 2015, http://allegro.tech/2015/08/spark-kafka-integration.html
12. https://gist.github.com/koen-dejonghe/39c10357607c698c0b04
13. https://commons.apache.org/proper/commons-pool/
14. Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” Proceedings of OSDI ’06, 7 (USENIX Association, 2006).
15. Download HBase (ver 1.1.2) from https://hbase.apache.org/ and run it via $HBASE_HOME/bin/start-hbase.sh. The shell can be accessed via $HBASE_HOME/bin/hbase shell. Note that the default settings constitute a test setup and should not be used in production. For details of a multinode production-grade installation, please consult the HBase documentation.
16. org.apache.hadoop.hbase.mapreduce.TableOutputFormat.
17. Ted Malaska, “New in Cloudera Labs: SparkOnHBase,” Cloudera, December 18, 2014, http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
18. Ted Malaska, “Apache Spark Comes to Apache HBase with HBase-Spark Module,” Cloudera, August 13, 2015, http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
19. Cassandra (ver 2.1.11) can be downloaded from http://cassandra.apache.org/. To start using the default single-node configuration, use $CASSANDRA_HOME/bin/cassandra start. The Cassandra CLI can be started via $CASSANDRA_HOME/bin/cassandra-cli.
20. Data types for column values are called validators, and data types for column names are called comparators.
21. https://github.com/datastax/spark-cassandra-connector
22. In the CQL world, Cassandra column families are now called tables.
23. Don’t forget to set a checkpoint directory.
24. Download Redis (ver 3.0.5) from http://redis.io/download, and build the project (make). Post-build run it with $REDIS_HOME/src/redis-server. To access the console, use $REDIS_HOME/src/redis-cli.

Author information

Authors and Affiliations

Lahore, Pakistan
Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). The Art of Side Effects. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_6

DOI: https://doi.org/10.1007/978-1-4842-1479-4_6
Published: 14 June 2016
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-1480-0
Online ISBN: 978-1-4842-1479-4
eBook Packages: Professional and Applied Computing; Apress Access Books; Professional and Applied Computing (R0)
