research-article

Samza: stateful scalable stream processing at LinkedIn

Authors:
Shadi A. Noghabi, University of Illinois at Urbana-Champaign
Kartik Paramasivam, LinkedIn Corp
Yi Pan, LinkedIn Corp
Navina Ramesh, LinkedIn Corp
Jon Bringhurst, LinkedIn Corp
Indranil Gupta, University of Illinois at Urbana-Champaign
Roy H. Campbell, University of Illinois at Urbana-Champaign

Proceedings of the VLDB Endowment, Volume 10, Issue 12, pp. 1634–1645. https://doi.org/10.14778/3137765.3137770

Published: 01 August 2017

Abstract

Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza uses partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affinity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g., HDFS), without having to change the application code. This is unlike the popular Lambda-based architectures, which necessitate maintaining separate code bases for the batch and stream processing paths.

Samza is currently in use at LinkedIn by hundreds of production applications with more than 10,000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Netflix, and TripAdvisor). Our experiments show that Samza: a) handles state efficiently, improving latency and throughput by more than 100X compared to using remote storage; b) provides recovery time independent of state size; c) scales performance linearly with the number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffic.
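
The idea of partitioned local state backed by a background changelog can be illustrated with a small sketch. This is not Samza's API: the names `Task`, `partition_for`, and `run` are hypothetical, a Python dict stands in for the local store, and an in-memory list stands in for the durable changelog. Each task owns one partition, serves reads and writes from its local state, batches its writes to the changelog, and rebuilds state after a failure by replaying that changelog. Host Affinity is not modeled here.

```python
from typing import Dict, List, Tuple

ChangelogEntry = Tuple[str, int]  # (key, latest value written for that key)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (byte sum, illustration only)."""
    return sum(key.encode("utf-8")) % num_partitions


class Task:
    """Hypothetical task that owns one input partition and its local state."""

    def __init__(self, changelog: List[ChangelogEntry]) -> None:
        self.store: Dict[str, int] = {}          # local state, no remote calls
        self.changelog = changelog               # stand-in for a durable log
        self.pending: List[ChangelogEntry] = []  # writes not yet in the changelog

    def process(self, key: str) -> None:
        """Count one occurrence of key, touching only local state."""
        value = self.store.get(key, 0) + 1
        self.store[key] = value
        self.pending.append((key, value))

    def flush(self) -> None:
        """Append buffered writes to the changelog in one batch."""
        self.changelog.extend(self.pending)
        self.pending.clear()

    @classmethod
    def restore(cls, changelog: List[ChangelogEntry]) -> "Task":
        """Rebuild local state on a fresh task by replaying the changelog."""
        task = cls(changelog)
        for key, value in changelog:
            task.store[key] = value  # later entries overwrite earlier ones
        return task


def run(events: List[str], num_partitions: int) -> List[Task]:
    """Route each event to the task owning its partition, then flush all tasks."""
    changelogs: List[List[ChangelogEntry]] = [[] for _ in range(num_partitions)]
    tasks = [Task(log) for log in changelogs]
    for key in events:
        tasks[partition_for(key, num_partitions)].process(key)
    for task in tasks:
        task.flush()
    return tasks
```

In this sketch, calling `Task.restore(task.changelog)` after `run(...)` yields a task whose `store` equals the original one, while writes that were processed but never flushed are absent from the restored state. A real system would also have to decide how input offsets are checkpointed relative to changelog flushes, which the sketch leaves out.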

Published in
Proceedings of the VLDB Endowment, Volume 10, Issue 12
August 2017
427 pages
ISSN: 2150-8097
Editors: Peter Boncz (CWI), Ken Salem (University of Waterloo)
Publisher: VLDB Endowment