Abstract
We conjecture that meaningful analysis of large-scale provenance can be preserved by analyzing provenance data in limited memory while the data is still in motion; that is, the provenance need not be fully resident before analysis can occur. As a proof of concept, this paper defines a stream model for reasoning about Big Data provenance while it is in motion. We propose a novel streaming algorithm for the backward provenance query and apply it to live provenance captured from agent-based simulations. Performance tests demonstrate high throughput, low latency, and good scalability in a distributed stream processing framework built on Apache Kafka and Spark Streaming.
Acknowledgment
This work is funded in part by the National Science Foundation under award number 1360463.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Chen, P., Evans, T., Plale, B. (2016). Analysis of Memory Constrained Live Provenance. In: Mattoso, M., Glavic, B. (eds) Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672. Springer, Cham. https://doi.org/10.1007/978-3-319-40593-3_4
DOI: https://doi.org/10.1007/978-3-319-40593-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40592-6
Online ISBN: 978-3-319-40593-3
eBook Packages: Computer Science, Computer Science (R0)