Analysis of Memory Constrained Live Provenance

  • Conference paper
Provenance and Annotation of Data and Processes (IPAW 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9672)

Abstract

We conjecture that meaningful analysis of large-scale provenance can be preserved when provenance data is analyzed in limited memory while still in motion; that is, the provenance need not be fully resident before analysis can occur. As a proof of concept, this paper defines a stream model for reasoning about Big Data provenance in motion. We propose a novel streaming algorithm for the backward provenance query and apply it to live provenance captured from agent-based simulations. Performance tests demonstrate high throughput, low latency, and good scalability in a distributed stream processing framework built on Apache Kafka and Spark Streaming.
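
To make the idea of a memory-constrained backward provenance query concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm proposed in the paper, whose details are not given in this abstract. It assumes that dependency edges arrive as (effect, cause) pairs in causal order, and it bounds memory by evicting the least recently touched node. The class name `BackwardProvenance` and the parameter `max_nodes` are hypothetical names introduced for illustration only.

```python
from collections import OrderedDict


class BackwardProvenance:
    """Illustrative only: answers backward provenance queries over an edge stream.

    Assumptions (not from the paper): edges arrive as (effect, cause) pairs in
    causal order, and at most `max_nodes` nodes are kept in memory at once.
    """

    def __init__(self, max_nodes):
        self.max_nodes = max_nodes
        # node -> set of known ancestors, ordered from least to most recently touched
        self.ancestors = OrderedDict()

    def _touch(self, node):
        # Create the entry if needed and mark the node as most recently used.
        if node not in self.ancestors:
            self.ancestors[node] = set()
        self.ancestors.move_to_end(node)

    def add_edge(self, effect, cause):
        """Consume one dependency edge: `effect` was derived from `cause`."""
        self._touch(cause)
        self._touch(effect)
        # The effect inherits the cause and everything the cause depends on.
        self.ancestors[effect] |= {cause} | self.ancestors[cause]
        # Enforce the memory bound by evicting least recently touched nodes.
        while len(self.ancestors) > self.max_nodes:
            self.ancestors.popitem(last=False)

    def query(self, node):
        """Return the known ancestors of `node` (empty if unseen or evicted)."""
        return set(self.ancestors.get(node, ()))


bp = BackwardProvenance(max_nodes=3)
for effect, cause in [("b", "a"), ("c", "b"), ("d", "c")]:
    bp.add_edge(effect, cause)
print(sorted(bp.query("d")))
```

In this sketch the ancestor set of an evicted node survives inside its descendants, so a query on a recent node can still be answered after older nodes have left memory. The trade-off is that an edge whose cause has already been evicted will miss that cause's ancestors, which is the kind of accuracy-versus-memory question a stream model has to address.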

Acknowledgment

This work is funded in part by the National Science Foundation under award number 1360463.

Author information

Corresponding author

Correspondence to Peng Chen.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Chen, P., Evans, T., Plale, B. (2016). Analysis of Memory Constrained Live Provenance. In: Mattoso, M., Glavic, B. (eds) Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672. Springer, Cham. https://doi.org/10.1007/978-3-319-40593-3_4

  • DOI: https://doi.org/10.1007/978-3-319-40593-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-40592-6

  • Online ISBN: 978-3-319-40593-3

  • eBook Packages: Computer Science, Computer Science (R0)
