DOI: 10.1145/3529372.3530936
research-article
Open Access

StreamingHub: interactive stream analysis workflows

Published: 20 June 2022

ABSTRACT

Reusable data/code and reproducible analyses are foundational to quality research. This aspect, however, is often overlooked when designing interactive stream analysis workflows for time-series data (e.g., eye-tracking data). A mechanism to transmit informative metadata alongside data may allow such workflows to intelligently consume data, propagate metadata to downstream tasks, and thereby auto-generate reusable, reproducible analytic outputs with zero supervision. Moreover, a visual programming interface to design, develop, and execute such workflows may allow rapid prototyping for interdisciplinary research. Capitalizing on these ideas, we propose StreamingHub, a framework to build metadata-propagating, interactive stream analysis workflows using visual programming. We conduct two case studies to evaluate the generalizability of our framework. Simultaneously, we use two heuristics to evaluate the computational fluidity and data growth of the case-study workflows. Results show that our framework generalizes to multiple tasks with minimal performance overhead.


Published in

JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
June 2022
392 pages
ISBN: 9781450393454
DOI: 10.1145/3529372

General Chairs: Akiko Aizawa, Thomas Mandl, Zeljko Carevic
Program Chairs: Annika Hinze, Philipp Mayr, Philipp Schaer

Copyright © 2022 Owner/Author

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

JCDL '22 paper acceptance rate: 35 of 132 submissions (27%)
Overall acceptance rate: 415 of 1,482 submissions (28%)
