ABSTRACT
Reusable data and code, together with reproducible analyses, are foundational to quality research. This aspect, however, is often overlooked when designing interactive stream analysis workflows for time-series data (e.g., eye-tracking data). A mechanism to transmit informative metadata alongside data may allow such workflows to intelligently consume data, propagate metadata to downstream tasks, and thereby auto-generate reusable, reproducible analytic outputs with zero supervision. Moreover, a visual programming interface to design, develop, and execute such workflows may allow rapid prototyping for interdisciplinary research. Capitalizing on these ideas, we propose StreamingHub, a framework to build metadata-propagating, interactive stream analysis workflows using visual programming. We conduct two case studies to evaluate the generalizability of our framework. Simultaneously, we use two heuristics to evaluate the computational fluidity and data growth of each case study. Results show that our framework generalizes to multiple tasks with minimal performance overhead.
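The idea of transmitting metadata alongside data and propagating it to downstream tasks can be illustrated with a minimal Python sketch. This is not StreamingHub's API or metadata format, neither of which is described in this abstract. The `Packet` type, the `source` and `transform` helpers, the `steps` provenance field, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Packet:
    """One stream sample plus the metadata describing it (hypothetical type)."""
    value: float
    meta: Dict[str, Any] = field(default_factory=dict)


def source(samples: Iterable[float], meta: Dict[str, Any]) -> Iterator[Packet]:
    """Attach a copy of the descriptive metadata to every outgoing sample."""
    for v in samples:
        yield Packet(v, dict(meta))


def transform(
    packets: Iterable[Packet], name: str, fn: Callable[[float], float]
) -> Iterator[Packet]:
    """Apply fn to each value, carry the metadata forward, and record this step."""
    for p in packets:
        steps = list(p.meta.get("steps", [])) + [name]
        yield Packet(fn(p.value), {**p.meta, "steps": steps})


def run_example() -> List[Packet]:
    # Assumed input: two pupil-diameter samples in millimetres.
    raw = source([3.5, 2.5], {"device": "eye-tracker", "unit": "mm", "rate_hz": 60})
    centered = transform(raw, "baseline_subtract", lambda v: v - 3.0)
    clipped = transform(centered, "clip", lambda v: max(v, 0.0))
    return list(clipped)


if __name__ == "__main__":
    for packet in run_example():
        print(packet.value, packet.meta)
```

Because each stage copies the incoming metadata and appends its own name, every output sample carries both the description of its origin and the list of steps that produced it, which is the property that lets a downstream consumer interpret or reproduce the result without extra input.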
Index Terms
- StreamingHub: interactive stream analysis workflows