A Run-time System for Efficient Execution of Scientific Workflows on Distributed Environments

International Journal of Parallel Programming

Abstract

Scientific workflow systems have been introduced in response to demand from researchers in several domains of science who need to process and analyze increasingly large datasets. The design of these systems is largely based on the observation that data analysis applications can be composed as pipelines or networks of computations on data. In this work, we present a run-time support system designed to facilitate this type of computation in distributed computing environments. Our system is optimized for data-intensive workflows, in which efficient management and retrieval of data, coordination of data processing and data movement, and checkpointing of intermediate results are critical and challenging issues. Experimental evaluation of our system shows that linear speedups can be achieved for sophisticated applications implemented as networks of multiple data processing components.

Author information

Corresponding author

Correspondence to George Teodoro.

About this article

Cite this article

Teodoro, G., Tavares, T., Ferreira, R. et al. A Run-time System for Efficient Execution of Scientific Workflows on Distributed Environments. Int J Parallel Prog 36, 250–266 (2008). https://doi.org/10.1007/s10766-007-0068-8

  • DOI: https://doi.org/10.1007/s10766-007-0068-8
