research-article

Storage optimization for large-scale distributed stream-processing systems

Published: 25 February 2008

Abstract

We consider storage in an extremely large-scale distributed computer system designed for stream processing applications. In such systems, both incoming data and intermediate results may need to be stored to enable analyses at unknown future times. The quantity of data of potential use would dominate even the largest storage system. Thus, a mechanism is needed to keep the data most likely to be used. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time in a prespecified way [Douglis et al. 2004]. Storage space for data entering the system is reclaimed automatically by deleting data of the lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming stream of data presents a challenge. In this article, we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value, while simultaneously balancing the read load to the file system. The key aspects of such a scheme are quite different from those that arise in traditional file assignment problems. We further motivate this optimization problem and describe a solution, comparing its performance to other reasonable schemes via simulation experiments.
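The reclamation idea described in the abstract can be illustrated with a small sketch. This is not the article's implementation: the class names, the linear decay shape, and the single-store setting are all assumptions made for illustration, and the placement optimization across multiple file systems is not modeled. The sketch only shows a store that, when space runs short, deletes the objects whose current retention value is lowest.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StoredObject:
    name: str
    size: int                           # space used, in arbitrary units
    created: float                      # time the object entered the store
    value_at: Callable[[float], float]  # retention value as a function of age


def linear_decay(initial: float, lifetime: float) -> Callable[[float], float]:
    """Hypothetical retention value function: falls linearly to 0 over `lifetime`."""
    return lambda age: max(0.0, initial * (1.0 - age / lifetime))


class ValueBasedStore:
    """Illustrative single store that reclaims space by lowest current value."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.objects: Dict[str, StoredObject] = {}

    def used(self) -> int:
        return sum(o.size for o in self.objects.values())

    def current_value(self, obj: StoredObject, now: float) -> float:
        return obj.value_at(now - obj.created)

    def insert(self, obj: StoredObject, now: float) -> List[str]:
        """Store `obj`, first deleting lowest-valued objects if space is short.

        Returns the names of the objects that were deleted.
        """
        if obj.size > self.capacity:
            raise ValueError("object is larger than the whole store")
        evicted: List[str] = []
        free = self.capacity - self.used()
        # Consider victims in order of increasing current value.
        candidates = sorted(self.objects.values(),
                            key=lambda o: self.current_value(o, now))
        for victim in candidates:
            if free >= obj.size:
                break
            del self.objects[victim.name]
            free += victim.size
            evicted.append(victim.name)
        self.objects[obj.name] = obj
        return evicted


store = ValueBasedStore(capacity=10)
store.insert(StoredObject("a", 4, 0.0, linear_decay(10.0, 100.0)), now=0.0)
store.insert(StoredObject("b", 4, 0.0, linear_decay(5.0, 10.0)), now=0.0)
# At time 5, "b" has decayed to 2.5 while "a" is still worth 9.5,
# so "b" is the object deleted to make room for "c".
print(store.insert(StoredObject("c", 4, 5.0, linear_decay(8.0, 50.0)), now=5.0))  # ['b']
```

In this simplified version an incoming object always displaces lower-valued data, regardless of its own value. A fuller treatment would also compare the newcomer's value against its victims and, as the abstract emphasizes, choose among several file systems.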

References

  1. Abadi, D. J., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A. S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. 2005. The design of the Borealis stream processing engine. In Proceedings of the 2nd Biennial Conference on Innovative Data Systems Research (CIDR).
  2. Abd-El-Malek, M., II, W. V. C., Cranor, C., Ganger, G. R., Hendricks, J., Klosterman, A. J., Mesnier, M., Prasad, M., Salmon, B., Sambasivan, R. R., Sinnamohideen, S., Strunk, J. D., Thereska, E., Wachs, M., and Wylie, J. J. 2005. Ursa minor: Versatile cluster-based storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST).
  3. Ahuja, R., Magnanti, T., and Orlin, J. 1993. Network Flows. Prentice Hall.
  4. Alvarez, G. A., Borowsky, E., Go, S., Romer, T. H., Becker-Szendy, R., Golding, R., Merchant, A., Spasojevic, M., Veitch, A., and Wilkes, J. 2001. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Trans. Comput. Syst. 19, 4, 483--518.
  5. Amini, L., Jain, N., Sehgal, A., Silber, J., and Verscheure, O. 2006. Adaptive control of extreme-scale stream processing systems. In Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS).
  6. Bent, J., Thain, D., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Livny, M. 2004. Explicit control in the batch-aware distributed file system. In Proceedings of the ACM/USENIX Symposium on Networked System Design and Implementation (NSDI). 365--378.
  7. Bertsimas, D. and Tsitsiklis, J. 1997. Introduction to Linear Optimization. Athena Scientific.
  8. Bhagwan, R., Douglis, F., Hildrum, K., Kephart, J. O., and Walsh, W. E. 2005. Time-Varying management of data storage. In 1st Workshop on Hot Topics in System Dependability.
  9. Branson, M., Douglis, F., Fawcett, B., Liu, Z., Riabov, A., and Ye, F. 2007. Autonomic operations in cooperative stream processing systems. In Proceedings of the 2nd Workshop on Hot Topics in Autonomic Computing. To appear.
  10. Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M. J., Hellerstein, J. M., Hong, W., Krishnamurthy, S., Madden, S. R., Reiss, F., and Shah, M. A. 2003. TelegraphCQ: Continuous dataflow processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 668--668.
  11. Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. 2001. Wide-Area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), Chateau Lake Louise, Banff, Canada.
  12. Douglis, F., Branson, M., Hildrum, K., Rong, B., and Ye, F. 2006. Multi-Site cooperative data stream analysis. Oper. Syst. Rev. 40, 3, 31--37.
  13. Douglis, F., Palmer, J., Richards, E. S., Tao, D., Tetzlaff, W. H., Tracey, J. M., and Yin, J. 2004. Position: Short object lifetimes require a delete-optimized storage system. In Proceedings of the 11th ACM SIGOPS European Workshop.
  14. Dowdy, L. W. and Foster, D. V. 1982. Comparative models of the file assignment problem. ACM Comput. Surv. 14, 2, 287--313.
  15. Forrest, J. 2006. CLP - COIN-OR linear program solver. http://www.coin-or.org/Clp/index.html.
  16. Graham, R. 1969. Bounds on multiprocessor timing anomalies. SIAM J. Appl. Math. 17, 2.
  17. Hildrum, K., Douglis, F., Fleischer, L., Katta, A., Wolf, J. L., and Yu, P. S. 2006. Storage optimization for large-scale distributed stream processing systems. In Workshop on System Management Tools for Large-Scale Parallel Systems.
  18. Hunter, D. 1980. Modeling real dasd configurations. IBM Res. Rep. RC 8606.
  19. Jacques-Silva, G., Challenger, J., Degenaro, L., Giles, J., and Wagle. 2007. Towards autonomic fault recovery in system-S. In Proceedings of the 4th IEEE International Conference on Autonomic Computing. To appear.
  20. Knuth, D. E. 1973. The Art of Computer Programming, Volume 3. Addison-Wesley.
  21. Lavenberg, S., Ed. 1983. Computer Performance Modeling Handbook. Academic Press.
  22. Lee, L.-W., Scheuermann, P., and Vingralek, R. 2000. File assignment in parallel I/O systems with minimal variance of service time. IEEE Trans. Comput. 49, 2, 127--140.
  23. Lougee-Heimer, R., Barahona, F., Dietrich, B., Fasano, J. P., Forrest, J., Harder, R., Ladanyi, L., Pfender, T., Ralphs, T., Saltzman, M., and Schienberg, K. 2001. The COIN-OR initiative: Open-Source software accelerates operations research progress. ORMS Today 28, 5, 20--22.
  24. March, S. and Rho, S. 1995. Allocating data and operations to nodes in distributed database design. IEEE Trans. Knowl. Data Eng. 7, 2.
  25. Pattipati, K., Wolf, J., and Deb, S. 1992. A calculus of variations approach to file allocation problems in computer systems. In Proceedings of the ACM Sigmetrics Joint International Conference on Measurement and Modeling of Computer Systems.
  26. Perez-Davila, A. and Dowdy, L. 1984. Parameter interdependencies of file placement models in a Unix system. In Proceedings of the ACM Sigmetrics Joint International Conference on Measurement and Modeling of Computer Systems.
  27. Pietzuch, P., Ledlie, J., Shneidman, J., Roussopoulos, M., Welsh, M., and Seltzer, M. 2006. Network-Aware operator placement for stream-processing systems. In Proceedings of the 22nd International Conference on Data Engineering (ICDE).
  28. Repantis, T., Gu, X., and Kalogeraki, V. 2006. Synergy: Sharing-Aware component composition for distributed stream processing systems. In Proceedings of the ACM/IFIP/USENIX 7th International Middleware Conference, 322--341.
  29. Rowstron, A. and Druschel, P. 2001. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP). ACM Press, New York, 188--201.
  30. Stonebraker, M., Çetintemel, U., and Zdonik, S. B. 2005. The 8 requirements of real-time stream processing. SIGMOD Rec. 34, 4, 42--47.
  31. Streambase Systems. 2007. Streambase. http://www.streambase.com/.
  32. The STREAM Group. 2003. STREAM: The Stanford stream data manager. IEEE Data Eng. Bull. 26, 1.
  33. Wah, B. 1984. File placement on distributed computer systems. Comput. 17, 1, 23--32.
  34. Wolf, J. 1989. The placement optimization problem. In Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems and Performance.
  35. Zdonik, S., Stonebraker, M., Cherniack, M., Cetintemel, U., Balazinska, M., and Balakrishnan, H. 2003. The Aurora and Medusa projects. IEEE Data Eng. Bull. 26, 1.
  36. Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.


            • Published in

              ACM Transactions on Storage  Volume 3, Issue 4
              February 2008
              156 pages
              ISSN:1553-3077
              EISSN:1553-3093
              DOI:10.1145/1326542

              Copyright © 2008 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 25 February 2008
              • Revised: 1 October 2007
              • Accepted: 1 October 2007
              • Received: 1 May 2007
