Abstract
Support for distributed application management in large-scale networked environments remains in its early stages. Although a number of solutions exist for subtasks of application deployment, monitoring, and maintenance in distributed environments, few tools provide a unified framework for application management. Many existing tools address the management needs of a single type of application or service running in a specific environment, and they are not adaptable enough to be used for other applications or platforms. To address these limitations, we present the design and implementation of Plush, a fully configurable application management infrastructure designed to meet the general requirements of several different classes of distributed applications. Plush allows developers to define precisely the flow of control needed by their computations using application building blocks. Through an extensible resource management interface, Plush supports execution in a variety of environments, including both live deployment platforms and emulated clusters. Plush also uses relaxed synchronization primitives to improve fault tolerance and liveness in failure-prone environments. To show how Plush manages different classes of distributed applications, we take a closer look at specific applications and evaluate how Plush provides support for each.