ABSTRACT
We describe the design and implementation of Disco, a distributed computing platform for MapReduce-style computations on large-scale data. Disco is designed to run on clusters of commodity servers and provides a fault-tolerant scheduling and execution layer together with a distributed, replicated storage layer. Disco is implemented in Erlang and Python: Erlang implements the core functions of cluster monitoring, job management, task scheduling, and the distributed filesystem, while Python implements the standard Disco library.
Disco has been used in production at Nokia for several years to analyze tens of terabytes of data daily on a cluster of over 100 nodes. With a small but capable codebase, it provides a free, proven, and effective component of a full-fledged data analytics stack.
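For readers unfamiliar with the MapReduce style of computation, the following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to a word count. It is plain Python and does not use the Disco library; the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical and chosen only for illustration. A platform such as Disco would distribute these phases across cluster nodes and handle scheduling, storage, and fault tolerance.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce sequentially in a single process.

    Illustrative stand-in only: a real MapReduce platform runs the map and
    reduce tasks in parallel on many machines.
    """
    intermediate = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["disco runs map tasks", "disco runs reduce tasks"]))
```

Because each call to `map_words` depends only on its own input line, and each call to `reduce_counts` only on the values for one key, both phases can in principle be partitioned across machines without changing the result.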
Index Terms
- Disco: a computing platform for large-scale data analytics