Tuple MapReduce and Pangool: an associated implementation

  • Regular Paper
  • Knowledge and Information Systems

Abstract

This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, and joins. The paper also presents Pangool, an open-source framework that implements Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while maintaining Hadoop’s performance. Additionally, the paper provides pseudocode for relational joins, rollup, and the PageRank algorithm; a Pangool code example; benchmark results comparing Pangool with existing approaches; reports from industry users of Pangool; and a description of a distributed database that exploits Pangool. These results show that Tuple MapReduce can serve as a direct, better-suited replacement for the MapReduce model in current implementations, without modifying key system fundamentals.
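To make the idea of tuples as compound records concrete, the following is a minimal in-memory sketch, not Pangool's API and not the authors' implementation. It assumes that a job names a subset of tuple fields to group by and a field list to sort by, with the group-by fields as a prefix of the sort-by fields. The names `Visit`, `tuple_map_reduce`, `parse_visit`, and `summarize`, as well as the sample data, are hypothetical.

```python
from itertools import groupby
from typing import NamedTuple


class Visit(NamedTuple):
    """Hypothetical compound record emitted by the map step."""
    url: str
    date: str
    clicks: int


def tuple_map_reduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Run one map/sort/group/reduce pass in memory (illustration only)."""
    # Assumption: group-by fields must be a prefix of the sort-by fields,
    # so that sorting also brings each group together.
    if tuple(sort_by[:len(group_by)]) != tuple(group_by):
        raise ValueError("group_by must be a prefix of sort_by")

    # Map: every input record may emit zero or more tuples.
    tuples = [t for record in records for t in map_fn(record)]

    # Sort on the full sort-by field list (covers the secondary sort).
    tuples.sort(key=lambda t: tuple(getattr(t, f) for f in sort_by))

    # Group on the group-by fields only, then reduce each group.
    def group_key(t):
        return tuple(getattr(t, f) for f in group_by)

    output = []
    for key, group in groupby(tuples, key=group_key):
        output.extend(reduce_fn(key, group))
    return output


def parse_visit(line):
    """Map function: parse a 'url,date,clicks' line into a Visit tuple."""
    url, date, clicks = line.split(",")
    yield Visit(url, date, int(clicks))


def summarize(key, visits):
    """Reduce function: visits arrive sorted by date within each url."""
    visits = list(visits)
    yield (key[0], visits[0].date, sum(v.clicks for v in visits))


LINES = [
    "b.com,2013-01-02,3",
    "a.com,2013-01-05,1",
    "a.com,2013-01-01,4",
    "b.com,2013-01-01,2",
]

result = tuple_map_reduce(
    LINES, parse_visit, summarize,
    group_by=("url",), sort_by=("url", "date"),
)
print(result)
```

In this sketch the reducer sees the earliest visit first within each URL because the sort covers both fields while grouping uses only `url`. In plain MapReduce the same effect typically requires a hand-written composite key plus custom comparators.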

Author information

Corresponding author

Correspondence to Jose Luis Fernandez-Marquez.

About this article

Cite this article

Ferrera, P., De Prado, I., Palacios, E. et al. Tuple MapReduce and Pangool: an associated implementation. Knowl Inf Syst 41, 531–557 (2014). https://doi.org/10.1007/s10115-013-0705-z

  • DOI: https://doi.org/10.1007/s10115-013-0705-z
