DOI: 10.1145/2723372.2735378
research-article

Just can't get enough: Synthesizing Big Data

Published: 27 May 2015

ABSTRACT

With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. Whereas only a few years ago sales systems recorded only successful transactions, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Because the big data landscape is developing rapidly and growing ever more varied, there is a pressing need for tools for testing and benchmarking.

Vendors have few options to showcase the performance of their systems other than trivial data sets such as those of TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be used, simplistic proofs of concept have to be used instead, leaving both customers and vendors unclear about performance for the target use case. As a solution, we present an automatic approach to data synthesis from existing data sources. Our system enables fully automatic generation of large amounts of complex, realistic, synthetic data.
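To make the idea of synthesizing data from an existing source concrete, here is a minimal sketch in Python. It is not the system described in the abstract: it only profiles a single categorical column and draws synthetic values with the same relative frequencies. The function names `profile_column` and `synthesize_column` and the sample data are hypothetical, chosen for illustration.

```python
import random
from collections import Counter


def profile_column(values):
    """Summarize a column as (sorted distinct values, relative frequencies)."""
    counts = Counter(values)
    total = sum(counts.values())
    distinct = sorted(counts)
    weights = [counts[v] / total for v in distinct]
    return distinct, weights


def synthesize_column(profile, n, seed=0):
    """Draw n synthetic values that follow the profiled frequencies."""
    distinct, weights = profile
    rng = random.Random(seed)  # fixed seed keeps the output repeatable
    return rng.choices(distinct, weights=weights, k=n)


# Hypothetical source column; a real source would be a database table.
source = ["DE", "DE", "US", "FR", "US", "DE"]
profile = profile_column(source)
synthetic = synthesize_column(profile, n=1000)
```

Only the profile, not the source rows, is needed to generate output of any size. The sketch still reuses the original distinct values, so a privacy-sensitive column would need those replaced as well, for example by dictionary entries or generated strings.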


Published in

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

Copyright © 2015 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
