ABSTRACT
With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. While only a few years ago only successful transactions would be recorded in sales systems, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking.
Vendors have few options to showcase the performance of their systems other than trivial data sets like those used by TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be used, simplistic proofs of concept have to be used instead, leaving both customers and vendors unclear about the performance of the target use case. As a solution, we present an automatic approach to data synthesis from existing data sources. Our system enables fully automatic generation of large amounts of complex, realistic, synthetic data.
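The general idea of synthesizing data from an existing source can be illustrated with a minimal sketch. This is not the system described in the abstract; it only assumes a simple two-step scheme (profile each column, then sample from the profile), and the names `build_profile` and `synthesize` as well as the sample rows are hypothetical.

```python
import random
from collections import Counter


def build_profile(rows):
    """Summarize each column so the source rows are not needed afterwards.

    Hypothetical helper: numeric columns keep only their min and max,
    all other columns keep their distinct values and frequencies.
    """
    profile = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            profile[name] = ("numeric", min(values), max(values))
        else:
            counts = Counter(values)
            profile[name] = ("categorical", list(counts.keys()), list(counts.values()))
    return profile


def synthesize(profile, n, seed=42):
    """Generate n synthetic rows from a profile (assumed, simplistic model)."""
    rng = random.Random(seed)  # fixed seed makes the output repeatable
    out = []
    for _ in range(n):
        row = {}
        for name, spec in profile.items():
            if spec[0] == "numeric":
                _, low, high = spec
                # Uniform sampling is an assumption; real data is rarely uniform.
                row[name] = rng.uniform(low, high)
            else:
                _, categories, weights = spec
                # Sample categories in proportion to their observed frequency.
                row[name] = rng.choices(categories, weights=weights, k=1)[0]
        out.append(row)
    return out


# Made-up source rows, for illustration only.
source = [
    {"country": "DE", "amount": 12.5},
    {"country": "DE", "amount": 40.0},
    {"country": "CA", "amount": 7.25},
]
profile = build_profile(source)
synthetic = synthesize(profile, n=1000)
```

Because the generator only needs the profile, the output size `n` is independent of the size of the source. Note that this sketch reuses categorical values verbatim and ignores correlations between columns, so it makes no privacy or realism guarantees.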
Index Terms
- Just can't get enough: Synthesizing Big Data