
Just can't get enough: Synthesizing Big Data

Authors:
Tilmann Rabl (University of Toronto, Toronto, ON, Canada)
Manuel Danisch (bankmark, Passau, Germany)
Michael Frank (bankmark, Passau, Germany)
Sebastian Schindler (bankmark, Passau, Germany)
Hans-Arno Jacobsen (University of Toronto, Toronto, ON, Canada)

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, Pages 1457–1462. https://doi.org/10.1145/2723372.2735378

Published: 27 May 2015

ABSTRACT

With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. While only a few years ago only successful transactions were recorded in sales systems, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever-increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking.

Vendors have few options to showcase the performance of their systems other than using trivial data sets like TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be used, vendors have to fall back on simplistic proofs of concept, leaving both customers and vendors unclear about the target use-case performance. As a solution, we present an automatic approach to data synthetization from existing data sources. Our system enables fully automatic generation of large amounts of complex, realistic, synthetic data.

Index Terms

1. Information systems
  1. Data management systems
    1. Database administration
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Published in
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015, 2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372
General Chair: Timos Sellis (RMIT University, Australia)
Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data generator
dbsynth
pdgf
Qualifiers
- research-article
Acceptance Rates
SIGMOD '15 Paper Acceptance Rate: 106 of 415 submissions, 26%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%