Synthetic Data and Its Evaluation Metrics for Machine Learning

  • Conference paper
Information Systems for Intelligent Systems

Abstract

Artificial Intelligence (AI) has become the key driving force in industrial automation. Machine learning (ML) and deep learning (DL) are components of AI that rely on data for model training. Data generation has increased due to the Internet, connected devices, mobile devices, and social networking, which in turn have also given rise to cybercrime and cyber theft. To prevent these and to protect the identity of individuals in public data, governments and policymakers have put stringent privacy-preserving laws in place. The economics of data collection, the quality of data in the public domain, and data bias have made data accessibility and usage a challenge for AI/ML training, whether for research or for industrial purposes. This has forced researchers to look for alternatives. Synthetic data offers a promising solution to these data challenges. The last few years have seen many studies conducted to verify the utility and privacy-protection capability of synthetic data. However, all of these have been exploratory. This paper focuses on various methods of synthetic data generation and their validation metrics. It raises a few questions that need further study before we can conclude that synthetic data offers a universal solution for AI and ML.
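
The abstract refers to generating synthetic data and validating it for utility and privacy. The sketch below is only an illustration of that idea and is not taken from the paper: it assumes a deliberately simple generator (independent per-column Gaussians), one utility proxy (the largest gap between real and synthetic column means), and one privacy proxy (the smallest distance from any synthetic row to any real row). All function names and the toy dataset are hypothetical.

```python
import math
import random
import statistics


def fit_gaussian(column):
    """Estimate the mean and sample standard deviation of one numeric column."""
    return statistics.fmean(column), statistics.stdev(column)


def synthesize(real_rows, n, rng):
    """Draw n synthetic rows, sampling each column independently.

    Assumption: columns are treated as independent Gaussians, so any
    correlation between columns in the real data is not preserved.
    """
    columns = list(zip(*real_rows))
    params = [fit_gaussian(col) for col in columns]
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


def mean_gap(real_rows, synth_rows):
    """Utility proxy: largest absolute difference between per-column means."""
    real_means = [statistics.fmean(c) for c in zip(*real_rows)]
    synth_means = [statistics.fmean(c) for c in zip(*synth_rows)]
    return max(abs(r - s) for r, s in zip(real_means, synth_means))


def min_distance_to_real(real_rows, synth_rows):
    """Privacy proxy: smallest Euclidean distance from a synthetic row to a real row."""
    return min(math.dist(s, r) for s in synth_rows for r in real_rows)


# Hypothetical toy "real" dataset with two numeric columns.
rng = random.Random(0)
real = [(rng.gauss(50, 10), rng.gauss(170, 8)) for _ in range(200)]

synthetic = synthesize(real, 200, rng)
utility_gap = mean_gap(real, synthetic)
privacy_distance = min_distance_to_real(real, synthetic)

print(f"max mean gap: {utility_gap:.3f}")
print(f"closest synthetic-to-real distance: {privacy_distance:.3f}")
```

A small mean gap suggests the synthetic data preserves this one marginal statistic, while a closest-record distance near zero would suggest that a synthetic row nearly duplicates a real one. Neither proxy is sufficient on its own; they are shown here only to make the notions of utility and privacy metrics concrete.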

Corresponding author

Correspondence to A. Kiran.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Kiran, A., Kumar, S.S. (2023). Synthetic Data and Its Evaluation Metrics for Machine Learning. In: So-In, C., Londhe, N.D., Bhatt, N., Kitsing, M. (eds) Information Systems for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 324. Springer, Singapore. https://doi.org/10.1007/978-981-19-7447-2_43
