Synthetic Data and Its Evaluation Metrics for Machine Learning

Kiran, A.; Kumar, S. Saravana

doi:10.1007/978-981-19-7447-2_43

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 324)


Abstract

Artificial Intelligence (AI) has become a key driving force in industrial automation. Machine Learning (ML) and Deep Learning (DL) are components of AI that rely on data for model training. Data generation has increased with the Internet, connected devices, mobile devices, and social networking, which in turn have given rise to cybercrime and cyber theft. To prevent these and to protect the identity of individuals in public data, governments and policymakers have enacted stringent privacy-preserving laws. The economics of data collection, the quality of data in the public domain, and data bias have made data access and use a challenge for AI/ML training, whether for research or for industrial purposes. This has forced researchers to look for alternatives. Synthetic data offers a promising solution to these data challenges. The last few years have seen many studies conducted to verify the utility and privacy-protection capability of synthetic data. However, all of these have been exploratory. This paper focuses on various methods of synthetic data generation and their validation metrics. It raises a few questions that need further study before we can conclude that synthetic data offers a universal solution for AI and ML.



Author information

Authors and Affiliations

Department of CSE, SOET, CMR University, Bangalore, India
A. Kiran & S. Saravana Kumar


Corresponding author

Correspondence to A. Kiran.

Editor information

Editors and Affiliations

Khon Kaen University, Khon Kaen, Thailand
Chakchai So-In
National Institute of Technology, Raipur, Chhattisgarh, India
Narendra D. Londhe
Nirma University, Ahmedabad, Gujarat, India
Nityesh Bhatt
Estonian Business School, Tallinn, Estonia
Meelis Kitsing


Cite this paper

Kiran, A., Kumar, S.S. (2023). Synthetic Data and Its Evaluation Metrics for Machine Learning. In: So-In, C., Londhe, N.D., Bhatt, N., Kitsing, M. (eds) Information Systems for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 324. Springer, Singapore. https://doi.org/10.1007/978-981-19-7447-2_43


DOI: https://doi.org/10.1007/978-981-19-7447-2_43
Published: 02 March 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7446-5
Online ISBN: 978-981-19-7447-2
eBook Packages: Intelligent Technologies and Robotics (R0)
