Abstract
Artificial Intelligence (AI) has become the key driving force in industrial automation. Machine learning (ML) and deep learning (DL) are components of AI that rely on data for model training. Data generation has increased with the Internet, connected devices, mobile devices, and social networking, which in turn have given rise to cybercrime and cyber theft. To prevent these and to protect the identity of individuals in public data, governments and policymakers have introduced stringent privacy-preserving laws. The economics of data collection, the quality of data in the public domain, and data bias have made data access and usage a challenge for AI/ML training, whether for research or for industrial purposes. This has forced researchers to look for alternatives. Synthetic data offers a promising solution to these data challenges. The last few years have seen many studies conducted to verify the utility and privacy-protection capability of synthetic data; however, all of these have been exploratory. This paper focuses on various methods of synthetic data generation and their validation metrics. It raises a few questions that need further study before we can conclude that synthetic data offers a universal solution for AI and ML.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kiran, A., Kumar, S.S. (2023). Synthetic Data and Its Evaluation Metrics for Machine Learning. In: So-In, C., Londhe, N.D., Bhatt, N., Kitsing, M. (eds) Information Systems for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 324. Springer, Singapore. https://doi.org/10.1007/978-981-19-7447-2_43
DOI: https://doi.org/10.1007/978-981-19-7447-2_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7446-5
Online ISBN: 978-981-19-7447-2
eBook Packages: Intelligent Technologies and Robotics (R0)