Evidence-based medicine is useful for clinical decisions. A cornerstone of evidence-based medicine is the hierarchical system of classifying levels of evidence. In this system, randomized controlled trials are often assigned the highest level because randomization is the most reliable method to control confounding factors. However, not all randomized controlled trials are conducted properly, and their results should be scrutinized carefully [1]. Major limitations of this method are the high cost of conducting adequately powered studies and the time and frustration imposed on authors by extensive regulatory requirements, delays in approval, and unnecessary bureaucratic procedures. Another major limitation is that clinical trials involve selected patients with informed consent who are treated according to protocols that might not represent real-world practice. Possible solutions to these limitations have been described, such as registry-based randomized clinical trials [2, 3]. However, maintaining registries is also costly, and data elements must be manually abstracted.

Electronic health records are an alternative to traditional registries; however, obtaining them within a healthcare system is often cumbersome, and sharing electronic data across health systems remains extremely uncommon, particularly because of questions relating to patient privacy and data ownership [4]. In the era of digital transformation and governance, another possible solution is the use of synthetic data derivatives with the incorporation of artificial intelligence (AI) and machine learning (ML) methods.

Artificial intelligence and machine learning

AI is the reproduction of human intelligence by programs and computers that are trained to simulate human cognitive functions. ML is a category of AI that refers to computer algorithms that learn through training and the input of new data. Both techniques are promising and helpful in a variety of medical fields to improve patient care in diagnosis, management, research, and systems analysis. Orthopaedic surgery is amenable to digital transformation by AI. As the amount of patient data increases rapidly, efficiently processing and analysing all the gathered information in order to conduct research and decide on the best therapies for any given orthopaedic disease is a very challenging task [5, 6, 7]. In that setting, clinicians will continue to play a vital role in research; AI will not make clinical values redundant, but it will make them more important [8].

Synthetic data generation

Synthetic data is non-reversible, artificially created data that replicates the statistical characteristics and correlations of real-world, raw data. Synthetic healthcare data does not contain identifiable information (e.g., names and dates of birth) because it uses a statistical approach to create a brand new data set using both discrete and non-discrete variables of interest. Synthetic data protects patient privacy while preserving maximum data utility to conduct research faster, to impact operational processes positively, and ultimately to improve patient outcomes while saving costs and resources [9]. Specifically, synthetic data comes from new groups of patients that do not correspond to real patients but, at the same time, has the same statistical properties and general characteristics as the original data [10].

Synthetic data generation can be done with statistical simulation or computational derivation. Statistical simulation uses real data to generate artificial data that simulates with great accuracy the disease distribution in the population and has characteristics similar to those of the real data derived from patients. Statistical simulation is appropriate for broad descriptive analyses. The main limitation of this method is that it is not able to correlate patient comorbidities with clinical endpoints [11]. Computational derivation uses special computer algorithms to create synthetic data on demand, in real time, based on real patients’ data. The synthetic data includes a similar number of patients as the real data and also maintains the distribution and covariance structure of variables in the original data [4, 10]. The resulting synthetic data no longer contains data on individual patients but rather is a collection of observations that maintain the statistical properties of the original data. Since the synthetic data does not contain details on real patients, it can be shared more easily and quickly between researchers at different institutions. Although the synthetic patients are not real, they are not fake either: they have the same characteristics as the real patients, including a similar number of patients and a similar distribution [4, 10]. Therefore, synthetic data can be a very promising alternative to classic clinical registries that come with great costs and restrictions on data sharing, paving the path for increased data availability and exchanges between large health centers [9, 12]. The limitations of synthetic healthcare data generation include statistically significant differences in some variables and predictors between real and synthetic databases; therefore, researchers using these models may not be able to clarify the level of impact that individual predictors have in multivariable studies.
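To make the idea of maintaining "the distribution and covariance structure of variables" concrete, the following is a minimal sketch in Python. It is not the algorithm used by any of the cited platforms. It assumes, purely for illustration, that all variables are continuous and approximately jointly normal, and it uses a hypothetical cohort (age, body mass index, length of stay) that is itself simulated, because no real patient data is available here. The function names fit_gaussian and generate_synthetic are likewise hypothetical.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real cohort."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # columns are variables, rows are patients
    return mean, cov


def generate_synthetic(real, rng):
    """Draw a synthetic cohort with the same number of rows as the real one.

    Assumption: a multivariate normal model is adequate. Real generators
    also handle categorical variables and non-Gaussian distributions.
    """
    mean, cov = fit_gaussian(real)
    return rng.multivariate_normal(mean, cov, size=real.shape[0])


rng = np.random.default_rng(0)

# Hypothetical stand-in for a real cohort (illustrative values only):
# columns are age (years), BMI (kg/m^2), and length of stay (days).
true_mean = np.array([65.0, 28.0, 4.0])
true_cov = np.array([
    [100.0, 5.0, 6.0],
    [5.0, 16.0, 1.0],
    [6.0, 1.0, 4.0],
])
real = rng.multivariate_normal(true_mean, true_cov, size=5000)

synthetic = generate_synthetic(real, rng)

# Compare summary statistics of the real and synthetic cohorts.
mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
cov_gap = np.abs(np.cov(real, rowvar=False) - np.cov(synthetic, rowvar=False)).max()
print("largest difference in means:", mean_gap)
print("largest difference in covariances:", cov_gap)
```

In this sketch the synthetic rows are fresh random draws rather than copies of individual records, while the means and covariances stay close to those of the source cohort. The remaining gaps between the two sets of summary statistics illustrate, on a small scale, why some variables and predictors can differ between real and synthetic databases.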

A variety of synthetic data generation methods have been developed across a wide range of domains [11]. Anyone with access to the internet has free access to AI applications that can generate synthetic patients and research simulated with models for disease progression and standards of healthcare, and that can construct research manuscripts, abstracts, and letters to the editor. Synthetic healthcare data has the potential to speed up medical innovation. In this setting, AI methods may offer benefits to journals, publishers, readers, and patients. However, AI applications cannot be listed as authors, and AI methods must be described in detail in the Methods section [13].

At International Orthopaedics, we aim to publish quality research, and we encourage novel methods to conduct research provided that they are clearly described and explained in detail in the methods section of the respective studies. We concur with the need for large healthcare data for evidence-based medicine, and we are alert to the protection of patients’ data privacy and identifier issues. AI methods certainly cannot be listed as authors in papers [13], but synthetic patients can be used as materials in studies. We expect that in the near future, the use and analysis of synthetic data will advance research by bypassing the inherent limitations of traditional methods and reducing the barriers to data sharing. The use of artificial intelligence to generate publishable material may become common; however, not mentioning it in the materials and methods section is potential fraud that authors and editors should be aware of and avoid.