Quality of reporting and adherence to the ARRIVE guidelines 2.0 for preclinical degradable metal research in animal models of bone defect and fracture: a systematic review

Abstract
In vivo testing is crucial for evaluating the efficacy and safety of orthopedic implants. However, the translation and reproducibility of preclinical animal experiments are not always satisfactory, and reporting quality is among the essential factors that ensure appropriate delivery of information. In this study, we assessed the reporting quality of in vivo investigations that examined the use of degradable metal materials in fracture or bone defect repair. We searched scientific databases (PubMed, EMBASE, Web of Science, Cochrane Library, CNKI, WanFang, VIP and Sinomed) for in vivo investigations of fracture or bone defect repair using degradable metal materials, extracted both the epidemiological and the main characteristics of eligible studies, and assessed their reporting quality using the ARRIVE guidelines 2.0. Overall, 263 publications were selected, comprising 275 animal experiments. The overall coincidence rates of the Essential 10 (22 sub-items) and the Recommended Set (16 sub-items) were 42.0% and 41.5%, respectively. Based on our analysis, the reporting quality of published in vivo investigations examining fracture/bone defect repair with degradable metal materials was low, and key elements of experimental design, as well as other elements meant to avoid bias, were not reported transparently, accurately or comprehensively.


Introduction
Animal experimentation is an essential bridge that connects basic and clinical research, and it is important for evaluating the safety and performance of new orthopedic implants [1]. In recent years, the number of bioscience journals has increased significantly [2]. Of particular concern, however, is that numerous animal studies lack adequate reporting quality [3,4]. For instance, a study assessing the reporting quality of animal experiments examining urethroplasty revealed insufficient description of key elements of experimental design, such as how the sample size was calculated (0/6, 0%) and how the experimental animals were allocated to groups (0/6, 0%) [5]. This can seriously affect the validity, reliability and usefulness of research results, and can greatly increase the risks of translating these results into clinical practice and/or evidence-based guidelines [5]. Similar conclusions were reported in other disciplines, including animal studies in critical care [6], otorhinolaryngology [7], rheumatology [3] and sports medicine [8].
Research papers are not only an important bridge between evidence producers and evidence users, but also the main medium for learning and acquiring knowledge. The characteristics of experimental animals (such as species, strain, sex and heredity), randomization, blinding, sample-size calculation and statistical methods are critical elements of systematic reviews. Previous work found that published experiments on degradable metal materials for repairing fractures or bone defects did not report sufficient and transparent information on randomization, blinding, sample-size calculation and other important factors affecting internal validity, which may be one of the main factors behind the lack of reproducibility and the obstructed translation of animal experiments in this field [25,33].
In the present study, we employed the items of the ARRIVE guidelines 2.0 to evaluate, for the first time, the reporting quality of preclinical animal experimentation involving degradable metal materials for treating fractures/bone defects. This study aims to present the current status of reporting quality in this field, identify and summarize the deficiencies in animal experimentation reports, and promote the reporting quality, reproducibility and translation of animal experiments in this field. Our findings will also enable researchers, editors, reviewers and other relevant journal staff to gain a better understanding of how each item in the current guidelines is reported, and encourage relevant journals to incorporate the ARRIVE guidelines 2.0 into their 'guide for authors' to enhance the reporting quality of animal experimentation.

Inclusion and exclusion criteria
Inclusion criteria
(i) Population: animal studies of bone fracture or bone defect; (ii) Interventions: fracture or bone defect repair with degradable metals and their alloys, or with modified degradable metals and their alloys (composites, coatings and surface modification) [25,33]; (iii) Controlled studies, without any restriction on randomization.

Exclusion criteria
Reviews, conference papers, comments, abstracts, literature without full text and non-English/Chinese literature were excluded from the analysis.

Search strategy
We searched scientific databases, including PubMed, EMBASE, Cochrane Library, Web of Science, China National Knowledge Infrastructure (CNKI), Wanfang Data Knowledge Service Platform, Chinese Scientific Journal Database (VIP) and China Biomedical Literature Database (CBM), from their inception dates until October 2020. The retrieval method combined free words and medical subject headings (MeSH). The retrieval strategy was '(animal studies) AND (degradable metal) AND (bone fracture OR bone defect)'. Table 1 summarizes the search strategy for PubMed [25,33], and all search strategies in English and Chinese are provided in Supplementary data 1.

Literature screening
Two trained researchers (F.D. and K.H.) independently used EndNote X9 to screen and cross-check relevant papers. Disputes were resolved via discussion with a third party (B.M.). In the primary screening, titles and abstracts were reviewed according to the pre-set inclusion/exclusion criteria; after excluding irrelevant literature, the full texts were reviewed to determine eligibility for inclusion. Please refer to the Supplementary document for the full-text screening records (Supplementary data 2).

Data extraction checklist and method
Two highly trained researchers (F.D. and K.H.) extracted data, according to the pre-set full-text data extraction checklist. This included: (i) epidemiological characteristics of included studies: published journal name, 2020 journal impact factor, year of publication, country of first author and funding source; (ii) major characteristics of included studies: experimental animal disease model, species and strains of experimental animals, sample size, median and interquartile range (IQR) of experimental follow-up time, measuring method and time point of degradation and gas formation outcome indicators, types of degradable metal material and types of control groups. The Supplementary documents (Supplementary data 3-5) provide additional information on our data extraction procedure.

Standardization and transformation for data processing
Since the units used to measure follow-up time (day, week, month and so on) were inconsistent between the included studies, we adopted 'day' as the common statistical measuring unit.
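As a minimal sketch of the unit standardization described above, the following converts follow-up times to days; the conversion factors (7 days per week, 30 days per month) and the example data are our assumptions, since the paper does not state which factors were used:

```python
# Sketch of follow-up-time standardization to days (assumed factors:
# 1 week = 7 days, 1 month = 30 days; the paper does not state its factors).
from statistics import median

DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}

def to_days(value, unit):
    """Convert a follow-up time given in day/week/month to days."""
    return value * DAYS_PER_UNIT[unit]

# Hypothetical follow-up times from three studies.
follow_ups = [(90, "day"), (12, "week"), (6, "month")]
in_days = [to_days(v, u) for v, u in follow_ups]
print(in_days)          # [90, 84, 180]
print(median(in_days))  # 90
```

Once all times share the same unit, summary statistics such as the median and IQR reported in the data extraction checklist can be computed directly.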
Since the definitions of the control group were inconsistent between the included studies, we adopted a unified definition, in accordance with previous studies [25,33], to facilitate the statistics and the presentation of our results. The control groups were mainly divided into positive controls and other types of controls (Supplementary data 6), and the control groups of each study were classified upon reviewing the full text.
Since some studies included two or more animal experiments that met our inclusion criteria, we extracted the 'major characteristics of the included studies' for each experiment separately.

Reporting quality assessment
ARRIVE guidelines 2.0
ARRIVE 2.0 [19] is primarily used to improve the reporting quality of animal experimentation, and it comprises 21 items and 38 sub-items. Among them, the Essential 10 includes 10 items and 22 sub-items, which indicate the basic items that must be reported in animal studies (Supplementary data 7). It covers 10 aspects: study design, sample size, inclusion and exclusion criteria, randomization, blinding, outcome measures, statistical methods, experimental animals, experimental procedures, and results. This extensive evaluation enables reviewers and readers to accurately assess the reliability of animal experimentation results and conclusions. The Recommended Set includes 11 items and 16 sub-items, which are recommended for reporting in animal experimentation (Supplementary data 7). It covers 11 aspects: abstract, background, objectives, ethical statement, housing and husbandry, animal care and monitoring, interpretation/scientific implications, generalisability/translation, protocol registration, data access, and declaration of interests. Once the Essential 10 are consistently reported in manuscripts, items from the Recommended Set can be added to journal requirements over time until all 21 items are routinely reported in all manuscripts [10].

Methods of assessing reporting quality
Prior to the evaluation, the researchers were systematically trained, and the evaluation criteria were fully analyzed and discussed. Additionally, a pre-experiment was conducted to ensure that both parties (F.D. and K.H.) agreed on the understanding and interpretation of each item (or sub-item). The trained researchers then independently assessed the reporting quality of 50 included papers using the ARRIVE guidelines 2.0 and entered the results into an online electronic database (Tencent electronic documents). After every 10 studies evaluated, the data were discussed until both parties reached an agreement.
Two researchers (F.D. and K.H.) employed the 21 items and 38 sub-items of the ARRIVE guidelines 2.0 to evaluate the eligible literature, and judged each as 'Yes', 'Partly', 'No' or 'Unclear' based on the literature content. 'Yes' meant that the evaluated literature reported all information related to the corresponding item or sub-item in detail. 'Partly' meant that it only partially reported the information related to the corresponding item or sub-item. 'No' meant that it did not include any information related to the corresponding item or sub-item. Finally, 'Unclear' meant that the reporting quality could not be judged from the available information. Any disputes were resolved via discussion with a third researcher (B.M.).

Standardization and transformation methods for inapplicable items
For certain included studies, the following items in the ARRIVE guidelines 2.0 were not applicable to the evaluation of reporting quality: 6b. For hypothesis-testing studies, specify the primary outcome measure: this item applies to hypothesis-testing studies. Upon a complete review of the full text and research purpose of all included studies, we concluded that most did not belong to the hypothesis-testing category, but to the exploratory research category.
7. Statistical methods: in cases where the research results were qualitative in nature or no statistical analysis was required, this item was evaluated as 'not applicable'.
10b. If applicable, report the effect size with a confidence interval: in cases where the research results were qualitative data or no statistical analysis was required, this item was evaluated as 'not applicable'.
Our evaluations based on the aforementioned items are provided in detail in Supplementary data 9. Studies to which the above items were not applicable were excluded when analyzing the reporting quality of those items.
In addition, since not all items were applicable to all included studies, we employed a weighted-average formula to calculate the overall coincidence rate (P_Y), non-coincidence rate (P_N), partial coincidence rate (P_P) and unclear rate (P_U). Taking the overall coincidence rate (P_Y) of the ARRIVE Essential 10 (22 sub-items) as an example, the formula we used is as follows. Finally, it should be noted that even if a selected article included several animal experiments that met our inclusion criteria, we did not evaluate them separately, but as a whole.
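The weighted-average formula announced in the text does not survive in this copy; a sketch of its likely form, where n_{Y,i} denotes the number of studies rated 'Yes' on sub-item i and N_i the number of studies to which sub-item i was applicable, is:

```latex
% Reconstruction of the weighted-average overall coincidence rate
% (not the authors' verbatim formula): P_Y over the Essential 10's
% 22 sub-items, with n_{Y,i} = studies rated 'Yes' on sub-item i
% and N_i = studies to which sub-item i was applicable.
P_Y \;=\; \frac{\sum_{i=1}^{22} n_{Y,i}}{\sum_{i=1}^{22} N_i} \times 100\%
```

P_N, P_P and P_U would take the same form with the 'No', 'Partly' and 'Unclear' counts in the numerator, so the four rates sum to 100%.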

Statistical analysis
Data from all item evaluations were statistically analyzed in Excel 2019. Enumeration data are expressed as cases and percentages (%), and measurement data are expressed as mean ± standard deviation (x̄ ± s).

Literature search results
Overall, 7042 articles were retrieved from four English and four Chinese databases. Among them, 6797 were in English, retrieved from PubMed (n = 3085), Web of Science (n = 2107), EMBASE (n = 1567) and Cochrane Library (n = 38), whereas 245 were in Chinese, retrieved from CNKI (n = 107), Wanfang (n = 113), VIP (n = 6) and CBM (n = 19). Upon exclusion of duplicate publications (n = 1045), 5997 articles underwent preliminary screening. Following the exclusion of inconsistent research types, research objectives and intervention measures, as well as non-Chinese/English literature, a total of 413 publications entered the full-text screening process.
A total of 78.9% (217/275) of the included experiments measured the outcome index of degradation, and 45.8% (126/275) measured gas formation (see Table 3). Most experiments used micro-CT and histological methods to measure implant degradation, while for gas formation, general observation was the most commonly used method, followed by micro-CT and X-ray. In Supplementary data 4, we provide more detailed information about the measurement of each outcome index, including how each index was described and presented. For degradation, for example, this includes specific indicators such as implant retention, degradation rate, corrosion rate and weight-loss rate.
As for the types of control groups (see Table 4), degradable (117, 42.4%) and non-degradable metal materials (94, 34.1%) were the most commonly used positive controls, and the negative control group (57, 20.7%) was the most common among other types of controls.
Reporting quality assessment based on the ARRIVE guidelines 2.0
Generally, the reporting quality of preclinical animal experimentation in this field was relatively poor. As depicted in Fig. 3, the overall coincidence rate for the Essential 10 (22 sub-items) was 42.0%, and that for the ARRIVE Recommended Set (16 sub-items) was 41.5%. Please refer to the Supplementary information for more details (Supplementary data 8-10).

The reporting quality of the ARRIVE Essential 10
Study design, sample size and inclusion/exclusion criteria (see Fig. 4): all included studies (263, 100%) provided brief details of the study design. Most studies (203, 77.2%) specified the exact number of experimental units allocated to each group and the total number in each experiment. However, most studies (259, 98.5%) did not explain how the sample size was determined, and nearly half (126, 47.9%) failed to provide exclusion criteria applied after modeling.
Randomization and blinding (see Fig. 4): over half of the studies (140, 53.2%) failed to report randomization, and nearly half (119, 45.2%) reported randomization but did not describe the method. Most studies (243, 92.4%) also failed to report blinding.
Outcome measures and statistical methods (see Fig. 4): a vast majority of studies (256, 97.3%) clearly defined all assessed outcome measures. Among the included studies, eight were hypothesis-testing studies, and seven of these (87.5%) failed to specify the primary outcome measure (i.e. the outcome measure used to determine the sample size). Overall, 197 studies employed statistical methods, most of which (186, 94.4%) provided detailed information on the statistical methods used in each analysis, but only a small portion (30, 15.2%) described any methods used to assess whether the data met the assumptions of the statistical approach.
Experimental animals and procedures (see Fig. 4): most studies (166, 63.1%) described the detailed information of the experimental animals (species, strains and sub-strains, sex, age or developmental stage, weight, etc.) incompletely. Nearly half of the studies (113, 43.0%) partially described the provenance of the animals, health/immune status, genetic modification status, genotype and previous procedures, and more than half (150, 57.0%) did not report any of this information.
Figure 1. A flow chart of the study screening and selection process.* *Note that two searches were conducted during the research process. A total of 263 studies were included after the second search, and the screening flow chart records the second search. The studies included in this review are the combined result of the first and second retrievals; the first search result is in Supplementary data 2.
In general, the reporting quality of the included research on the item of experimental procedures was rather disappointing. Among the sub-items, reporting on 'when' was more comprehensive and detailed than on the others. For the sub-item 'what', a fraction of studies (9, 3.4%) cited their experimental procedures from prior investigations without specifying the extent of the citation, including the use of narcotic drugs, anti-infective drugs, medical devices and equipment, as well as the members of the surgical team; these studies therefore received an 'Unclear' evaluation.
Results (see Fig. 4): more than half of the studies (155, 58.9%) reported only some representative results, and some (96, 36.5%) were judged as 'Unclear' because they failed to provide inclusion/exclusion criteria and the number of experimental animals per group in the final analysis.

The reporting quality of the ARRIVE recommended set
Abstract, background and objectives (see Fig. 5): only a small number of studies (2, 0.8%) provided an accurate summary of the research objectives, animal species, strain and sex, key methods, main results and research conclusions. A vast majority of studies (255, 97.0%) did not fully describe the experimental animals or key methods. Most studies (250, 95.1%) failed to explain the experimental methods. In addition, only a few studies (3, 1.1%) described how the animal species and models were used to address the scientific objectives and, where appropriate, their relevance to human biology. All included studies clearly described the research question, research objectives and specific research hypotheses under investigation.
Housing, husbandry, animal care and monitoring (see Fig. 5): some studies (50, 19.0%) provided detailed information on housing and husbandry, such as the cage/housing system, lighting, temperature and humidity; 35.4% (93/263) of studies briefly described this information, and nearly half (120, 45.6%) did not describe it at all. In addition, more than half of the studies (153, 58.2%) reported animal care and monitoring measures in detail, but in some studies (88, 33.5%) the reporting was still inadequate.
Interpretation/scientific implications and generalizability/translation (see Fig. 5).

Discussion
Reporting quality is an important aspect that affects the translation of preclinical animal experiments
Compared with traditional materials, degradable metal materials have significant advantages in the field of fracture or bone defect repair [47]. Considering the advantages and disadvantages of various materials (degradation rate [48], mechanical properties [49] and biocompatibility [50]), researchers have carried out much transformation and exploration work on degradable metal materials, mainly including alloying of the degradable metals [51], coating or surface modification [52] and the production of composite materials [53]. The research results are clearly remarkable [54,55]. However, even though a large number of studies are in full swing, our previous work shows that the clinical translation, the inconsistency of conclusions and the quality of these studies are worrying [25,33]. There are many reasons for concern. One is the quality of the research methods. Degradation is an important aspect of evaluating the in vivo performance of implants in repairing fractures or bone defects [25,33], and there are various methods to measure this outcome indicator. From a design perspective, different studies using different measurement methods at different time points will undoubtedly affect the research results. For example, when evaluating degradation, descriptive general observation [52] and quantitative micro-CT (calculation of the degradation rate [56] (or corrosion rate [57]) and the residual implant [58]) will certainly lead to different research conclusions. Another extremely important aspect is the quality of the reporting. Take degradation as an example.
When measuring degradation, micro-CT (125, 45.5%) was the most widely used method. However, if the reason for choosing this measuring method is not given, or if the reporting of the residual implant [59] and the degradation rate [59] (or the corrosion rate [20]) is missing or incomplete, the reproducibility of subsequent research will be affected on the one hand, and the reliability of the research results weakened on the other. Ultimately, this will affect the translation of preclinical experiments.
[Table 3 footnotes: owing to the diversity of degradation measuring methods, measurement time points for the less common methods (CT, synchrotron radiation X-ray microscopy, high-resolution peripheral quantitative computed tomography (HR-pQCT), etc.) were not pooled; see Supplementary data 4. Many experiments observed gas formation on every day of follow-up, while many others did not state the observation time point, so measurement times for general observation were not pooled. Other methods of measuring gas formation (H2 sensors, syringe, MRI, etc.) were classified as 'Others' for convenience of statistics; see Supplementary data 4.]
[Table 4 footnotes: one study did not clearly indicate in the materials section whether the material used was pure magnesium or a magnesium alloy, and one study did not specify whether the material was pure iron or an Fe alloy. 'Others' includes composite materials and materials whose surfaces were physically modified. In five studies the control group received no degradable metal implant and it could not be determined from context whether a fracture, bone defect or sham operation was performed; one study conducted a sham operation, and three studies did not report the material type of the control group.]

Poor reporting quality affects the evaluation of the internal validity of findings
Internal validity refers to the scientific robustness of research design, implementation, analysis and reporting [60]. According to our research, the reporting quality of published animal research on the repair of bone defects/fractures using degradable metal materials was low. This may render the reliability of the research results controversial; that is, internal validity was low. This increases the risk of translating animal research results into clinical practice [10]. Sample size, randomization, blinding, statistical methods, experimental procedures, and housing and husbandry are important factors that affect the internal validity of an animal experiment.
The sample size is crucial for evaluating the validity of the statistical model and the robustness of the experimental results [10]. Sample sizes that are too small may produce inconclusive results, whereas sample sizes that are too large waste resources and raise ethical concerns. Most included studies (203, 77.2%) reported the sample size. In general, the sample sizes for small animals were large (the average sample sizes for rabbits, rats and mice were 28, 33 and 23, respectively), and those for large animals were relatively small (the average sample sizes for dogs, sheep and pigs were 12, 16 and 9, respectively). The discrepancy in sample sizes may be attributed to considerations such as cost, ethics and research purposes. However, only a few studies (4, 1.5%) explained the sample-size determination method and its rationale. Notably, studies of animal research in hepatobiliary surgery [61], peritoneal dialysis [4] and rheumatology [3] reached similar conclusions; the lack of reporting of the sample-size determination method is a common problem in animal experimentation. The description of the sample-size determination method is often based on the calculation of effect size and significance level [62]; where this is not applicable, the sample size can also be justified according to the research objectives [10]. To enhance the precision and reproducibility of an experiment as well as the reliability of the experimental results, it is necessary to conduct scientific and rigorous reasoning about the feasibility of the sample size, followed by comprehensive and careful reporting of how it was determined.
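To illustrate the kind of calculation referenced above (effect size plus significance level), the following is a minimal sketch of an a priori per-group sample-size computation for a two-group comparison using the normal approximation; the effect size, significance level and power values are illustrative assumptions, not taken from any included study:

```python
# Sketch: per-group sample size for a two-sample comparison via the
# normal approximation n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2,
# where d is the standardized effect size (Cohen's d).
# The alpha, power and d values below are illustrative assumptions.
from math import ceil
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # e.g. ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# A large assumed effect (d = 1.0) needs ~16 animals per group;
# a moderate one (d = 0.5) needs ~63.
print(sample_size_per_group(1.0))  # 16
print(sample_size_per_group(0.5))  # 63
```

Reporting such a calculation (or, for exploratory work, the reasoning that replaces it) is exactly what ARRIVE item 2b asks authors to make explicit.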
Randomization and blinding are important strategies for reducing bias in research design. Using randomization during group allocation ensures that each experimental unit has an equal probability of receiving a particular treatment [10]. This helps minimize selection bias and reduces systematic differences in the characteristics of animals allocated to different groups. Blinding helps reduce subjective bias during intervention implementation and result measurement. Among the included studies, the reporting of randomization and blinding was not encouraging. Nearly half of the studies (119, 45.2%) reported randomization, but the randomization method was either unknown or lacked sufficient detail, meaning that the conclusions may exaggerate effect sizes [63]. Similarly, a study on the reporting quality of animal models of cardiac arrest reached the same conclusion [64]. In terms of blinding, taking animal studies of bone defects treated with degradable metal materials as an example [25], the measurement of new bone formation and bone defect healing mainly depends on the researchers' observation, via imaging and histological methods, of new bone around the implant and of defect healing. If the observers were aware of the interventions received by the experimental animals prior to the analysis and/or interpretation of the results, they may have introduced subjective measurement bias when evaluating osteogenesis or defect healing between groups, which may affect the validity of the data. Therefore, to obtain objective, accurate and reliable results, particularly those that depend on subjective judgment, the effective implementation of blinding is essential to avoid subjective measurement bias.
In addition, most studies (186, 94.4%) provided detailed information on the statistical methods used in each analysis. However, a large number of studies (167, 84.8%) failed to describe the methods employed to evaluate whether the data met the assumptions of the statistical approach. The lack of tests of normality and/or homogeneity of variance of the data distribution was common [20,65]. A similar conclusion was reached in a reporting-quality study involving peritoneal dialysis [4]. Hypothesis testing is based on assumptions about the data. If a manuscript fails to describe how these assumptions were evaluated and whether the data met them, readers and peers cannot fully evaluate the applicability of the statistical methods employed, which ultimately affects the reliability of the results.
The experimental procedure is also a crucial item in the guidelines. In the included studies, detailed reporting of the experimental steps was missing. Taking the sub-item '9d. why', which asks authors to explain the experimental steps, as an example, only a few studies (55, 20.9%) explained their specific procedures or techniques. There may be multiple approaches to investigating a given research problem, so it is essential to explain why a specific procedure or technique was selected [10]. For example, for outcome measurement, some scholars [66] explained why the displacement of the two patellar fragments was estimated by measuring the total length of the patella at different time points to identify the fracture gap, instead of using radiological images. This not only provided a reference for later research, but also supported the formation of subsequent evidence.

Poor reporting quality affects the external validity of a study
External validity refers to the extent to which findings from one environment, population or species can be reliably applied to other environments, populations and species [67]. Currently, known factors threatening external validity include experimental animal characteristics (age, sex, health status, microbiota and so on), as well as housing and husbandry [67]. Certainly, external validity cannot be entirely improved because of species differences. However, it still possesses potentially modifiable characteristics, such as the representativeness of animal samples, animal husbandry, animal model characteristics and their clinical relevance [60].
In terms of animal model characteristics and clinical relevance, researchers must select animals in relation to the research objectives, compare the selected animals with human biological characteristics, fully describe the characteristics of the experimental animals and clearly state the reasons behind the animal selection in the 'Background' section. Reporting animal characteristics is equivalent to reporting standardized demographic data for human patients, which supports both the internal and external validity of the research [10]. Our research revealed that the reporting quality of the clinical relevance of animal models in this field is currently not ideal: less than half of the studies (97, 36.9%) reported the species, strain and sub-strain, sex, age or developmental stage, weight and so on of the experimental animals in detail, and further animal information (including animal origin, health/immune status, genetic modification status, genotype and previous procedures) was not fully addressed in any of the studies. Only a small portion of studies (3, 1.1%) described the reasons behind the selection of the experimental animals.
Animal models can be divided into small (mice, rats, rabbits and so on) and large animal models (dogs, goats, pigs, sheep and so on). For ethical, economic and statistical reasons, small animal models are generally used for the preliminary in vivo evaluation of biomaterials [25,33]. However, the clinical translation value of small animal models is limited; therefore, before a new bone graft material is introduced into the clinic, it must typically have been evaluated in large animal models. In this study, only a few studies employed large animals such as dogs (17, 6.2%), sheep (15, 5.5%), pigs (11, 4.0%) and goats (3, 1.1%) to evaluate the properties of degradable metal materials. This may result in differences between the repair effects of degradable metals observed in preclinical animal experimentation and those observed in clinical trials. In addition to species, the sex and age of the animals also have a considerable impact on the external validity of the experiment. Therefore, researchers must thoroughly consider the factors affecting the experimental results, report the characteristics of the experimental animals comprehensively and explain the reasoning behind the selection of animals in the 'Background' section.
Animal housing and husbandry are another important element affecting external validity. Taking the feeding of rats as an example [68], one of the most common pieces of incomplete information in our selected articles was the description of animal feeding, such as 'standard rat chow and water were provided ad libitum'. Such an imprecise description may compromise external validity, because differences in feeding may affect the results.
Poor reporting quality affects the reproducibility of animal experimentation
'Reproducibility' encompasses three aspects: reanalysis of existing datasets ('reproducibility of analysis'); collection of a similar quantity of new data as in the first experiment ('reproducibility of experimental results'); and deliberate alteration of the experimental conditions or analytical methods to determine whether the same conclusion is reached ('robustness') [69]. Preclinical research, particularly research using animal models, is considered the area most vulnerable to reproducibility challenges [9]. A complex set of factors underlies this lack of reproducibility, among them the lack of transparent and accurate reporting.
A study comparing the therapeutic effects of treatments for peri-implant infection in animals and humans found that the effects reported in animal studies were far greater than those reported in human studies, and that there was a lack of homogeneity in research design and data analysis between the two [70]. Missing information on measurement methods, results and so on may be one reason behind the discrepancy between the conclusions of animal and human studies. This is corroborated in other disciplines, namely rheumatology [3], peritoneal dialysis [4] and otorhinolaryngology [7]. In the field of animal experiments involving fracture/bone defect repair using degradable metal materials, the low reproducibility of animal experiments and the inconsistent conclusions of existing studies are of major concern [25,33]. Our research also revealed that the reporting quality of animal experiments in this field was unsatisfactory. Important information was reported opaquely and incompletely, which can introduce potential bias when peers build on prior investigations. Eventually, this can lead to low-quality follow-up reports, reduced reliability of research results, inconsistency with previous studies and reduced reproducibility. For example, during our evaluation we observed that the experimental procedures of certain studies [71] were quoted from previous publications whose own basic elements of experimental design were reported incompletely and opaquely, so the reliability of the experimental results and conclusions obtained from this type of research is somewhat questionable.
In short, the reporting of each item typically affects the translation from animal experiments to clinical practice, on one or several levels at once. Faced with insufficient and opaque reporting, readers cannot tell whether the incompleteness reflects a sound method that was poorly reported or a flawed method [64]. The design, implementation and analysis of scientific research are all presented through the 'report'. Therefore, scientific, consistent and transparent reporting is necessary to improve internal/external validity and reproducibility, and thereby enhance the translation of preclinical research.
Hence, we believe that, although the ARRIVE guidelines 2.0 provide 'common' standardized guidance for reporting preclinical animal experiments, there is still a lack of field-specific ('personalized') guidance for preclinical animal experiments involving degradable metal materials for the repair of fractures/bone defects. This may restrict the application of the guidelines in this field. For example, a surface coating on degradable metal materials can markedly improve corrosion resistance [72], biocompatibility [73] and the ability to stimulate new bone formation [74]. However, crucial elements regarding the coating, such as corrosion rate, surface chemistry, adhesion, coating morphology and controllability of degradation, were not fully discussed in numerous studies [75]. In addition, the design of animal experiments in materials science must be guided by the objectives of the experiment, including the selection of experimental animals (e.g. large [76] versus small animals [77]), the control group setting (e.g. the choice of material for the positive control group), the determination of follow-up time (in our analyzed studies, the shortest experimental period was 0 days [78] and the longest was 504 days [79]), the experimental outcomes (effectiveness [31], safety indices [80] or both [65]) and the measurement methods for those outcomes (e.g. in vivo measurement of hydrogen production by magnesium includes naked-eye observation [81], syringe suction measurement [82], imaging observation [83] and so on). All of these must be reported in detail, and their rationality and reliability must be demonstrated in the 'Background' or 'Method' section. Therefore, we suggest supplementing, refining, expanding and extending the ARRIVE guidelines 2.0 for the field of materials science in the future.
To ensure an overall improvement in reporting quality, we also believe that efforts must focus not only on the complete reporting of materials and animal experiments, but also on the reporting of the specific disease. For example, in preclinical animal experiments involving bone defect repair with degradable metal materials, researchers must ensure not only the transparent, accurate and comprehensive reporting of the degradable metal materials and the animal experiments, but also the complete reporting of the bone defect itself, such as the defect site, defect size, osteotomy method and so on [84]. The same criteria should apply to reporting in other fields of study as well.
As a guiding document that instructs researchers and publishers to report clearly and accurately the design, implementation and all results of medical research, the ARRIVE guidelines not only provide a basic reporting standard, but also put forward constructive suggestions for improving the overall quality of preclinical animal experimentation reports, such as data access and protocol registration. Our research revealed that the reporting quality of protocol registration and data access (non-compliance rates of 100% and 89.4%, respectively) and other items was low in the selected studies. Data access and protocol registration platforms are readily available [85]; however, they are not widely used. We therefore suggest that all parties strictly implement the ARRIVE guidelines and adopt measures that enhance the transparency of reporting, including data access, protocol registration and so on.

Limitations of this study
This study has several limitations. First, the purpose of this study was to assess the reporting quality of animal experiments in the field of materials science; we did not evaluate the methodological or overall quality of the selected studies [3,5,6]. Second, the ARRIVE guidelines evaluation process is subjective. Hence, even though we conducted a pilot exercise before commencing our independent assessments, subjective bias may not have been completely avoided. Third, we only included articles in Chinese or English, so our conclusions may not be applicable to research published in other languages.

Conclusion
Our analyses suggest that the reporting quality of published animal experiments involving fracture/bone defect repair using degradable metal materials is rather disappointing. There was a marked lack of transparent, accurate and comprehensive reporting on key elements of experimental design (such as sample size and its calculation, randomization, blinding and statistical methods) and on other elements intended to avoid bias (such as declaration of interest, funding source, data access and protocol registration). Given these findings, we suggest further promoting the use of the ARRIVE guidelines 2.0 in professional journals, encouraging researchers to adhere to them strictly, appropriately supplementing the contents of the ARRIVE guideline items in manuscript review and instructions for authors [3,7,86], and drawing lessons from the worldwide promotion of the CONSORT statement: the ARRIVE guidelines can likewise be promoted through expert lectures, articles and introductions in high-impact journals [7,12,18]. Meanwhile, it is necessary to encourage the provision of raw data online to support papers [9] and to launch registration programs for preclinical animal experiments, so as to improve the transparency and reproducibility of bioscience research and promote the welfare of research animals.

Supplementary data
Supplementary data are available at REGBIO online.

Conflicts of interest statement
None declared.