Validating Grading of Aesthetic Outcomes of Web Space Reconstruction for Finger Syndactyly: Crowdsourcing Public Perceptions Using Amazon Mechanical Turk

Abstract
Background: A recent study attempted to analyze the aesthetic outcomes of syndactyly web space reconstruction utilizing dorsal pentagonal advancement flaps and dorsal rectangular flaps with skin grafting. The study utilized a categorical grading system for evaluating the aesthetic outcomes of reconstruction to be used in conjunction with a visual analog scale (VAS), which has yet to be validated in the assessment of aesthetic outcomes following web space reconstruction.
Objectives: To utilize crowdsourced public perceptions to validate the grading of aesthetic outcomes in web space reconstruction for finger syndactyly.
Methods: A prospective study was conducted of random volunteers recruited through an internet crowdsourcing service to gain responses for a survey to analyze patient opinions toward the aesthetic outcomes of web space reconstruction. Outcomes were graded based on descriptions of the appearance, color, matte, and distortion of the reconstruction.
Results: The excellent dorsal flap demonstrated a mean VAS score of 6.66 (95% confidence interval [CI] = 6.45-6.87), and the very good, good, and poor dorsal flaps had mean VAS scores of 5.94 (95% CI = 5.73-6.15), 4.98 (95% CI = 4.77-5.19), and 3.55 (95% CI = 3.31-3.79), respectively. The odds ratio for receiving an excellent rating was 4.21 (95% CI = 3.04-5.82) for the excellent dorsal flap, with P < 0.0001.
Conclusions: This study confirms and validates the assessment of aesthetic outcomes of web space reconstruction by the Yuan Grading Scale. This evidence may guide future practice such that recommendations can be made to align with the aesthetic preferences of the patient.

Recently, a study by Yuan et al 7 attempted to analyze the aesthetic outcomes of syndactyly web space reconstruction utilizing dorsal pentagonal advancement flaps and dorsal rectangular flaps with skin grafting. Within their study, the authors utilized a categorical grading system for evaluating the aesthetic outcomes of reconstruction to be used in conjunction with a visual analog scale (VAS).[10] However, this grading system has yet to be validated in the assessment of aesthetic outcomes following web space reconstruction.
The purpose of the current study was to utilize crowdsourced public perceptions to validate the grading of aesthetic outcomes in web space reconstruction for finger syndactyly. We sought to utilize Amazon Mechanical Turk (MTurk) to acquire a crowdsourced assessment of aesthetic outcomes utilizing global assessment and individual components of the categorical grading system. As there is no current grading system that has been universally utilized in the assessment of aesthetic outcomes following web space reconstruction, we aimed to use objective crowdsourced evaluators to assess validity, reliability, and feasibility of the VAS and a categorical grading system.

METHODS
Several studies have demonstrated that the worker population is highly representative of the US internet population. 13,14 This technique has previously been reliably utilized in the plastic surgical literature. 15,16 Workers are provided with a level of compensation and estimated time of completion and are screened by Amazon for quality responses. We excluded workers with lower than a 5-star worker rating (the maximum possible score) from participating in the survey. This study was exempt from Institutional Review Board approval, given that it utilized deidentified survey data; however, informed consent was provided by the workers through their contract with Amazon MTurk.
MTurk workers are required to be above the age of 18 and registered through the Amazon service platform, which prevents individuals from submitting multiple survey responses. Surveys were open to 200 people at a time for approximately 24 hours (repeated 5 times), and workers were paid $0.05 per unique response. This allowed us to screen for quality, completeness, and duplicate users before proceeding to collect more data. The survey was created by authors C.K.M. and O.S., with permission provided to include images from the study of Yuan et al 7 (Supplemental Appendix). The images included in the survey represent the outcomes of both simple and complex web space reconstruction, with a primary focus on the aesthetic qualities (Supplemental Figures 1-4).

Screening Questions
Although MTurk requires that registered volunteers be above the age of 18, individuals may not be completely truthful when creating their account. To ensure that all surveyed participants were adults, the first question of the survey asked the participants to reenter their age. Any response below the age of 18 immediately disqualified the worker. No other screening questions were administered, in order to maintain a diverse representation of the general US population.

Attention Check Question
To ensure that survey participants were paying close attention to each question and scenario, and that the generated data were a valid representation of patient opinions, the following attention check question was included approximately halfway through the survey: "You opt to undergo a novel surgical procedure that may completely heal your injury with minimal postoperative pain. You will answer 72 exactly to this question regardless of how you feel about this scenario. There is a high chance the surgery will work, but if it does not, you will require much more extensive surgery and will have limited wrist function." Respondents who entered anything other than "72" were directed to the end of the survey and were excluded from this study. Those who were excluded were prevented from ever taking this survey again.

Preference Questions
Crowdsourcing was utilized to gain responses for a survey to analyze patient opinions toward the aesthetic outcomes of web space reconstruction. Outcomes were graded based on the findings from Yuan et al 7 and included excellent, very good, good, and poor. These outcomes were judged based on 4 categories: a description of the appearance, color, matte, and distortion. The survey also included an overall utility score (vertical VAS scale) ranging from 0 to 100, with 0 representing an indistinguishable finger and 100 representing a perfect reconstruction. These utility scores represented interval survey data and were then analyzed across both treatment options and all patient-reported demographics.

Data Analysis
Data from the survey were pooled and assessed using Microsoft Excel 2016 (Redmond, WA). Statistics were performed using Stata (College Station, TX) with continuous data evaluated using 2-tailed 2-sample unequal variances t-tests (significance at alpha = 0.05).
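For readers who want to see what a 2-sample unequal variances (Welch) t-test computes, the following is a minimal Python sketch using only the standard library. It is an illustration, not the Stata procedure used in this study. The function names `welch_t_test` and `two_sided_p_normal` are hypothetical, and the P-value uses a normal approximation that is only reasonable when the degrees of freedom are large.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(x, y):
    """Return (t, df) for a 2-sample unequal variances (Welch) t-test.

    x and y are lists of raw scores. The degrees of freedom use the
    Welch-Satterthwaite approximation.
    """
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    se_sq = v1 / n1 + v2 / n2          # squared standard error of the difference
    t = (mean(x) - mean(y)) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def two_sided_p_normal(t):
    """Two-tailed P-value from a normal approximation to the t distribution.

    Assumption: df is large enough for the normal approximation to hold;
    an exact P-value requires the t distribution (not in the stdlib).
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not study data.
    group_a = [1, 2, 3, 4, 5]
    group_b = [2, 4, 6, 8, 10]
    t_stat, dof = welch_t_test(group_a, group_b)
    print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

With the large per-image sample in this survey, the normal approximation and the exact t distribution would give very similar P-values; for small samples, a statistics package should be used instead.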

Patient Characteristics
A total of 590 MTurk participants were interested in the survey. However, 150 (25%) of these were excluded due to either failing to meet inclusion criteria (1) or failing to fully complete the survey (149). Therefore, the 440 participants who met the inclusion criteria (properly answered screening and attention check questions) and completed the survey were included in this study. This screening methodology ensures that data are derived from those participants who were fully attentive to the survey materials.
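The exclusion logic described above (age screen, attention check, completeness, and duplicate workers) can be sketched as a simple filter. This is a hypothetical illustration: the `Response` record layout and the `screen` function are assumptions made for the example and do not reflect the actual survey export or the authors' workflow.

```python
from dataclasses import dataclass
from typing import Optional

MIN_AGE = 18           # screening threshold stated in the text
ATTENTION_ANSWER = 72  # required answer to the attention check question


@dataclass
class Response:
    """Hypothetical record layout for one survey submission."""
    worker_id: str
    age: int
    attention_answer: Optional[int]  # None if the question was never reached
    completed: bool


def screen(responses):
    """Split responses into included records and exclusion counts."""
    counts = {"included": 0, "failed_criteria": 0, "incomplete": 0, "duplicate": 0}
    seen = set()
    included = []
    for r in responses:
        if r.worker_id in seen:  # repeat submission from the same worker
            counts["duplicate"] += 1
            continue
        seen.add(r.worker_id)
        if r.age < MIN_AGE:
            counts["failed_criteria"] += 1
        elif r.attention_answer is not None and r.attention_answer != ATTENTION_ANSWER:
            counts["failed_criteria"] += 1
        elif not r.completed or r.attention_answer is None:
            counts["incomplete"] += 1
        else:
            counts["included"] += 1
            included.append(r)
    return included, counts


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        Response("w1", 30, 72, True),
        Response("w2", 17, 72, True),
        Response("w3", 25, None, False),
    ]
    kept, tally = screen(demo)
    print(tally)
```

Applied to the counts reported above, the arithmetic is 1 + 149 = 150 exclusions, leaving 590 - 150 = 440 participants, and 150/590 is approximately 25%.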

Study Demographics
The demographics of participants in this survey can be found in Table 1. The majority of our survey participants were between the ages of 25 and 34 (56.6%), with 83% of participants between the ages of 18 and 44. Females and males comprised 55% and 45% of participants, respectively, indicating both sexes were roughly equally represented in this study. By race, the majority of participants were White (56%) followed by Asian (22.5%). Nonwhite Hispanics, African Americans, American Indians, and Pacific Islanders made up the remaining 21% of participants. Finally, each annual income cohort was appropriately represented, with at least 10% of participants in each income cohort except for the >$100,000 cohort (7%).
With respect to the general population, Fisher's exact test was performed in order to compare the study population to the 2019 national census data in the United States. The age distribution of this study (Fisher's exact test value = 0.59) and the gender distribution (Fisher's exact test value = 0.67) did not differ significantly from those of the general US population at a significance threshold of P < 0.05. However, the race demographic information was statistically significantly different from that of the predominantly Caucasian general US population (Fisher's exact test value = 0.004). Furthermore, 2018 national census data demonstrated a significantly larger percentage of families with an annual household income of greater than $100,000 (38.4%, Fisher's exact test value < 0.00001).
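As a rough illustration of how Fisher's exact test produces a P-value, the sketch below handles the simplest 2 x 2 case in pure Python. This is a simplification: the demographic comparisons above involve more than two categories, which require the generalized r x c form of the test available in statistical packages. The function name `fisher_exact_2x2` is hypothetical, and the counts in the demo are made up.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Made-up counts for illustration only; not study or census data.
    print(round(fisher_exact_2x2(3, 1, 1, 3), 4))
```

A value above the chosen threshold (0.05 in this study) would be read as no significant difference between the two distributions being compared.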

Aesthetic Outcomes
Study participants were asked to evaluate 4 dorsal reconstruction flaps predesignated as excellent, very good, good, or poor for the overall grade and various aesthetic criteria (Figure 1). The participants were blinded to the predesignated grade assigned to each flap. Table 2 illustrates the categorical criteria included in the survey. For the VAS score, each participant was asked "On a scale of 0-10, where 0 represents an indistinguishable finger (does not look like a finger at all), and 10 represents a perfect looking finger/hand, how would you rate the overall appearance of this child's fingers/hand?" For overall grade, participants were asked "Using the following criteria, and based on your answers to the above questions, please provide an overall grade to this reconstructed finger. Excellent = Equal appearance to surrounding skin in color, matte, and no skin distortion; Very Good = Similar appearance to surrounding skin with mild skin distortion; Good = Shiny appearance compared to surrounding skin with moderate skin distortion; Poor = Obvious scar with severe skin distortion." For scar quality, participants were asked "Please note if the skin overlying the fingers and/or scar is 'matte' (not shiny) or 'shiny.'" For skin color match, participants were asked "Please describe the skin color of the reconstructed fingers compared to the skin color of the rest of the child's hand." For skin deformity, participants were asked "Please describe how distorted the reconstructed hand/fingers look compared to what you believe a normal hand/fingers should look like."
Overall, the participants gave the excellent dorsal flap a mean VAS score of 6.66 with a 95% confidence interval (CI) of 6.45-6.87 (Figure 2). Meanwhile, the very good, good, and poor dorsal flaps were given mean VAS scores of 5.94 (95% CI = 5.73-6.15), 4.98 (95% CI = 4.77-5.19), and 3.55 (95% CI = 3.31-3.79), respectively. One-way t-test for difference of the means between the excellent flap and each of the very good, good, and poor flaps revealed P < 0.0001 in each case. Additionally, single-factor ANOVA analysis of the VAS scores for the 4 reconstructive flaps resulted in a P-value < 0.0001.
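The two summary calculations reported here, a mean with a 95% CI and a single-factor ANOVA, can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions rather than the analysis code used in the study: `mean_ci` assumes a normal-based interval (mean plus or minus 1.96 standard errors), `anova_f` treats the groups as independent even though the same participants rated all 4 images, and both function names are hypothetical.

```python
import math
from statistics import mean, stdev

Z_95 = 1.96  # normal critical value for a 95% interval (assumed CI method)


def mean_ci(scores):
    """Return (mean, lower, upper) for a normal-based 95% CI of the mean."""
    m = mean(scores)
    half_width = Z_95 * stdev(scores) / math.sqrt(len(scores))
    return m, m - half_width, m + half_width


def anova_f(groups):
    """Return the F statistic of a single-factor (one-way) ANOVA.

    groups is a list of lists of scores, one list per rated image.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not study data.
    demo_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    print(mean_ci([1, 2, 3, 4, 5]))
    print(anova_f(demo_groups))
```

Under the same normal-based assumption, the reported interval for the excellent flap (6.45-6.87) has a half-width of 0.21, which corresponds to a standard error of roughly 0.21/1.96, or about 0.11.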
The analysis of VAS scores based on age, sex, race, and annual income is summarized in Table 3. Males and females gave the excellent dorsal flap VAS scores of 6.38 (95% CI = 6.07-6.70) and 6.89 (95% CI = 6.61-7.17), respectively, as compared with 5.75 (95% CI = 5.44-6.05) and 6.10 (95% CI = 5.81-6.39) for the very good dorsal flap. The 18-24 age range gave a higher VAS score to the excellent dorsal flap than to the very good dorsal flap, 6.53 (95% CI = 6.26-6.80) as compared with 5.78 (95% CI = 5.50-6.06). All other age cohorts gave a higher VAS score to the excellent dorsal flap than to the good dorsal flap. White participants were the only racial group that gave a higher VAS score to the excellent dorsal flap (6.99 with 95% CI = 6.71-7.27) than to the very good dorsal flap (6.17 with 95% CI = 5.89-6.45). Lastly, the $50,000-$74,999 income range also gave a higher VAS score to the excellent dorsal flap than to the very good dorsal flap.
Participants also rated each reconstructive flap on the categorical criteria of the overall grade, scar quality, skin color match, and skin deformity. Table 4 exhibits the odds ratios of giving the best rating in each category when evaluating an excellent dorsal flap as compared with a non-excellent dorsal flap (very good, good, or poor). The odds ratio for receiving an excellent rating for the overall grade was 4.21 (95% CI = 3.04-5.82) for the excellent dorsal flap, with P < 0.0001. The odds ratio for receiving a matte rating for scar quality was 5.20 (95% CI = 4.00-6.75) for the excellent dorsal flap, with P < 0.0001. The odds ratio for receiving a perfect rating for skin color match was 4.23 (95% CI = 3.19-5.62) for the excellent dorsal flap, with P < 0.0001. The odds ratio for receiving the best rating for skin deformity was 3.16 (95% CI = 2.32-4.29) for the excellent dorsal flap, with P < 0.0001.
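An odds ratio of this kind, with its 95% CI, can be computed from a 2 x 2 table of counts (best rating vs. any other rating, excellent flap vs. non-excellent flaps). The sketch below uses the standard Wald interval on the log odds ratio. It is an assumption that the reported intervals were obtained this way, the function name `odds_ratio_ci` is hypothetical, and the counts in the demo are made up because the underlying counts are not given in the text.

```python
import math

Z_95 = 1.96  # normal critical value for a 95% interval


def odds_ratio_ci(a, b, c, d):
    """Odds ratio and Wald 95% CI for the 2 x 2 table [[a, b], [c, d]].

    a: excellent flap, best rating       b: excellent flap, other rating
    c: non-excellent flap, best rating   d: non-excellent flap, other rating
    All four counts must be positive.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - Z_95 * se_log)
    upper = math.exp(log_or + Z_95 * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not study data.
    or_, lo, hi = odds_ratio_ci(20, 10, 10, 20)
    print(f"OR = {or_:.2f} (95% CI = {lo:.2f}-{hi:.2f})")
```

An interval that excludes 1, as all four reported intervals do, indicates that the best rating was significantly more likely for the excellent dorsal flap than for the other flaps.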

DISCUSSION
To date, no studies have attempted to validate the grading of aesthetic outcomes of web space reconstruction. This study demonstrates that public opinion aligns with the aesthetic evaluation of dorsal flap reconstruction of syndactyly established by Yuan et al. 7 VAS scores were consistently and statistically significantly higher for the excellent dorsal flap than for the very good, good, and poor dorsal flaps. As expected, there was a stepwise increase in VAS scores from the poor to the excellent dorsal flap. However, these increases did not extend to "perfect" VAS scores, which suggests that ratings of web space reconstruction in our sample population fall within a narrower range of the VAS. Thus, "excellent" outcomes have lower than expected VAS scores, and "poor" outcomes have higher than expected scores. Additionally, there were large odds ratios for receiving the best grade in overall grade, scar quality, skin color match, and skin deformity.
Aesthetic evaluation of web space reconstruction has remained limited to date, with much of the previous literature focused on the technical aspects of reconstruction. Lumenta et al 6 utilized the Vancouver Scar Scale and assessment of web creep to demonstrate favorable long-term outcomes for simple syndactyly reconstruction. However, the authors did not comment on the overall appearance of the web space reconstruction and did not apply any specific aesthetic grading tools outside of the Vancouver Scar Scale. Goldfarb et al 5 additionally utilized the Vancouver Scar Scale as well as patient and surgeon visual analog scores to evaluate the aesthetic outcomes of web space reconstruction. The authors demonstrated that while surgeons had high-rated appearance VAS scores, patients and families reported lower VAS scores, indicating better aesthetic outcomes. This finding suggests that surgeons may be more critical than patients and families when evaluating the aesthetic outcomes.
Recently, Yuan et al 7 utilized a modified version of the Manchester Scar Score to evaluate long-term follow-up of multiple aspects of web space reconstruction, including overall grade, description, color, matte, and distortion. Our study attempted to further this evaluation by utilizing crowdsourced opinions on Amazon MTurk in an effort to validate this aesthetic evaluation scale. Crowdsourcing has previously been demonstrated to be invaluable in assessing the perception of aesthetic outcomes. 15,16 By crowdsourcing, this study provides evidence of public perceptions of the aesthetic outcomes of dorsal reconstructive flaps in syndactyly web space reconstruction. Furthermore, this use of large sample sizes allows for validation of the modified Manchester Grading System utilized in the study of Yuan et al. 7 These findings may allow for standardization and simplification of aesthetic outcome evaluation.
While there are many strengths to this study methodology, several limitations also exist. Inherent to many surveying methodologies is the bias that exists among individuals who electively chose to take this survey. Those with a history of syndactyly or reconstructive surgery, either directly or through friends and family, may have been more likely to start and complete our survey. In an attempt to avoid this bias, the survey title and goals were not provided to study participants. Furthermore, the outcomes of both simple and complicated reconstructive cases were included in this survey; however, this supports the generalizability of the grading scale utilized. Despite these potential limitations, MTurk remains a powerful tool for surveying the general US population as an indicator of patient sentiment toward surgical treatment options.
Our study is unique in that it offers validation of an aesthetic grading system for web space reconstruction. Furthermore, it is the first study to utilize crowdsourced opinions to evaluate these aesthetic outcomes in an attempt to better characterize the grading scale utilized. We found that the classification system was feasible, reliable, and valid when evaluating the aesthetic outcomes of web space reconstruction. These findings provide surgeons with a readily available tool that can be completed by surgeons, patients, and family members to evaluate postoperative outcomes.

CONCLUSIONS
This study confirms and validates the assessment of aesthetic outcomes of web space reconstruction previously investigated. By crowdsourcing survey results, our study attempts to eliminate bias and gain a broad perspective of aesthetic outcome evaluation. This evidence may guide future practice such that surgical recommendations can be made that align with the aesthetic preferences of the patient population. Future prospective studies utilizing this grading system to compare different techniques of web space reconstruction are needed to better characterize the aesthetic outcomes.

Figure 1. Schematic of categorical criteria evaluated by the grading system. Participants were asked to evaluate an overall grade, scar quality, skin deformity, and skin color match.

Figure 2. Mean visual analog scale (VAS) scores reported with 95% confidence intervals.Mean VAS scores with 95% confidence intervals are shown for each repair used in the survey.One-way t-tests were performed for the difference in the mean between the dorsal rectangular flap and each pentagonal advancement flap (*P < 0.001).

Table 1. Demographics of All Study Participants Who Were Eligible and Completed the Survey (N = 440)

Table 2. Categorical Grading System Utilized by Yuan et al

Table 3. Mean Visual Analog Scale Scores for the "Very Good," "Good," and "Poor" Dorsal Pentagonal Flap Images Provided to Study Participants

Table 4. Categorical Grading of Dorsal Rectangular and Pentagonal Flaps