Master Turkers: An Assessment of Data Quality

Amazon’s Mechanical Turk has grown considerably in popularity in recent years, driven by recent world events and by the increased acceptance of technology in research. Because of this, it is essential that the methods associated with conducting research online be evaluated. The present study evaluated whether Amazon’s upper echelon of workers, Master Turkers, provide higher quality data than workers without that designation. This was evaluated using two validated scales that have been used extensively in research. The results showed that Master Turkers performed worse on the scales (lower reliability) than non-Master Turkers. These data highlight an issue that researchers should be aware of when using Mechanical Turk, as well as a problem that Amazon should address.


Mechanical Turk
One popular option for data collection is Amazon's Mechanical Turk (AMT; Amazon, 2018). AMT is an online work platform that allows users (requestors) to create and post tasks for others (workers or Turkers) to complete for compensation. Typically, completion of these tasks, called Human Intelligence Tasks (HITs), offers monetary compensation sent directly to the Turker's account. The amount is decided by the requestor and is often paid in U.S. cents, with Turkers' earnings typically being around 1-3 U.S. dollars per hour (Hara et al., 2019). Some Turkers also use the platform as a primary source of income, although those who do so are predominantly from India (Ross et al., 2010). Moreover, Turkers who treat the platform as work are incentivized to complete many HITs, whether to make a livable wage as full-time workers or to earn additional income as part-time workers.
AMT has been used as a platform to collect data in a variety of disciplines, and its usage has increased over time. In October 2022, a search was conducted on the terms "mechanical turk" or "AMT" across the databases Academic Search Ultimate, Medline, PsycInfo, and SocINDEX to demonstrate the multidisciplinary usage of AMT samples. The only filter applied to the search was that the article was peer-reviewed (and there were no duplicates). The earliest works found were from 2010, and the number of works found from 2010-2021 can be seen in Table 1, which shows a consistent increase over time. The use of AMT is not restricted to one academic area; it is used across a vast array of fields. Some recent examples include addictions (Mellis & Bickel, 2020), advertising (Connors et al., 2020), criminal justice (Fissel et al., 2021), geography (Kruse et al., 2021), linguistics (Ciancia & Gallo, 2021), management (Brown et al., 2021), medicine (Lee et al., 2023), pharmaceutical sciences (Lin et al., 2021), political science (Blankenship et al., 2021), psychology (Ratcliff & Hendrickson, 2021), public health (Stevens et al., 2021), and sociology (Wilbur et al., 2021). Clearly, many fields have embraced AMT as a data collection platform, as it does offer some obvious benefits.
One benefit of AMT is the sheer reach and diversity of the populations that use it, allowing for greater diversity in the participants available to researchers. Because purchasing products on Amazon is so popular, many individuals know of and use AMT by association, which leads to a large pool of participants. Earlier research described a fairly homogeneous makeup of Turkers, with most coming from the United States and India (Ross et al., 2010). More recent research still indicates a similar trend in Turkers' country of origin (Difallah et al., 2018). Surveys by Difallah et al. (2018) indicate there are as many as 100,000 Turk users, with 2,000 of them being active at any one time. Due to the ubiquity of Amazon and the convenience of AMT, large samples are easily obtainable and some cross-cultural studies can be conducted. There is also a wider variety of ages represented among Turkers than in convenience samples such as college students (Ross et al., 2010). In addition, the compensation paid to participants is fairly low compared to competitors (e.g., Prolific). Another paramount benefit of AMT for researchers is convenience. Data from individuals of diverse backgrounds and geographic locations can be collected extremely quickly using the AMT platform, depending on several factors related to the type of survey or research being conducted. Additionally, not all Turkers are the same, as there are different categories of workers on AMT.

Master Turkers
There are subsets of Turkers that may provide higher quality data and might be more sought after by researchers. Peer et al. (2014) found that the reputation of Turkers was significantly related to the quality of data. In the AMT space, reputation is measured by approval rating, which is based on requestors rating Turkers on their performance on the HITs they completed. Peer et al. (2014) reported that high reputation Turkers produced better data and that using only Turkers with high approval ratings might be a viable strategy for improving overall data quality. Amazon does have functionality for selecting Turkers who have high approval and for providing a different categorization for them. Turkers who complete many HITs can occasionally be "promoted" to an elevated status referred to as Master Turker. The Master Turker status is assigned to Turkers who complete various HITs and are consistently rated positively by requestors (Amazon, 2018). There is currently no system for applying to be a Master Turker; algorithms (not publicly available) are used to calculate some metric of the number of HITs completed and ratings, leaving the actual qualifications somewhat nebulous.
With the Master Turker status seemingly intended to rectify the problem of low-quality data, one might think that an overwhelming majority of researchers would use Master Turkers as their sample population. However, there are several reasons why one may not do this, the first being the extra fee for using Master Turkers exclusively. Amazon charges an additional 5% for using the Master Turker qualification when selecting who the requestor's HIT is shown to (Amazon, 2018). Rouse (2020) completed two studies to evaluate Amazon's (2018) claim that Master Turkers provide higher quality data and are therefore worth the additional 5% premium. The first study showed no difference between Masters and non-Masters on a personality assessment, whereas the second used a one-tailed test to determine whether Masters produced higher reliability estimates on a cognitive ability test, a hypothesis that was not supported. In fact, had a two-tailed test been used, it would have shown a significant pattern in the opposite direction. The latter finding calls into question Amazon's claim that Master Turkers should be compensated more because they provide higher quality work. The question to be addressed is whether Master Turkers provide a different quality of data.
One explanation for differences in data quality among Turkers may be a subsample of Master Turkers who complete a large number of HITs and optimize ways to complete tasks. Harms and DeSimone (2015) found that samples of these workers contribute a disproportionate amount of all HITs completed on AMT. Chandler et al. (2014) found that the top 1% of Turkers completed 11% of all HITs on the platform. This may be detrimental because those Turkers are more likely to have been previously exposed to many survey and experiment paradigms and to have become privy to them. Ford (2017) referred to speeders and cheaters, Turkers who are incentivized to maximize the amount of money they can earn by speeding or skipping through HITs in order to complete the task(s) and earn the reward while not providing accurate data. Harms and DeSimone (2015) described Superturkers as a group who spend an inordinate amount of time on AMT in an attempt to optimize their daily HIT completion rates. Thus, if participants are trying to complete a high number of HITs, the data quality issue that should be considered among Turkers is attention (Buhrmester et al., 2018).
With these issues compounding over time, Amazon may see fewer requestors. As an example, Chmielewski and Kucker (2020) examined Turker performance on tasks over time, and their results indicated there may be an AMT crisis in relation to data quality. They reported a pattern of failed response validity indicators, worse psychometric properties, and an inability to replicate well-established findings over time. Their interpretation was that data quality was decreasing over the timeframe of the study, which included four rounds of data collection. Furthermore, some journals, editors, and reviewers have rejected manuscripts on the basis of using an AMT sample regardless of the study design and outcome (e.g., Landers & Behrend, 2015; Walter et al., 2019). Concerns over how frequently Turkers are exposed to certain experimental paradigms, motivation to achieve compensation, selection bias (Landers & Behrend, 2015), and the measurement properties and characteristics of online samples like this (Walter et al., 2019) are some of the reasons why studies using populations like the Turk have been rejected. Amazon's policies and suggestions regarding requestors and Turkers clearly prioritize quantity over quality, and it will be interesting to track the trajectory of research conducted using AMT samples.
Data quality is a composite of several facets relating to the collection of data, specifically accuracy, which can be conceptualized as avoiding errors while collecting the data (Herrera & Kapur, 2007). In this study, we analyze this concept by comparing three indicators: straight lining (selecting the same answer for the entirety of a scale, which suggests that the person is likely not reporting accurately), completion time, and the Cronbach's alpha reliability of differently coded scales, in a study design similar to Rouse (2020). Rouse (2020) based this analysis design on a comparison of an in-person lab sample and an MTurk sample conducted by Johnson and Borden (2012). Data quality is analyzed here in the context of psychology survey tasks, as this is a very common use of the MTurk population.
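The straight lining indicator defined above can be illustrated with a minimal sketch. This is not the authors' screening code; the function names and the treatment of very short scales are assumptions made for illustration only.

```python
def is_straight_lined(responses):
    """Return True when every item on a scale received the same answer."""
    answers = list(responses)
    # Assumption: a scale with fewer than two items cannot be straight-lined.
    if len(answers) < 2:
        return False
    return len(set(answers)) == 1


def straight_lining_rate(respondents):
    """Proportion of respondents who straight-lined (one response list each)."""
    rows = list(respondents)
    if not rows:
        return 0.0
    flagged = sum(is_straight_lined(row) for row in rows)
    return flagged / len(rows)
```

A rule like this is applied to one scale at a time, before any reverse scoring, because the pattern of interest is the raw answer the participant selected.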
The number of studies using AMT has continued to grow over time (see Table 1), yet some more recent studies (e.g., Aguinis et al., 2021; Rouse, 2020) have identified red flags regarding data quality. The present study sought to further explore the relationship between data quality and Master Turker status to see whether Amazon's algorithm for categorizing Master Turkers is related to higher quality data. Specifically, the study investigated whether there are differences between Master Turkers and non-Master Turkers on data quality measures such as Cronbach's alpha reliability on frequently used survey instruments, completion time, and poor survey-taking behavior such as straight lining through the survey. This was examined by administering two commonly used scales: one scale had 10 items that were anchored similarly, whereas the other scale had half of its items reverse coded.

Method
Participants
Demographic data were not collected for this survey; rather, information on participants' AMT-related behaviors was of interest in describing the sample. Our sample size was calculated based on Rouse (2020), which used Bonett's (2003) recommendation for testing the significance of the difference between two reliability estimates with a one-tailed significance level of .05, a power of .80, and a range of reliability estimates from .70 to .85. We decided to oversample to further increase statistical power. Overall, 320 participants took the survey, and 309 were analyzed after removing incomplete data. Table 2 provides the relevant characteristics of the Master Turkers, while Table 3 provides data for non-Master Turkers. In summary, 50% of participants self-reported being Master Turkers and 49% reported they were not. At the time of writing, there is no option to exclude Masters from the participant pool on MTurk, only to exclude non-Masters. English was the primary language of all but one participant. Approximately half of the participants had been Turkers for less than one year. Three out of four participants reported being full-time Turkers, and over 40% reported completing more than 40 HITs per week.

Instruments
Rosenberg Self Esteem Scale
The Rosenberg Self-Esteem Scale is a 10-item scale that measures self-esteem by assessing positive and negative feelings about the self (Rosenberg, 1965). Items use a 4-point Likert scale format ranging from strongly agree to strongly disagree. Scores on the scale range from 10-40, with higher scores indicating higher self-esteem. One notable aspect of this scale is that several of the items are reverse coded (2, 5, 6, 8, 9), so that a response of strongly disagree indicates higher self-esteem. The Rosenberg Self-Esteem Scale is widely used and validated (Gray-Little et al., 1997). It was selected because of its widespread use, solid validation, and relative brevity. Many analyses have been performed on the Rosenberg Self-Esteem Scale, and they consistently report respectable psychometric properties (e.g., Schmitt & Allik, 2005; Sinclair et al., 2010).
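The scoring rule described above can be sketched as follows. The reverse-coded item numbers come from the text; the numeric coding (1 = strongly disagree through 4 = strongly agree) and the function name are assumptions for illustration, not the authors' scoring script.

```python
REVERSE_ITEMS = {2, 5, 6, 8, 9}  # item numbers listed in the text
SCALE_MIN, SCALE_MAX = 1, 4      # assumed coding: 1 = strongly disagree, 4 = strongly agree


def score_rosenberg(responses):
    """Total score (10-40) for ten raw responses given in item order."""
    if len(responses) != 10:
        raise ValueError("expected exactly 10 item responses")
    total = 0
    for item_number, value in enumerate(responses, start=1):
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"item {item_number} is outside the 1-4 range")
        if item_number in REVERSE_ITEMS:
            # Flip the response so that 1 becomes 4, 2 becomes 3, and so on.
            value = SCALE_MIN + SCALE_MAX - value
        total += value
    return total
```

Under this coding, a participant who selects the same answer for all ten items always scores 25, because the five reversed items offset the five regular ones. This is one reason a half reverse-coded scale is informative about inattentive responding.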

PANAS Scale - Positive
The 10 positive items from the Positive and Negative Affect Scale were used to determine participants' feelings of positive affect (Watson et al., 1988). The PANAS uses a 5-point Likert scale ranging from 1 (very slightly or not at all) to 5 (extremely), and total scores can range from 10-50, with higher scores indicating higher amounts of positive affect. This scale was selected for its brevity and because its 10 non-reverse-coded items offer a reasonable comparison to the 10-item self-esteem scale with reverse-coded items. Similarly, many analyses have consistently shown favorable psychometric properties of the scale across a wide range of samples (e.g., Crawford & Henry, 2004).

Procedure
A survey was created using the software Qualtrics to be distributed through Amazon's AMT platform. The survey consisted of questions related to AMT use, the two scales presented in a counterbalanced manner, and some exploratory questions intended to serve as the basis for a future study. The survey was pilot tested with 13 individuals; completion times ranged from 4 minutes 17 seconds to 34 minutes 56 seconds, with an average completion time of 9 minutes 56 seconds. With upper outliers removed, the average completion time in the pilot study was 5 minutes 31 seconds. Participants were awarded 0.12 USD upon completion of the survey because the pilot study took just over 5 minutes to complete. It was reasoned that 0.10 USD would be too little for over 5 minutes, so 0.12 USD was selected to reflect better value for the participant's time. The survey was limited to AMT users from the United States and was written in English.
Participants' completion times ranged from 48 seconds to 89 minutes 6 seconds. The average completion time for the survey was 4 minutes 16 seconds. The survey purposely did not ask any personal demographic questions and instead focused on questions related to AMT usage and behavior. The study was fielded for 8 hours, from approximately 9 am to 5 pm EST. The primary goal of this study was to compare data quality for Master Turkers and non-Master Turkers. This was primarily assessed by comparing the Cronbach's alpha values of these two groups on two scales, one reverse coded and one not. Cronbach's alphas were compared using the methodology developed for an online research environment by Diedenhofen and Musch (2016), which was based on methodology developed in Feldt et al. (1987). This method employs a chi-square test to compare Cronbach's alpha values. It differs from the typical benchmark comparison to .70 and allows two Cronbach's alphas to be compared relative to each other, taking into account the sample size and the number of items on the scale being measured.
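For readers less familiar with the statistic being compared, the following is a minimal sketch of how Cronbach's alpha is computed from item-level data using the standard variance-based formula. The function name and input layout are assumptions for illustration. The sketch does not reproduce the chi-square comparison of two alphas from Diedenhofen and Musch (2016).

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Items are assumed to be reverse-scored already where needed.
    """
    if len(item_scores) < 2:
        raise ValueError("need at least two respondents")
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    if any(len(row) != n_items for row in item_scores):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item across respondents.
    item_variances = [variance(column) for column in zip(*item_scores)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)
```

Alpha rises when items covary strongly relative to their individual variances, so responses that ignore the direction of reverse-coded items tend to pull it down.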

Results
To assess completion time, an independent samples t-test was conducted. Masters (M = 280, SD = 497) did not differ significantly from non-Masters (M = 213, SD = 207) in the average amount of time (in seconds) it took them to complete the survey, t(207.1) = 1.54, p = .126. A non-parametric test was also used to evaluate completion time, and it likewise found no significant difference in completion time based on Master status, H(1) = .576, p = .448. A chi-squared test determined there was no significant difference between Masters and non-Masters in the frequency of straight lining behavior, χ²(2, N = 306) = 8.07, p = .045. Table 4 contains data on the frequency of straight lining.
A one-way MANOVA was conducted to examine the difference in mean scores between Master and non-Master Turkers. There were two dependent variables: score on the Rosenberg Self Esteem Scale and score on the positive PANAS scale. There was a significant main effect of Master Turker status on the scores that participants obtained on the scales, F(1, 304) = 8.58, p < .001. Individual ANOVAs found a significant difference on the Rosenberg scale, F(1, 304) = 7.03, p = .008, such that Masters (M = 24.36, SD = 3.64) scored higher than non-Masters (M = 23.07, SD = 4.84), as well as on the PANAS positive, F(1, 304) = 7.99, p = .005, where again Masters (M = 39.54, SD = 6.29) scored higher than non-Masters (M = 37.22, SD = 8.00).
The Rosenberg scale demonstrated acceptable internal consistency for non-Master Turkers (α = .76) and unacceptable consistency for Master Turkers (α = .34). A chi-squared test was used to determine whether these values differ significantly, χ²(1, N = 304) = 30.05, p < .001. The PANAS scale demonstrated acceptable internal consistency for Master Turkers (α = .73) as well as for non-Masters (α = .82). A chi-squared test determined that these values differ significantly, χ²(1, N = 304) = 5.00, p = .025.
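The fractional degrees of freedom in the completion-time comparison suggest an unequal-variance (Welch) t-test, although the text does not name the variant, so this is an inference. The sketch below computes Welch's t and its degrees of freedom from summary statistics. The function name is hypothetical, and the per-group sample sizes are not reported in the text, so any sizes passed in are assumptions.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error contributed by each group.
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

As a rough check, calling `welch_t(280, 497, 155, 213, 207, 152)` with the reported means and standard deviations, and with assumed group sizes of 155 and 152 (about half of the 309 analyzed participants each), gives a t of about 1.55 on roughly 207 degrees of freedom. That is close to the reported t(207.1) = 1.54, though the exact figures depend on the true group sizes.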

Discussion
The data in this study revealed some surprising findings about the performance of Master Turkers on commonly used instruments. Master Turkers provided significantly less reliable data than the general Turker sample. This runs contrary to Amazon's (2018) claim that Master Turkers provide higher quality data. It should also be mentioned that the general Turker sample yielded reliability coefficients within the range that has been consistently reported (e.g., Crawford & Henry, 2004; Schmitt & Allik, 2005). Because of the premium associated with the use of Master Turkers and the seemingly worse data quality associated with them, the findings of this study suggest it may not be worthwhile to limit surveys to Master Turkers in studies that use instruments similar to the ones used here. In fact, the results suggest that the general Turker population provides significantly higher quality data for the two short instruments that were used in the study.
The present study was designed to intentionally compare two instruments with the same number of items that differ in whether or not they contain reverse-coded items. Cronbach's alpha values on the PANAS are frequently in the high .80s (Carvalho et al., 2013; Serafini et al., 2016; von Humboldt et al., 2017; Watson et al., 1988). Of note, the Master Turker sample (α = .73) had a significantly lower Cronbach's alpha than the non-Masters (α = .82); however, both are still in the range that is conventionally considered acceptable. For the Rosenberg scale, a study across 53 different countries found an average Cronbach's alpha of .81 (Schmitt & Allik, 2005). This is higher than both Masters (.34) and non-Masters (.76). Non-Masters, however, have an acceptable Cronbach's alpha, whereas the Masters have an alpha that is considered to be far below what is acceptable. Clearly, Master Turkers completed the scales differently than the non-Master Turkers, particularly the Rosenberg with its reverse-coded items. The low reliability for the Rosenberg stands in stark contrast to its widely used and accepted nature (Schmitt & Allik, 2005) and serves as a good indicator that the Master Turker population did not provide high quality data. Master Turkers are paid more than non-Master Turkers and have more experience, so it is important to consider factors that might be associated with their poor performance on these tasks.
There are a number of possible explanations for the low reliability observed among Master Turkers. One consideration is that many Turkers use AMT as a full-time job (Ross et al., 2010), which was also found in our study, as roughly three out of four respondents indicated they did AMT full-time. Thus, it is in a full-time Turker's best interest to complete as many HITs as possible to maximize the amount of money made in a given period of time. If a Turker is more motivated to complete a high quantity of HITs, this could justify the cheating and speeding that Ford (2017) describes as a way to achieve that quantity. To compound this issue, if the requestor is conducting research alone with a large number of participants, it can be challenging to validate and identify every individual participant and approve their response in the given timeframe. Additionally, Amazon (2018) advises requestors not to reject work often, and states that it is inappropriate to penalize a worker because of unclear instructions that the requestor provides. The reasoning provided is that Turkers are aware of how important their approval ratings are and will actively avoid requestors who are seen as harsh or unfair. With these statements officially posted in Amazon's approval guide for requestors, requestors are encouraged to be lenient, and Turkers are given an incentive to do as many HITs as possible within a given timeframe. This is problematic for several reasons, including that requestors are encouraged not to reject responses because doing so could leave them without any Turkers willing to complete their HITs. This creates a cycle in which high-volume, efficient Turkers earn highly positive approval ratings, making them eligible for Master status and therefore higher pay. For Turkers who do this as a full-time job, this is the end goal, but for researchers it is a problem, as it can be argued that AMT's policy suggestions do not optimize data quality.
Another factor that could contribute to the efficiency of frequent Turkers is their familiarity with common psychological scales and research paradigms. Due to the high volume of HITs that these Turkers complete, there is a modest likelihood that they have been exposed to a wide variety of scales and psychological paradigms. Because of this non-naiveté, these Turkers will complete these tasks much more quickly, and in a way that they believe is expected of them (Chandler et al., 2014). This is another unfortunate downside of Turkers who complete a high number of HITs: because they have often been exposed to these scales and tests multiple times, they may be familiar with what is being measured and may answer in ways that they believe the scale or paradigm is supposed to be answered instead of answering truthfully.
This, however, does not indicate that Master Turkers are faster at completing individual surveys, as Masters and non-Masters were found not to differ significantly in the time it took them to complete the survey. This is interesting to note, as previous research (Ford, 2017; Harms & DeSimone, 2015) seemed to indicate that those who achieve Master Turker status would be faster than non-Masters. The opposite direction is found in the present data: on average, Masters took slightly more time, with a mean of 280 seconds to complete the survey compared to a mean of 213 seconds for non-Masters. However, the Masters varied more as a group, as indicated by the larger standard deviation.
A third possible explanation, related to the prior two, is attention. Attentiveness can vary in all research populations, including AMT; however, in an online space there exist many more distractors than in a controlled laboratory or classroom setting, and the rapid speed at which MTurk users complete HITs to maximize monetary output may negatively impact attention (Aguinis et al., 2021). Chandler et al. (2014) found that most Turkers reported completing HITs alone in their own home; they also reported doing other activities simultaneously, such as watching television, listening to music, or instant messaging/texting. Thus, it is important to assess Turkers' level of attention. One mechanism for doing so is to include some type of validity check within the study. Oppenheimer et al. (2009) explored a possible solution to this lack of attention in general research populations by employing an instructional manipulation check (IMC), which mimics the format and length of other survey questions but instead seeks to assess whether the participant is reading and interpreting a question. The idea behind the IMC is for the participant to fully read the question and ignore the response pattern that is typical for the rest of the survey. In that study, the IMC failure rate depended on the broader context of the survey, including when the IMC was presented to the participants. It was also noted that even if the IMC question format did not fit with the context of the question asked, a small percentage of participants still failed the IMC. One strategy is to remove those who fail the IMC altogether to increase statistical power. Another strategy to increase statistical power is to force participants to pay more attention by prompting them with the IMC, thereby priming them to read more carefully throughout the survey.
A recent investigation (Aguinis et al., 2021) found that 15% of Turkers failed attention checks and were also likely to exhibit a myriad of behaviors that indicate a lack of attention. Interestingly, Lovett et al. (2018) surveyed Master Turkers, of whom 70% believed their data to be of very high quality, whereas the other 30% indicated it was high quality. In addition, the study had a qualitative component, and Master Turkers reported that the factors related to high quality data were: higher compensation based on time, attention checks used in the HIT, more experience, higher reputation, multiple choice over writing, clear directions, and clean formatting. Unfortunately, the present study did not support the views reported by Lovett et al. (2018).
In contrast, a different pattern may exist for non-Master Turkers. Specifically, those Turkers may be working toward a Master Turker designation. The best way to achieve that status is to complete many HITs, but to do so well. Although AMT does not provide the algorithm or criteria for achieving Master Turker status, it is a reasonable assumption that part of the calculus involves some combination of the number of HITs completed and the number accepted. In other words, they have to complete quite a few tasks and receive strong ratings over some period of time. Achieving a high rating will most likely be related to successfully completing the tasks (i.e., HITs) and following the directions carefully.
This creates an interesting issue for researchers as well as for Amazon. For researchers, results from the present study might serve as a caution with regard to which instruments to use (or not use) with Master Turkers. It appears as though non-Master Turkers performed similarly to samples drawn from other populations. Consider the possibility of a researcher limiting the respondents to Master Turkers, which may result in paying more for lower quality data. Researchers may then be faced with the daunting task of determining "which" data to keep.
A recent paper (Aguinis et al., 2021) provided a thorough summary of the benefits and validity threats associated with using Turkers as participants in a study. The authors provided a list of four benefits, ten threats, and ten steps to consider when conducting a study on AMT. Of particular relevance to the present study was step eight, labeled "screening data," which is, of course, important in all studies. Aguinis et al. (2021) noted that particular concerns in data screening involve bots, high attrition, and inattention, with possible remedies including attention checks, checking response times, estimating the number of usable responses prior to the study so that one oversamples knowing some data will be deleted, and examining response patterns. Data may need to be deleted, in which case the findings (or lack thereof) may come into question. It is true that this dilemma is present in most types of research in which data are collected from humans, but the AMT platform has additional nuanced issues to deal with, and some of the relevant issues may become more pronounced over time as technology advances in this domain.
The MTurk population could also potentially be less representative than it appears based on demographic information. Studies have demonstrated that the average Turker is not very representative in age and may in fact be rather unusual in level of educational attainment as well (Difallah et al., 2018; Redmiles et al., 2019; Ross et al., 2010). One suggestion for counteracting this is to be more selective with the options that MTurk provides; additional costs will be incurred, but the data will be more representative (Zack et al., 2019).
Although this study had interesting findings, it does have limitations to consider. First, the sample consisted only of Turkers who indicated they lived in the United States, which may or may not be true. Much of the data surrounding AMT usage indicates stark differences between the American and Indian populations that predominantly make up the userbase of the Turk (Ross et al., 2010). A second limitation involves the scales that were used. Both were brief, well-known 10-item scales. Although that is a useful comparison, many research studies use scales with many more items, which may produce different results. Both scales measured what might be considered dimensions of personality, so scales tapping other domains would also be of interest. Third, minimal screening was done to recruit participants in AMT. It may be possible that more rigorous screening methods would result in different outcomes.

Conclusions
The results of the present study provide a cautionary tale for potential requestors using AMT: be prepared to screen data if Master Turkers are completing instruments with reverse-coded items. This also highlights an issue that Amazon would benefit from addressing, as potential customers may be disincentivized from posting their surveys on AMT due to questionable reliability from Masters at a higher price than non-Masters. It may be the case that this platform requires more methodological scrutiny and rigor than other data collection methodologies, particularly when using samples whose data quality may be somewhat uncertain. When feasible, it may be worthwhile to conduct a form of replication or some comparison to a sample obtained from a different platform or setting. Certainly, AMT is not going away anytime soon, but there are clear methodological issues to consider when using it for collecting data.

Table 1
Number of Publications Using AMT by Year

Table 4
Straight Lining Behavior by Turker Status and Scale