Methods to Establish Race or Ethnicity of Twitter Users: Scoping Review

Background: A growing amount of health research uses social media data. Those critical of social media research often cite that it may be unrepresentative of the population; however, the suitability of social media data in digital epidemiology is more nuanced. Identifying the demographics of social media users can help establish representativeness. Objective: This study aims to identify the different approaches or combination of approaches to extract race or ethnicity from social media and report on the challenges of using these methods. Methods: We present a scoping review to identify methods used to extract the race or ethnicity of Twitter users from Twitter data sets. We searched 17 electronic databases from the date of inception to May 15, 2021, and carried out reference checking and hand searching to identify relevant studies. Sifting of each record was performed independently by at least two researchers, with any disagreement discussed. Studies were required to extract the race or ethnicity of Twitter users using either manual or computational methods or a combination of both. Results: Of the 1249 records sifted, we identified 67 (5.36%) that met our inclusion criteria. Most studies (51/67, 76%) have focused on US-based users and English language tweets (52/67, 78%). A range of data was used, including Twitter profile metadata, such as names, pictures, information from bios (including self-declarations), or location or content of the tweets. A range of methodologies was used, including manual inference, linkage to census data, commercial software, language or dialect recognition, or machine learning or natural language processing. However, not all studies have evaluated these methods. Those that evaluated these methods found accuracy to vary from 45% to 93% with significantly lower accuracy in identifying categories of people of color. The inference of race or ethnicity raises important ethical questions, which can be exacerbated by the data and methods used. 
The comparative accuracies of the different methods are also largely unknown. Conclusions: There is no standard accepted approach or current guidelines for extracting or inferring the race or ethnicity of Twitter users. Social media researchers must carefully interpret race or ethnicity and not overpromise what can be achieved, as even manual screening is a subjective, imperfect method. Future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers and be guided by concerns of equity and social justice.


Research Using Twitter Data
Twitter data are increasingly being used as a surveillance and data collection tool in health research. When millions of users post on Twitter, it translates into a vast amount of publicly accessible, timely data about a variety of attitudes, behaviors, and preferences in a given population. Although these data were not originally intended as a repository of individual information, Twitter data have been retrofitted in infodemiology to investigate population-level health trends [1-15]. Researchers often use Twitter data in concert with other sources to test the relationship between web-based discourse and offline health behavior, public opinion, and disease incidence.
The appeal of Twitter data is clear. Twitter is one of the largest public-facing social media platforms, with an ethnically diverse user base [16,17] of more than 68 million US Twitter users, with Black users accounting for 26% of that base [18]. This diverse user base gives researchers access to people they may have difficulty reaching using more traditional approaches [19]. However, promising insights that can be derived from Twitter data are often limited by what is missing, specifically the basic sociodemographic information of each Twitter user. The demographic attributes of users are often required in health research for subpopulation analyses, to explore differences, and to identify inequity. Without evidence of the distal and proximal factors that lead to racial and ethnic health disparities, it is impossible to address and correct these drivers. Insights from social media data can be used to inform service provision as well as to develop targeted health messaging by understanding public perspectives from diverse populations.

Extracting Demographics From Twitter
However, to use social media and digital health research to address disparities, we need to know not only what is said on Twitter but also who is saying what [20]. Although others have discussed extracting or estimating features, such as location, age, gender, language, occupation, and class, no comprehensive review of the methods used to extract race or ethnicity has been conducted [20]. Extracting the race and ethnicity of Twitter users is particularly important for identifying trends, experiences, and attitudes of racially and ethnically diverse populations [21]. As race is a social construction and not a genetic categorization [22,23], the practice of defining race and ethnicity in health research has been an ongoing, evolving challenge. Traditional research has the advantage of identifying the person in the study and allowing them to systematically identify their racial and ethnic identities. In digital health research [22,23], determining a user's race or ethnicity by extracting data from a user's Twitter profile, metadata, or tweets is a process that is inevitably challenging, complex, and not without ethical questions.
Furthermore, although Twitter is used for international research, an international comparative study of methods to determine race or ethnicity is difficult, if not practically impossible, given that societies use different standardized categories that describe their own populations [24]. A common approach in the United States is based on the US Census Bureau practice of allowing participants to identify with as many as 5-6 large racial groupings (Black, White, Asian, Pacific Islander, Native, and other), while separately choosing one ethnicity (Hispanic) [25]. However, race and ethnicity variables continue to be misused in the study design or when drawing conclusions. For example, race or ethnicity is often incorrectly treated as a predictor of poor health rather than as a proxy for the impact that being of a particular race or ethnicity has on a person's experience with the health system [26]. Simply put, health disparities are driven by racism, not race [27-29]. Although race or ethnicity affiliation is an important factor in understanding diverse populations, digital research must tread lightly and thoughtfully in both the collection and assignment of race or ethnicity.

Objectives
The lack of basic sociodemographic data on Twitter users has led researchers to apply a variety of approaches to better understand the characteristics of the people behind each tweet. The breadth of the landscape of approaches to extracting race or ethnicity is currently unknown. Our overall aim was to summarize and assess the range of computational and manual methods used in research based on Twitter data to determine the race or ethnicity of Twitter users.

Overview
We conducted a comprehensive scoping review of extraction methods and offered recommendations and cautions related to these approaches [30]. We selected Twitter, as it is currently the most commonly used social media platform in health care research, and it has some unique intrinsic characteristics that drive the methods used for mining it. Thus, we felt that the methods, type of data, and social media platforms used are related in such a way that comparing methods for different social media would add too many variables and would not be truly comparing like with like. A detailed protocol was designed for the methods to be used in our scoping review, but we were unable to register scoping reviews on PROSPERO. We report our methods according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) scoping review statement [30]. We excluded studies that extracted race or ethnicity from social media platforms other than Twitter or from unspecified social media platforms, as well as studies that used multiple social media platforms including Twitter but did not present the data relating to Twitter separately.

Intervention
Studies were included where the methods to extract or infer the race or ethnicity data of Twitter users were stated. Articles that used machine learning (ML), natural language processing (NLP), human-in-the-loop, or other computationally assisted methods to predict race or ethnicity of users were included, as were manual or noncomputational methods, including photo recognition or linking to census data. We excluded studies for which we were unable to determine the methods used or that extracted data solely on other demographic characteristics, such as age, gender, or geographic location.

Comparator
The use of a comparison of the methods used was not required. A method could be compared with another (such as a gold standard), or no comparison could be undertaken.

Outcome
The extraction or inference of the race or ethnicity of Twitter users was the primary or secondary outcome of the study. As this was a scoping review in which we aimed to demonstrate the full landscape of the literature, no particular measurement of the performance of the method used was required in our included studies.

Study Design
Any type of research study design was considered relevant. Discussion papers, commentaries, and letters were excluded.

Limits
No restrictions on date, language, or publication type were applied to the inclusion criteria. However, no potentially relevant studies were identified in any non-English language, and the period by default was since 2006, the year of the inception of Twitter.

Search Strategy
A database search strategy was derived by combining three facets: facet 1 consisted of free-text terms related to Twitter (Twitter OR Tweet* OR Tweeting OR Retweet* OR Tweep*); facet 2 consisted of terms for race or ethnicity; and facet 3 consisted of terms for methods of prediction, such as ML, NLP, and artificial intelligence-related terms (Table S1 in Multimedia Appendix 1 [3,10,12,18,20,21,). All ethnology-related subject terms were adapted for different database taxonomies and syntax, with standard methods for predicting subject terms in MEDLINE and other database indexing. The methods-of-prediction facet was expanded using a comprehensive list of specific text analysis tools and software names extracted from the study by Hinds and Joinson [97], which included a comprehensive list of automated ML processes used in predicting demographic markers in social media. Additional terms were added from a related study [98].
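The three-facet structure can be illustrated with a minimal sketch that joins terms within a facet using OR and joins the facets using AND. Only the facet 1 terms are taken from the text; the facet 2 and facet 3 terms, the `FACETS` name, and the helper functions are hypothetical placeholders, and each real database would need its own syntax and subject headings.

```python
# Facet 1 terms are quoted from the search strategy above. Facets 2 and 3
# hold hypothetical placeholder terms, not the review's actual search terms.
FACETS = {
    "twitter": ["Twitter", "Tweet*", "Tweeting", "Retweet*", "Tweep*"],
    "race_ethnicity": ["race", "ethnic*"],  # assumed examples
    "prediction_methods": ["machine learning", "natural language processing"],  # assumed
}


def or_block(terms):
    """Join the terms of one facet with OR, quoting multiword phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(facets):
    """AND the facets together so that a record must match every facet."""
    return " AND ".join(or_block(terms) for terms in facets.values())


query = build_query(FACETS)
print(query)
```

Keeping each facet as a separate list makes it straightforward to expand one facet (for example, with tool and software names) without touching the others.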

Sources Searched
A wide range of bibliographic and gray literature databases were selected to cover topics in computer science, health, and social sciences. The databases ( Reference checking of all included studies and any related systematic reviews identified by the searches was conducted. We browsed the Journal of Medical Internet Research, as this is a key journal in this field, and hand searched the proceedings of 2 relevant conferences, the International Conference on Weblogs and Social Media and the Association for Computational Linguistics (ACL). Citations were exported to a shared EndNote library, and duplicates were removed. The deduplicated records were then imported into Rayyan to facilitate independent blinded screening by the authors. Using the inclusion criteria, at least two screeners (SG, RS, KO, or RJ) from the research team independently screened each record, with disputes on inclusion discussed and a consensus decision reached.
Only the first 50 records from ACL and the first 100 records from a Google Scholar search were screened during two searches (March 11, 2020, and May 24, 2021) as these records are displayed in order of relevance, and it was felt that after this number no relevant studies were being identified [12,21,99].
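The deduplication and dual-screening steps can be sketched as follows. This is an illustration only: the review used EndNote and Rayyan, whose matching rules are not described here, so the title-and-year key, the record fields, and the example records are all assumptions.

```python
import re


def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace for matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first record seen for each (normalized title, year) key."""
    seen, unique = set(), []
    for rec in records:
        key = (normalize(rec["title"]), rec["year"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def find_conflicts(decisions_a, decisions_b):
    """Return the record IDs on which two independent screeners disagree."""
    return sorted(
        rid for rid in decisions_a
        if rid in decisions_b and decisions_a[rid] != decisions_b[rid]
    )


# Hypothetical records; titles and years are made up for illustration.
records = [
    {"id": 1, "title": "Inferring Demographics of Twitter Users", "year": 2015},
    {"id": 2, "title": "Inferring demographics of Twitter users.", "year": 2015},
    {"id": 3, "title": "A Different Study", "year": 2018},
]
unique = deduplicate(records)
conflicts = find_conflicts(
    {1: "include", 3: "exclude"},
    {1: "include", 3: "include"},
)
```

Records listed in `conflicts` are the ones that would go forward for discussion and a consensus decision.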

Data Extraction
For each included study, we extracted the following data into an Excel spreadsheet: year of publication, study country and language, race or ethnicity categories extracted (such as Black, White, or Asian for race, or Hispanic or European for ethnicity), and paper type (journal, conference, or thesis). We also extracted details on extraction methods (such as classification models or software used), features and predictors used in extraction (tweets, profiles, and pictures), number of Twitter users, number of tweets or images used, performance measures to evaluate methods used (validation), and results of any evaluation (such as accuracy). All performance measure metrics were reported as stated in the included studies. All the extracted data were checked by 2 reviewers.

Quality Assessment
There was no formally approved quality assessment tool for this type of study. As this was a scoping review, we did not carry out any formal assessment. However, we assessed any validation performed and whether the methods were reproducible.

Data Analysis
We have summarized the stated performance of the papers that included validation. However, we could not compare approaches using the stated performance, as the performance measures and validation approaches varied considerably. In addition, there is no recognized gold standard data set for comparison.

Overview
A total of 1735 records were entered into an EndNote library (Clarivate), and duplicates were removed, leaving 1249 (72%) records for sifting (Figure 1). A total of 1080 records were excluded based on the title and abstract screening alone. A total of 169 references were deemed potentially relevant by at least one of the independent sifters (RS, GG, RJ, SG, and KO). The full text of these articles was screened independently, and 67 studies [12,21,99] met our inclusion criteria and 102 references were excluded [77,97,. The main reason for exclusion was that although the abstract indicated that demographic data were collected, it did not include race or ethnicity (most commonly, other demographic attributes such as gender, age, or location were collected). Other reasons for exclusion were that the researchers collected demographic data through surveys or questionnaires administered via Twitter (but not from data posted on Twitter) or that the researchers used a social media platform other than Twitter.
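The record counts above can be checked with simple arithmetic. The sketch below uses only the numbers reported in the text; the variable names and the `pct` helper are illustrative.

```python
# Counts reported in the text.
identified = 1735
after_dedup = 1249
excluded_title_abstract = 1080
included = 67

duplicates_removed = identified - after_dedup
full_text_screened = after_dedup - excluded_title_abstract
excluded_full_text = full_text_screened - included


def pct(part, whole, digits=0):
    """Percentage of `whole` represented by `part`, rounded to `digits`."""
    return round(100 * part / whole, digits)


print(duplicates_removed, full_text_screened, excluded_full_text)
print(pct(after_dedup, identified))    # share of records left for sifting
print(pct(included, after_dedup, 2))   # share of sifted records included
```

The derived values agree with the reported 169 full-text records, 102 full-text exclusions, 72% of records remaining after deduplication, and 5.36% of sifted records included.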
Some studies (12/67, 18%) treated race as a binary classification, such as African American versus not African American or African American versus White, whereas others created a multiclass classifier of 3 (15/67, 22%) or 4 classes (33/67, 49%) or a combination of classes. A total of 6 studies identified >4 classes; however, these often included ethnicity or nationality classifiers as well as race [38,48,54,66,83,95]. Wang and Chi [77] was a conference paper that did not report the race types extracted.
The data objects from Twitter used to extract race or ethnicity varied, with the use of profile pictures or Twitter users' names being the most common. Others have also used tweets in the users' timeline, information from Twitter bios, or Twitter users' locations. Most studies (39/67, 58%) used more than one data object from Twitter data. In addition, the data sets within the studies varied in size between 392 and 168,000,000, with those using manual methods having smaller data sets ranging from just 392 [50] to 4900 [65].
Unfortunately, although performance was measured in 67% (45/67) of the studies, it was measured inconsistently (Table 2). The metrics used to report results were particularly varied for studies using ML or NLP and included the F1 score (which combines precision and recall), accuracy, area under the curve, or mean average precision. Table 2 lists the methods, features, and reported performance of the top model from each study.
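To make the difference between these metrics concrete, the following sketch computes accuracy, precision, recall, and the F1 score from the four cells of a binary confusion matrix. The counts are hypothetical and chosen to resemble a small minority class in an imbalanced sample; they do not come from any included study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 20 true members of the class among 200 users.
m = binary_metrics(tp=5, fp=5, fn=15, tn=175)
print(m)
```

With these counts, accuracy is high while F1 for the class of interest is low, which is one reason results reported with different metrics cannot be compared directly.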

Manual Screening
A total of 12 studies used manual techniques to classify Twitter users into race or ethnicity categories [21,36,40,49-51,57,65,87-90]. These studies generally combined qualitative interpretations of recent tweets, information in user bios making an affirmation of racial or ethnic identity, or photographs or images in the user timeline or profile.
In most cases, tweets were first identified by text matching based on terms of interest in the research topic, such as having a baby with a birth defect [50], commenting on a controversial topic [57,89], or using potentially gang- or drug-related language [40]. Researchers then identified the tweet authors and, in most cases, assigned race or ethnicity through hand coding based on profile and timeline content. Some studies coded primarily based on self-identifying statements of race used in a tweet or in users' bios, such as people stating that they are a Black American [49,50,88,90], or on hashtags [36] (such as #BlackScientist). Others coded exclusively based on the research team's attribution of racial identity through the examination of profile photographs [21,57] or avatars [87]. Some authors coded primarily with self-declarations, with secondary indicators, such as profile pictures, language, usernames, or other content [40,51,65,88,89]. In most cases, it appears reasonable to infer that coding was performed by the study authors or members of their research teams, with the exception of those using the crowdsourcing marketplace, Amazon Mechanical Turk [21,90].
The agreement among coders was sometimes measured, but validity and accuracy measurements were not generally included. One study [65], however, documented 78% reliability for coding race compared with census demographics, with Black and White users being coded accurately 90% of the time and Hispanic or Asian users being accurately coded between 45% and 60% of the time. The high accuracy for Black users was attributed to the higher likelihood of Black users self-identifying.

Census-Driven Prediction
Another approach to predict race or ethnicity is to use demographic information from the national census and census-like data and transfer it to the social media cohort. The US-based studies largely used census-based race and ethnicity categories: Asian and Pacific Islander, Black or African American, Latino or Hispanic, Native American, and White. A UK-based study included the categories British and Irish, West European, East European, Greek or Turkish, Southeast Asian, other Asian, African and Caribbean, Jewish, Chinese, and other minorities [83].
We identified 14 studies [39,48,52,54,60,63,70,71,74,77,83-85,95] that used census geographic data, census surname classification, or a combination of both. A total of 6 studies incorporated geographic census data [39,52,63,74,83,84]. For example, Blodgett et al [39] created a simple probabilistic model to infer a user's ethnicity by matching geotagged tweets with census block information. They averaged the demographic values of all tweets by the user and assumed this to be a rough proxy for the user's demographics. Stewart [74] collected tweets tagged with geolocation information (longitude and latitude). The ZIP code of the user was derived from this geolocation information and matched with the demographic information found in the ZIP Code Tabulation Area defined by the Census Bureau. This information was used to find a correlation between ethnicity and African American vernacular English syntax [74].
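The averaging step described for geotagged tweets can be sketched as below. This is a simplified illustration under stated assumptions, not the model of Blodgett et al [39]: the block identifiers, the group labels, and the proportions in `BLOCK_DEMOGRAPHICS` are hypothetical, and real values would come from census data.

```python
from collections import defaultdict

# Hypothetical census block proportions, keyed by a made-up block ID.
BLOCK_DEMOGRAPHICS = {
    "block_A": {"group_1": 0.70, "group_2": 0.20, "group_3": 0.10},
    "block_B": {"group_1": 0.30, "group_2": 0.60, "group_3": 0.10},
}


def user_demographic_proxy(tweet_blocks, block_demographics):
    """Average block-level proportions over a user's geotagged tweets."""
    totals, n = defaultdict(float), 0
    for block in tweet_blocks:
        shares = block_demographics.get(block)
        if shares is None:  # tweet fell outside the known blocks
            continue
        n += 1
        for group, share in shares.items():
            totals[group] += share
    if n == 0:
        return {}
    return {group: total / n for group, total in totals.items()}


# One user with four geotagged tweets, one of them in an unknown block.
proxy = user_demographic_proxy(
    ["block_A", "block_A", "block_B", "unknown"], BLOCK_DEMOGRAPHICS
)
print(proxy)
```

The output describes the areas a user tweets from rather than the user, which is why it can only be treated as a rough proxy.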
Other studies have used a census-derived name classification system to determine race or ethnicity based on user names. We identified 12 studies that predicted user race or ethnicity using surnames [48,54,60,63,70,71,77,83-85,95,189]. Surnames were used to assign race or ethnicity using either a US census-based name classification system or, less commonly, a classification system generated in-house by the authors. Of these 12 studies, 7 (58%) relied solely on users' last names [48,54,60,63,70,71,85]. Validation methods for the name-based system alone were not reported, but 4 (33%) of the 12 studies reported an accuracy between 71.8% and 81.25% [63,70,71,83]. Of note, one study reported vastly different accuracies in predicting whiteness versus blackness (94% for predicting White users vs 33% for predicting African American or Black users) [83]. The remaining 2 studies augmented name-based predictions with aggregate demographic data from the American Community Survey or equivalent surveys. For example, statistical and text mining methods have been used to extract surnames from Twitter profiles, combining this information with census block information based on geolocated tweets to assess the probability of the user's race or ethnicity [60]. However, these studies did not report validation or accuracy.
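One simple way to combine a surname-based estimate with a geography-based estimate is to multiply the two per-group probabilities and renormalize. The sketch below assumes that the two sources are independent and that the groups have equal prior probability. It is not the method of the cited study [60], which is not described in detail here, and all probabilities and group labels are hypothetical.

```python
def combine(surname_probs, block_probs):
    """Multiply two per-group probability estimates and renormalize.

    Assumes the two sources are conditionally independent and that the
    groups have equal prior probability (a strong simplification).
    """
    groups = surname_probs.keys() & block_probs.keys()
    unnormalized = {g: surname_probs[g] * block_probs[g] for g in groups}
    total = sum(unnormalized.values())
    if total == 0:
        return {}
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical inputs for one user.
surname_probs = {"group_1": 0.6, "group_2": 0.3, "group_3": 0.1}
block_probs = {"group_1": 0.2, "group_2": 0.7, "group_3": 0.1}
posterior = combine(surname_probs, block_probs)
print(posterior)
```

In this example, the geographic signal outweighs the surname signal, so the combined estimate favors a different group than the surname alone, which illustrates how sensitive the result is to the inputs chosen.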
Labeled data sets are used to train and test supervised and semisupervised ML models and to validate the output of unsupervised learning methods. Some of the studies used previously created data sets that contained demographic information, such as the MORPH longitudinal face database of images [189], a database of mugshots [38], or manually annotated data from previous studies [79,81]. Others created ground truth data sets from surveys [96] or by semiautomatic means, such as matching Twitter users to voter registrations [37], using extracted self-identification from user profiles or tweets [67,68,81], or using celebrities with known ethnicities [66]. Manual annotation of Twitter users was also used based on profile metadata [34,35,46,76], self-declarations in the timeline [61,82], or user images [35,94]. Table 2 summarizes the best performing ML approach, features used, and the reported results for each study that used automatic classification methods. In the table, the classifier is the number of race or ethnicity classification groups, ML model is the top performing algorithm reported, and features are the variables used in the predictions.
Data from Twitter are inherently imbalanced in terms of race and ethnicity. In ML, it is important to attempt to mitigate the effects of the imbalance, as the models have difficulty learning from a few examples and will tend to classify to the majority class and ignore the minority class. Few studies (12/67, 18%) have directly addressed this imbalance. Some opted to make the task binary, focusing only on their group of interest versus all others [67,68,94] or only on the majority classes [38,76]. Others chose modified performance metrics that account for imbalance when reporting their results [33,61,82]. One group, which classified users based on images, supplemented its training set from an additional data source for the minority classes [33,35]. Only 2 studies have experimented with comparator models trained on balanced data sets. Wood-Doughty et al [81] undersampled the majority class in their training sets, and another study [96] oversampled the minority classes. In both cases, the overall accuracy of the models decreased, from 0.83 to 0.41 [81] (on their best performing unbalanced model) and from 0.84 to 0.68 [96], as the performance boost from the models' superior performance on the majority class was eradicated.
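The two balancing strategies mentioned above, undersampling the majority class and oversampling the minority classes, can be sketched with random resampling. This is a generic illustration with hypothetical labels and class sizes; the included studies may have used different resampling procedures.

```python
import random
from collections import Counter, defaultdict


def resample(examples, labels, mode, seed=0):
    """Balance classes by random undersampling or random oversampling."""
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    sizes = [len(items) for items in by_class.values()]
    # Undersampling shrinks every class to the smallest class size;
    # oversampling grows every class to the largest class size.
    target = min(sizes) if mode == "under" else max(sizes)
    out = []
    for y, items in by_class.items():
        if mode == "under":
            chosen = rng.sample(items, target)  # without replacement
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra  # duplicates existing minority items
        out.extend((x, y) for x in chosen)
    rng.shuffle(out)
    return out


# Hypothetical training set: 90 majority and 10 minority examples.
labels = ["majority"] * 90 + ["minority"] * 10
examples = list(range(100))
under = Counter(y for _, y in resample(examples, labels, "under"))
over = Counter(y for _, y in resample(examples, labels, "over"))
print(under, over)
```

Undersampling discards most majority-class examples, whereas oversampling repeats the few minority-class examples, so neither adds new information about the minority class.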
In total, 2 studies that used Face++ [32,58] did not measure its performance. Another study [44] stated that Face++ could identify race with 99% confidence or higher for 9% of total users. In addition, 2 studies [53,55] used Face++ along with other methods. One of these studies used Face++ in conjunction with demographics, using a given name or full name from a database that contains US census data for demographics. This study simply measured the percentage of Twitter users for which race data could be extracted (46% college students and 92% role models) but did not measure the performance of Face++ [53]. Another study [55] built a classifier model on top of using Face++ and recorded an accuracy of 83.8% when compared with users who stated their nationality.
A total of 4 studies [45,62,69,75] (with the same data set in full or in part) used the average confidence level reported by Face++ for race, which was 85.97 (SD 0.024%), 85.99 (SD 0.03%), and 86.12 (SD 0.032%), respectively, with a CI of 95%. When one of these studies [45] carried out its own accuracy assessment, it found an accuracy score of 79% for race when compared with 100 manually annotated pictures. Huang et al [56] also carried out an accuracy assessment and found that Face++ achieved an averaged accuracy score of 88.4% for race when compared with 250 manually annotated pictures.
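Accuracy estimated from a small annotated sample carries sampling uncertainty. As a hedged illustration, the sketch below computes a Wilson score interval for the two accuracy assessments above, assuming each picture is an independent trial and converting the reported percentages to counts (79 of 100 and 221 of 250). The included studies did not report such intervals, and the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low_100, high_100 = wilson_interval(79, 100)   # 79% on 100 pictures
low_250, high_250 = wilson_interval(221, 250)  # 88.4% on 250 pictures
print(round(low_100, 3), round(high_100, 3))
print(round(low_250, 3), round(high_250, 3))
```

The interval for the 100-picture assessment is noticeably wider than that for the 250-picture assessment, which is worth bearing in mind when comparing the two figures.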
A total of 5 studies [12,41-43,73] used Demographics Pro, and although they reported on the success of Demographics Pro in general, they did not directly report any metrics of its success. The 2 studies using Onomap provided no validation of the software [59,86].
In light of our results, we have compiled our recommendations for best practice, which are summarized in Figure 2 and further examined in the Discussion section.

Principal Findings
As there are no currently published guidelines or even best practice guidance, it is no surprise that researchers have used a variety of methods for estimating the race or ethnicity of Twitter users. We identified four categories for the methods used: manual screening, census-based prediction, ad hoc ML or NLP, and off-the-shelf software. All these methods exhibit particular strengths, as well as inherent biases and limitations.
Comparing the validity of methods for the purpose of deriving race or ethnicity is difficult as classification models differ not only in approach but also in the definition of the classification of race or ethnicity itself [112,202,203]. There is also a distinct lack of evaluation or validation of the methods used. Those that measured the performance of the methods used found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying categories of people of color.
This review sheds little light on the performance of commercial software packages. Previous empirical comparisons of facial recognition application programming interfaces have found that Face++ achieves 93% accuracy [204] and works comparatively better for men with lighter skin [205]. The studies included in our review suggested a lower accuracy. However, data on accuracy were not forthcoming in any of the included studies using Demographics Pro [200]. Even when performance is assessed, the methodology used may be biased if there are issues with the gold standard used to train the model.
In addition to the 4 overarching methods used, the studies varied in terms of the features used to determine or define race or ethnicity. Furthermore, the reliability of the features used to determine or define race or ethnicity for this purpose is questionable. Specifically, the use of Twitter users' profile pictures, names, and locations, the use of unvalidated linguistic features attributed to racial groups (such as slang words, African American vernacular English, Spanglish, or Multicultural London English), and the use of training data that are prone to perpetuate biases (eg, police booking photos or mug shots) were all of particular concern.

Issues Related to the Methods Used
Approaches that include or rely solely on profile pictures to determine race or ethnicity can introduce bias. First, not all users have a photograph as their profile picture, nor is it easy to determine whether the picture used is that of the user. A study on the feasibility of using Face++ found that only 30.8% of Twitter users had a detectable single face in their profile. A manual review of automatically detected faces determined that 80% could potentially be of the user (ie, not a celebrity) [206]. Human annotation may introduce additional bias, and studies have found systematic biases in the classification of people into racial or ethnic groups based on photographs [207,208]. Furthermore, humans tend to perceive their own race more readily than others [209,210]. Thus, race or ethnicity in the annotation team has an impact on the accuracy of their race or ethnicity labels, potentially skewing the sample labels toward the race or ethnicity of the annotators [211,212]. Given that ML and NLP methods are trained on these data sets, the human biases transfer to automated methods, leading to poorly supervised ML and training, which has been shown to result in discrimination by the algorithm [213-215]. These concerns did not appear to be interrogated by the study designers. Without exception, they present categorization of persons into race or ethnicity, assuming that a subjective reading of facial features or idiomatic speech is the gold standard both for coding of race or ethnicity and for training and evaluation of automated methods.
Other methods, such as using geography or names as indicators of race, may also be unreliable. One could argue that the demographic profile for a geographic region is a better representation of race or ethnicity in the demographic environment than an individual's race or ethnicity. Problems in using postcodes or locations to decipher individual social determinants are well documented [216]. The use of census data from an area that is too large may skew the results. Among the studies reviewed, some used census block data, which are granular, whereas others extrapolated from larger areas, such as city- or county-level data. For example, Saravanan [72] inferred the demographics of users in a city as a certain ethnic group based on a city with a large population of that group; however, no fine-grained analysis was performed either for the city chosen or for geolocation of the Twitter user. Thus, the validity of their assumption that a user in Los Angeles County is of Mexican descent [72] is questionable. As these data were then used to create a race or ethnicity dictionary of terms used by that group to train their model, the questionable assumption further taints downstream applications and results. The models also do not consider the differences between the demographics of Twitter users and the general demographics of the population.
In addition, census demographic data that use names are also questionable because of name-taking in marriage and indiscernible names.
The practice of using a Twitter user's self-reported race or ethnicity would provide a label with high confidence but restrict the amount of usable data and introduce a margin of error depending on the method used to extract such self-reports. For example, in a sample of 14 million users, fewer than 0.1% matched precise regular expressions created to detect self-reported race or ethnic identity [128]. Another study used mentions of keywords related to race or ethnicity in a user's bio; however, limited validation was conducted to ensure that the mention was actually related to the user's race or ethnicity [67,68]. This lack of information gathered from the profile information leads to sampling bias in the training of the models [152].
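The trade-off between precise regular expressions and loose keyword matching can be illustrated with a minimal sketch. The pattern, the term list, and the function name below are hypothetical and deliberately narrow; they are not the expressions used in the cited studies.

```python
import re

# Hypothetical, deliberately narrow pattern for first-person declarations.
IDENTITY_TERMS = r"(black|white|asian|latino|latina|hispanic)"
SELF_REPORT = re.compile(
    rf"\b(?:i am|i'm|im)\s+(?:an?\s+)?(?:proud\s+)?{IDENTITY_TERMS}\s+"
    r"(?:man|woman|person|american|girl|guy)\b",
    re.IGNORECASE,
)


def extract_self_report(text):
    """Return the matched identity term in lowercase, or None if no match."""
    match = SELF_REPORT.search(text)
    return match.group(1).lower() if match else None


print(extract_self_report("I'm a proud Black woman and engineer"))
print(extract_self_report("I love black coffee"))
```

Requiring a first-person statement avoids matching incidental uses of a keyword, as in the second example, but it also means that most profiles yield no label at all.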
Some models trained on manually annotated data did not have high interannotator agreement; for example, the crowdsourced annotation used by Chen et al [46] had agreement measured at 0.45. This can be interpreted as weak agreement, with the percentage of reliable data being 15% to 35% [217]. Training a model on such weakly labeled data produces uncertain results.
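The text does not name the agreement statistic used. Assuming it is a chance-corrected measure such as Cohen kappa, the following sketch shows how that statistic is computed for 2 annotators. The labels and annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, nonempty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of 10 users by 2 annotators.
a = ["x", "x", "x", "y", "y", "y", "x", "y", "x", "y"]
b = ["x", "x", "y", "y", "y", "x", "x", "y", "x", "x"]
kappa = cohens_kappa(a, b)
print(round(kappa, 2))
```

Here the annotators agree on 7 of 10 items, yet half of that agreement would be expected by chance, so the chance-corrected value is much lower than the raw 70% agreement suggests.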
It is not possible to assume the accuracy of black box proprietary tools and algorithms. The only race or ethnicity measure that seems empirically reliable is self-report, but this has considerable limitations. Thus, faulty methods continue to underpin digital health research, and researchers are likely to become increasingly dependent on them. The gold standard data required to know the demographic characteristics of the Twitter user is difficult to ascertain.
The methods that we highlight as best practices include directly asking the Twitter users. This can be achieved, for example, by asking respondents of a traditional survey for both their demographic data and their Twitter handles so that the data can be linked [96]. This was undertaken in the NatCen Social Research British Social Attitudes Survey 2015, which has the added benefit of allowing the study of the accuracy of further methods for deriving demographic data [20]. Contacting Twitter users may also provide a gold standard but is impractical, given the current terms of use of Twitter that might consider such contact a form of spamming [72,204,205,216].
A limitation of extracting race or ethnicity from social media is the necessity to oversimplify the complexity of racial identity. The categories were often limited to Black, White, Hispanic, or Asian. Note that Hispanic is considered an ethnicity by the US census, but most studies in ML used it as a race category, more so than Asian (because of low numbers in this category). Many more racial identities exist, particularly from an international perspective, and such limited categories overlook multiracial or primary and secondary identities. In addition, inferred identities may differ from self-identity, raising further issues.
Given the sensitive nature of the data, it is important as a best practice for the results of studies that derive race or ethnicity from Twitter data to be reproducible for validation and future use. The reproducibility of most of the studies in this review would be difficult or impossible, as only 5 studies were linked to available code or data [38,47,79,81,108]. Furthermore, there is limited information regarding the coding of the training data. None of the studies detailed their annotation schemas or made available annotation guidelines. Detailed guidelines as a best practice may allow recreation or extension of data sets in situations where the original data may not be shared or where there is data loss over time. This is particularly true of data collected from Twitter, where the terms of use require that shared data sets consist of only tweet IDs, not tweets, and that best efforts to delete IDs from the data set if the original tweet is removed or made private by the user be in place. Additional restrictions are placed on special use cases for sensitive information, prohibiting the storage of such sensitive information if detected or inferred from the user. Twitter explicitly states that information on racial or ethnic origin cannot be derived or inferred for an individual Twitter user and allows academic research studies to use only aggregate-level data for analysis [218]. It may be argued that this policy is more likely to be targeted at commercial activities.

Strengths and Limitations
We did not limit our database searches and other methods by study design; however, we were unable to identify any previous reviews on the subject. To the best of our knowledge, this is the first review of methods used to extract race or ethnicity from social media. We identified studies from a range of disciplines and sources and categorized and summarized the methods used. However, we were unable to obtain information on the methodologies used by private-sector companies that created software for this purpose. Marketing and targeted advertising are common on social media and are likely to use race as a part of their algorithms to derive target users.
We did not limit our included papers to those in which the extraction of race or ethnicity was the primary focus. Although this can be conceived as a strength, it also meant that reporting of the methods used was often poor. Accurate recreation of the data was hampered by not knowing how decisions were made in the original studies, including what demographic definitions of race or ethnicity were used or how accuracy was determined. This limited the assessment of the included studies. Few studies validated the methods or conducted an error analysis to assess how often race was misapplied, and those that did rarely used the most appropriate gold standard. This makes it difficult to directly compare the results of the different approaches.

Future Directions
Future studies should examine their methodological approaches to estimating race or ethnicity, offering careful interpretations that acknowledge the significant limits of these approaches and their impact on the interpretation of the results. This may include reporting the results as a range that communicates the inherent uncertainty of the classification model. Social media data may best be used in combination with other information. In addition, we must always be mindful that race is a proxy measure for the much larger impact of being a particular race or ethnicity in a society. As a result, the variability associated with race and ethnicity might reveal more about the effects of racism and social stratification than about individual user attributes. To conduct such research ethically and rigorously, we recommend several practices that can help reduce bias and increase reproducibility.
We recommend acknowledging the researchers' biases, which can influence the conceptualization and implementation of the study. Incorporating this reflexivity, as is common in qualitative research, allows for the identification of potential blind spots that weaken the research. One way to address homogeneous research teams is through the inclusion of experts in race or ethnicity or in the communities being examined. These biases can also be reduced by including members of the study population in the research process as experts and advisers [219]. Although big data from social media can be collected without ever connecting with the people who contributed the data, this does not eliminate the ethical need for researchers to include representative perspectives in research processes. Examples of patient-engaged research and patient-centered outcomes research, community-based participatory research, and citizen science (public participation in scientific research) within the health and social sciences amply demonstrate the instrumental value and ethical obligation of intentional efforts to involve nonscientist partners in cocreation of research [219]. The quality of data science can be improved by seriously heeding the imperative, "Nothing about us without us" [219]. Documenting and establishing the diverse competence attributes of a research team should become a standard. Emphasizing the importance of diverse teams within the research process will contribute to social and racial justice in ways other than improving the reliability of research.
In terms of the retrieved data, the most reliable (though imperfect) method for ascertaining race was when users self-identified their racial affiliation. Further research on overcoming the limitations of availability and sample size may be warranted. Indeed, a hybrid model combining automated methods and manual extraction may be preferred. For example, automated methods could be developed to identify potential self-declarations in a user profile or timeline, which can then be manually interpreted.
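The hybrid approach described above can be sketched as a simple prefilter that flags candidate self-declarations for human review. This is an illustration under stated assumptions, not a method taken from any included study: the term list, the pattern, the function name, and the example bios are all hypothetical and deliberately incomplete.

```python
import re

# Hypothetical, deliberately incomplete term list; a real study would need a
# documented and much broader vocabulary developed with community input.
IDENTITY_TERMS = ["black", "white", "hispanic", "latina", "latino", "latinx", "asian"]

# Matches first-person framings such as "I'm a ...", "I am ...", or "As a ...".
SELF_DECLARATION = re.compile(
    r"\b(?:i\s+am|i'm|im|as)\s+(?:an?\s+)?(?:proud\s+)?("
    + "|".join(IDENTITY_TERMS)
    + r")\b",
    re.IGNORECASE,
)


def flag_candidates(profiles):
    """Return (user_id, matched_term, text) for bios that need manual review."""
    flagged = []
    for user_id, bio in profiles:
        match = SELF_DECLARATION.search(bio)
        if match:
            flagged.append((user_id, match.group(1).lower(), bio))
    return flagged


# Hypothetical profile descriptions.
profiles = [
    ("u1", "Proud dad. I'm a Black engineer from Atlanta."),
    ("u2", "Coffee, code, and black-and-white photography."),
    ("u3", "As a Latina scientist, I tweet about public health."),
    ("u4", "I am white-knuckling through grad school."),
]
review_queue = flag_candidates(profiles)
for user_id, term, bio in review_queue:
    print(user_id, term, bio)
```

Here the pattern correctly skips u2 but flags u4, which is not a self-declaration at all. The automated step only narrows the pool; the manual step remains necessary to interpret each candidate.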
Finally, we call on our colleagues to report validation in greater detail. Without error analysis, bias in computational techniques cannot be detected. Further research is needed to establish whether any bias is systematic or random, that is, whether inaccuracies favor one direction or another.
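As a minimal sketch of what such an error analysis could report, the snippet below compares predicted labels with gold standard labels and summarizes, for each category, the recall and the label most often assigned in error. The function name, the placeholder categories A and B, and the data are hypothetical; a real analysis would also need adequate sample sizes per category.

```python
from collections import Counter, defaultdict


def error_analysis(gold, predicted):
    """Per-category recall and the most common wrong label for each category."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # confusion[g][p] counts items whose gold label is g and predicted label is p.
    confusion = defaultdict(Counter)
    for g, p in zip(gold, predicted):
        confusion[g][p] += 1
    report = {}
    for category, row in confusion.items():
        total = sum(row.values())
        errors = Counter({p: n for p, n in row.items() if p != category})
        report[category] = {
            "n": total,
            "recall": row[category] / total,
            "most_common_error": errors.most_common(1)[0][0] if errors else None,
        }
    return report


# Hypothetical gold standard and model output for two placeholder categories.
gold = ["A"] * 6 + ["B"] * 4
predicted = ["A", "A", "A", "A", "A", "B", "A", "A", "A", "B"]
report = error_analysis(gold, predicted)
for category, stats in report.items():
    print(category, stats)
```

In this toy data, overall accuracy is 60%, yet category B is recovered only 25% of the time and is usually mislabeled as A. Errors that run predominantly in one direction, as here, indicate systematic rather than random bias, which an aggregate accuracy figure would hide.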

Conclusions
We identified major concerns that affect the reliability of the methods and bias the results. There are also ethical concerns throughout the process, particularly regarding the inference of race or ethnicity, as opposed to the extraction of self-identity. However, the potential usefulness of social media research requires thoughtful consideration of the best ways to estimate demographic characteristics such as race and ethnicity [112]. This is particularly important, given the increased access to Twitter data [202,203]. Therefore, we propose several approaches to improve the extraction of race or ethnicity from social media, including representative research teams and a mixture of manual and computational methods, as well as future research on methods to reduce bias.

Acknowledgments
This work was supported by the National Institutes of Health (NIH) National Library of Medicine under grant NIH-NLM 1R01 (principal investigator: GG, with coapplicants KO and SG) and NIH National Institute on Drug Abuse grant R21 DA049572-02 to RS. The NIH National Library of Medicine funded this research but was not involved in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Data Availability
The included studies are available on the web, and the extracted data are presented in Table S2 in Multimedia Appendix 1. A preprint of this paper is also available: Golder S, Stevens R, O'Connor K, James R, Gonzalez-Hernandez G. 2021. Who Is Tweeting? A Scoping Review of Methods to Establish Race and Ethnicity from Twitter Datasets. SocArXiv. February 14. doi:10.31235/osf.io/wru5q.

Authors' Contributions
SG, RS, KO, RJ, and GG contributed equally to the study. RS and GG proposed the topic and the main idea. SG and RJ were responsible for literature search. SG, RS, KO, RJ, and GG were responsible for study selection and data extraction. SG drafted the manuscript. SG, RS, KO, RJ, and GG commented on and revised the manuscript. SG provided the final version of this manuscript. All authors contributed to the final draft of the manuscript.