A survey of skin tone assessment in prospective research

Increasing evidence supports reduced accuracy of noninvasive assessment tools, such as pulse oximetry, temperature probes, and AI skin diagnosis benchmarks, in patients with darker skin tones. The FDA is exploring potential strategies for device regulation to improve performance across diverse skin tones by including skin tone criteria. However, there is no consensus about how prospective studies should perform skin tone assessment in order to take this bias into account. There are several tools available to conduct skin tone assessments including administered visual scales (e.g., Fitzpatrick Skin Type, Pantone, Monk Skin Tone) and color measurement tools (e.g., reflectance colorimeters, reflectance spectrophotometers, cameras), although none are consistently used or validated across multiple medical domains. Accurate and consistent skin tone measurement depends on many factors including standardized environments, lighting, body parts assessed, patient conditions, and choice of skin tone assessment tool(s). As race and ethnicity are inadequate proxies for skin tone, these considerations can be helpful in standardizing the effect of skin tone on studies such as AI dermatology diagnoses, pulse oximetry, and temporal thermometers. Skin tone bias in medical devices is likely due to systemic factors that lead to inadequate validation across diverse skin tones. There is an opportunity for researchers to use skin tone assessment methods with standardized considerations in prospective studies of noninvasive tools that may be affected by skin tone. We propose considerations that researchers must take in order to improve device robustness to skin tone bias.

underreported: only 10% (7/70) of deep learning algorithms include information about skin tone 16, and few report performance by skin tone categories 17. Further, there is no gold standard for skin tone labeling, and commonly used practices like estimated Fitzpatrick Skin Type are limited by uncertainty 18 and lack of inclusiveness (refs. 10, 11, 12, 19). To improve generalizability of assessments and algorithm fairness, it is critical that patient skin tone variation be taken into account in validation studies of noninvasive technologies. There are initiatives to modify device monitoring regulation criteria, such as those released by the FDA in November 2023 20. In this review, we will (1) briefly review the background and methods for skin tone measurement within health care, then (2) provide detailed study considerations for measuring skin tone in prospective trials.

Results
Part I. Review of skin tone assessment
Defining skin tone. Color is the perception of light based on characteristics such as hue, lightness, and saturation. Physiologically, the inherent color of the skin, defined as "skin tone", is the result of light-absorbing compounds called chromophores. The most abundant chromophores in humans are melanin (pheomelanin and eumelanin), carotene, oxygenated hemoglobin, and reduced hemoglobin 21. In general, the two major contributors to skin tone are melanin, which produces a brown tint, and hemoglobin, which creates red and purple-blue hues 22. Common methods for discriminating skin tone for the purpose of validating noninvasive assessment tools are administered visual scales and color measurement tools 23 (Table 1). Skin tone can also be extracted from camera images through a variety of techniques 24. For the purposes of this paper, automated skin color extraction, modeling, and labeling will not be discussed 23,25,26.
Administered visual scales. Administered visual scales, such as Von Luschan, Monk, and Pantone, use numbered colored tiles that are matched to a person's skin tone (Table 1). Fitzpatrick Skin Type (FST) was originally developed to assess tanning and burning propensity; however, many use FST as a proxy for skin tone 27 despite evidence that FST is poorly correlated with objective measurements of skin color 27,28. Although widely available and inexpensive to administer, visual scales can be limited by subjectivity. Furthermore, visual scales can be affected by the complexity of human color perception, which is influenced by light, anatomic site, the context of the object, and a person's unique experiences with similar objects 29-32.
Color measurement tools. Color measurements are objective measurements achieved through reflectance spectrophotometry (Konica Minolta CM700D, Variable Spectro) and colorimetry (Delfin SkinColorCatch) (Table 1). In current work, color measurement tools are being used in response to the limitations of visual scales, providing objective measurements to increase precision in color quantification 33-35. Color measurement tools offer greater color precision, but they are expensive and sensitive to environmental influences 25,26.
Cameras and color spaces. Several color models provide a framework to describe color output systematically or mathematically. One of the most common, the RGB model (red, green, blue), was developed to mimic the primary colors perceived by the eye 32. It encodes color in an additive fashion, where a full-intensity combination of all three colors results in white. Other color models include HSL (hue, saturation, lightness), CIELAB (lightness, green-red gradient, blue-yellow gradient), and CIECAM02 (brightness, lightness, chroma, saturation, hue, and "colorfulness") 32,36. Most color reproduction in printed work uses a CMYK (cyan, magenta, yellow, black) color space. A color space is a specific organization of colors that are mapped to values in a color model in a standardized way. The standardized RGB color space (sRGB) is the most commonly used space for representing digital images on displays 26,37.
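To illustrate how values in one color space map to another, the following is a minimal sketch of the standard conversion from 8-bit sRGB to CIELAB. It assumes a D65 reference white and the commonly published sRGB-to-XYZ matrix; the function name `srgb_to_lab` is hypothetical and chosen only for this example. It is not the procedure of any particular device or study discussed here.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (L*, a*, b*).

    Assumptions (not taken from the review): D65 reference white and
    the standard sRGB transfer function and RGB-to-XYZ matrix.
    """

    def to_linear(channel):
        # Undo the sRGB gamma encoding to obtain linear-light values.
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB to CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        # Piecewise cube-root function from the CIELAB definition.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta * delta) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)

    l_star = 116.0 * fy - 16.0   # lightness
    a_star = 500.0 * (fx - fy)   # green-red axis
    b_star = 200.0 * (fy - fz)   # blue-yellow axis
    return l_star, a_star, b_star
```

Under these assumptions, a pure white pixel maps to approximately L* = 100 with a* and b* near zero, and a pure black pixel maps to L* = 0. Camera-derived values would still depend on the lighting and white balance factors discussed in Part II.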
Part II. Considerations for study design
Body part assessed. A summary of recommendations for skin tone measurement can be found in Table 2. Skin tone reflects a combination of genetic factors and environmental influences and can be grouped as constitutive (baseline skin color) or facultative (skin color altered by sun exposure). Constitutive skin color is best characterized in sun-protected areas, which are more likely to represent unaltered baseline pigmentation 38. However, perceived skin tone may also be influenced by exogenous factors including artificial tanner, makeup, or tattoo pigmentation. Depending on the technology being validated, the inclusion of at least one constitutive skin site may be important given its decreased variability across seasons and increased correlation with skin phototype 39. The upper volar arm has been proposed as a reliable measurement site for constitutive skin color given its low seasonal variability and ease of access 40. Otherwise, body part selection may be predetermined by the application (e.g., using the finger or earlobe in pulse oximetry).
Underlying conditions that can affect skin tone. Several conditions can affect the relative concentration and distribution of chromophores and alter skin tone assessment. Therefore, study designs incorporating skin tone measurement should consider pigmentary disorders (e.g., vitiligo or melasma) and medical conditions (e.g., anemia and jaundice) that influence skin tone. Perfusion-related changes in skin (e.g., flushing, blanching) can also affect skin tone assessment 34. To minimize these effects, it is recommended that skin measurements occur in a pressure-free state and at rest, and that investigators collect as much information as feasible about factors that influence skin tone at the time of measurement.
Ambient lighting. The impact of lighting on the perception of color is critical and commonly overlooked in study design. Ambient lighting varies in properties such as brightness and color temperature. It can influence color perception based on time of day and location and may skew skin tone perception, making skin appear lighter or darker 41. Ambient lighting should be both sufficient and standardized to increase the accuracy and precision of skin tone assessment. To avoid variability in daylight conditions, artificial lighting with a color temperature similar to natural light (5000-6500 K) could be helpful 42. A controlled illumination source, combined with an ambient light-blocking feature, can significantly enhance light isolation and improve the signal-to-noise ratio 43.
Location. Considerations for skin tone assessment depend in part on the location of the patient population under study. For example, an outpatient clinic may be a single location where ambient lighting and temperature may be more easily standardized. Patients are often mobile, making it more feasible to incorporate skin tone measurements on less accessible sun-protected body parts (e.g., lower back). These measures may be difficult when collecting remote photos from a patient's home, where lighting may not be standardized and the number of body parts available for measurement may be limited. Additionally, longitudinal study designs may need to account for patients' variable sun exposure. To address this, previous studies have included non-sun-exposed body parts, advised participants to avoid sun exposure, and/or required daily use of sunscreen 44.
In contrast, an inpatient population presents complications when attempting to achieve a fixed environment for data collection and synchronization of measurements. The data collection environment can potentially be improved by understanding the unique workflow of standard care and adjusting each patient's room to mimic a standardized environment. Although dependent on study design, study procedures may need to take into consideration patient health status, iatrogenic complications, patient mobility, and other external factors that could hinder the timing essential for the completion of measurements. For instance, in the context of pulse oximetry, short time windows for data collection may be needed to minimize potential discrepancies between arterial blood gas (ABG)-pulse oximetry measurements and skin tone readings 45.
Dataset balance: skin tone and race. Clinical research of all kinds must incorporate racial and ethnic diversity to ensure results are generalizable, especially when evaluating devices for clinical use 46,47. Underrepresentation of minority groups may lead to a higher risk of adverse reactions or reduced efficacy. In fact, the NIH has issued a policy and guideline requiring all phase III clinical trials to ensure analysis by sex/gender, race, and/or ethnicity 48. However, multiple studies have demonstrated large variations in skin tone within racial and ethnic subgroups 49, and skin tone may directly influence bias of technologies beyond race. This highlights the potential need to balance datasets specifically by skin tone in addition to race. Ensuring dataset balance by skin tone may pose several challenges. Since skin tone varies within a person and across time, one may consider balancing only by constitutive body site or averaging skin tone from multiple body sites. Variation in both skin tone and race/ethnicity within a dataset will enhance the generalizability of research results. However, investigators may consider balancing by either skin tone or race with a minimum threshold of other parameters based on their research question. Initial explorations by the Food and Drug Administration are soliciting input on methods to integrate skin tone measures into clinical device studies, but a standard has not been established 50.
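As a simple illustration of checking dataset balance against a minimum threshold, the sketch below tallies participants per skin tone category and flags categories that fall under a chosen fraction of the cohort. The function names, the use of a 10-level categorical scale in the usage note, and the threshold value are all assumptions made for this example; no regulatory or consensus threshold is implied.

```python
from collections import Counter


def skin_tone_balance(labels, categories, min_fraction):
    """Summarize how a cohort is distributed across skin tone categories.

    labels       -- one categorical skin tone label per participant
    categories   -- every category the scale defines (so empty ones appear)
    min_fraction -- hypothetical minimum share of the cohort per category
    Returns {category: (count, fraction, meets_threshold)}.
    """
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for category in categories:
        n = counts.get(category, 0)
        fraction = n / total if total else 0.0
        report[category] = (n, fraction, fraction >= min_fraction)
    return report


def underrepresented(labels, categories, min_fraction):
    """List the categories whose share falls below min_fraction."""
    report = skin_tone_balance(labels, categories, min_fraction)
    return [c for c in categories if not report[c][2]]
```

For example, with labels `[1, 1, 2, 3, 3, 3, 10]` on a 10-level scale and an illustrative threshold of 0.1, levels 4 through 9 would be flagged as underrepresented. The same check could be repeated per racial or ethnic group if both factors are being balanced.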
Considerations for administered visual scales. Visual scales are low-cost tools that can distinguish skin tones with relatively high reliability, require minimal training, are widely available, and can be used in various forms of analysis (retrospective, prospective, and post-hoc).
Limitations of these scales are that they are influenced by user perception (e.g., color blindness) and environmental conditions (e.g., ambient lighting). There are also several visual scales available, which can make comparison of skin tone data across studies difficult. Few studies have assessed the relative utility of visual scales. FST was designed as a questionnaire to determine tanning and burning propensity, but when used as a proxy for skin tone it is only weakly associated with a visual color scale (p < 0.0001) 51. There have been attempts to create RGB-defined FST visual scales, although these have not been widely adopted 52. The Von Luschan scale (VLS) has been shown to be highly correlated with narrow-band spectrophotometry, with one study showing the correlation of VLS and melanin + erythema index to be 0.90 (p < 0.001) 53. The scales with more levels (Pantone, Taylor Hyperpigmentation) offer greater granularity and shade range for skin tone assessment, but may be more challenging to administer reliably. For the Pantone scale, one study investigating vascularized allotransplantation matching found inter-rater skin tone assessment to be fair (k = 0.454) and intra-rater reliability to be substantial (k = 0.725) 54. A newer, 10-point Monk scale shows high reliability for crowdsourced annotators (ICC 0.86-0.94), but has not been tested in a medical context 55. Price may also be a consideration when choosing a scale. While the Monk scale is freely available and free to use and the FST questionnaire is easily accessible online, the Pantone scale is only available for purchase.
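Agreement statistics such as the kappa values quoted above can be computed directly from paired ratings. The sketch below implements unweighted Cohen's kappa for two raters assigning categorical skin tone labels to the same participants. It is offered as an illustration only: the cited studies may have used weighted or multi-rater variants, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects.

    rater_a, rater_b -- equal-length sequences of categorical labels,
                        aligned so that index i is the same participant.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(rater_a)

    # Observed agreement: share of participants given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

As a worked case, ratings `[1, 2, 3, 3, 2, 1]` and `[1, 2, 3, 2, 2, 1]` agree on five of six participants (observed agreement 5/6) with chance agreement 1/3, giving kappa = 0.75. For ordinal scales with many levels, a weighted kappa that gives partial credit to near-misses may be more appropriate.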
Considerations for color measurement tools. Colorimetric and spectrophotometric devices are used in a wide range of study designs to assess skin tone by quantifying melanin, erythema, and overall skin pigmentation. Common reflectance colorimetric and spectrophotometric devices are composed of an illuminator, standard observer, and a tristimulus measurement system. The illuminator applies a fixed light source to the surface of interest, and specific wavelengths are then isolated to obtain color details without the influence of outside lighting conditions when the device is pressed gently to the skin. Colorimetric and spectrophotometric devices are easily operated, non-invasive tools for obtaining objective skin tone measurements.
In varying clinical settings, handheld colorimetric and spectrophotometric devices (e.g., Delfin SkinColorCatch, Konica Minolta, and Variable Spectro) may be easier to use to assess body parts while also maintaining patient comfort. Although the devices demonstrate moderate to high interobserver reliability, they are potentially high-cost, and most have not had large-scale published validation. A few studies have attempted to compare the utility of objective color measurement tools, but were limited in scope 33,56,57. Consequently, no particular type of colorimetric or spectrophotometric device has been proven to be superior 33.
Considerations for cameras. Cameras are widely used and are available in the pocket of almost everyone a patient interacts with in medical practice. There are also large datasets of images that have been acquired to train AI algorithms and other applications 11,58. However, it can be challenging to extract skin tone information from a photograph alone. Camera type, export compression level, and lighting are critical to consider 59. One of the most important modifiable factors is the white balance, which affects the relationship between red, green, and blue pixel values 60. The use of cross-polarizing filters can help reduce specular reflection 61 and improve skin tone evaluation, especially in darker-toned individuals 43. Reference color charts or color calibration cards (e.g., X-Rite, Douglas, DSC Labs, QPcard, Macbeth) can be used within the frame of the image or before/after in identical conditions to improve reproducibility across devices, but will not be available for retrospectively captured images 59,60. After acquisition, proper image processing is necessary to maintain color accuracy across mediums. Although popular image storage formats like JPEG increase computational efficiency, image compression can lead to artifacts and lost image parameter information. Therefore, RAW image format may be helpful for maintaining color consistency, although large file size may limit its utility, and many photography devices (e.g., phones) cannot acquire in RAW format 32.
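To make the role of white balance and in-frame reference cards concrete, the following sketch rescales red, green, and blue values so that a neutral gray patch photographed in the frame matches its known reference value. This is a deliberately simplified, hypothetical example: the function names are invented here, real pipelines typically work in linear light and fit a full color-correction matrix from many chart patches, and the patch values in the usage note are made up.

```python
def channel_gains(measured_gray, reference_gray):
    """Per-channel gains that map a photographed gray patch to its reference.

    measured_gray  -- (R, G, B) of the neutral patch as captured in the image
    reference_gray -- (R, G, B) the patch should have under ideal conditions
    """
    if any(channel <= 0 for channel in measured_gray):
        raise ValueError("measured patch channels must be positive")
    return tuple(ref / meas for ref, meas in zip(reference_gray, measured_gray))


def apply_gains(pixel, gains, max_value=255):
    """Apply the gains to one (R, G, B) pixel, clamping to the valid range."""
    return tuple(
        min(max_value, max(0, round(channel * gain)))
        for channel, gain in zip(pixel, gains)
    )
```

For instance, if a gray patch with a reference value of (180, 180, 180) is captured as (200, 180, 160) under warm lighting, the gains are (0.9, 1.0, 1.125), and applying them to the captured patch returns (180, 180, 180). The same gains would then be applied to the skin pixels in that image.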
Scale/device reliability. The process of evaluating device bias against skin tone measurement is nascent. The portability and low cost of visual scales will need to be balanced against the potentially greater accuracy of color measurement technologies, which also provide continuous measurements rather than categorical bins. Having at least two raters use visual scales and conducting triplicate readings for color measurement tools will increase color precision. When possible, comparison between multiple color measurement instruments will be valuable to the specific study and field.

Discussion
Consideration of skin tone in device validation studies across medicine is important to reduce the bias against patients with darker skin tones that exists in pulse oximetry, AI diagnosis, and many other areas of medicine. These biases will worsen existing healthcare disparities unless they are addressed and measured directly. In many cases, race and ethnicity play specific roles in equity-focused medicine, and biased outcomes arise due to these sociodemographic factors 62. A considerable amount of research is focused on race and ethnicity, but for technologies that rely on light for measurement, bias may be specifically related to skin tone. This highlights the need for increased awareness of the limitations of current medical devices, which are associated with systematic error and pronounced inaccuracies among patients with darker skin tones.
In this review, we highlight common tools used for skin tone measurement and discuss pertinent study design considerations for accurate skin tone assessment. There is currently no gold standard tool, as each has relative pros and cons and validation is largely absent. Visual scales are more readily available for prospective or post-hoc analysis but may be influenced by user perception, while color measurement tools offer objective, sensitive measurements but can be expensive with variable reliability. In addition to tool choice, investigators should consider how patient-level factors may affect the validity of skin tone measurement, including selection of a body site, medical conditions affecting skin tone, and minimization of perfusion-related color changes. Furthermore, creation of a standardized environment with consistent lighting and camera settings will promote improved color consistency. This paper is a narrative review, and therefore results are limited by the non-systematic approach. Further, the study is not powered to directly compare the utility of skin tone assessment modalities or quantify the potential effect of study design parameters on skin tone accuracy.
When prospectively evaluating devices that may be influenced by skin tone, incorporation of skin tone measurement will play an important role in considering these potential biases. The current review offers researchers a tool to aid in development of skin tone assessment protocols. We encourage researchers to continue to focus on validating devices against a diverse and representative dataset, and when possible, to make public the skin tone measurements for future use and calibration.
In conclusion, increasing evidence shows bias and increased error in noninvasive tools across medicine in patients with darker skin tones. We provide guidance and considerations for conducting skin tone assessments using administered scales (e.g., Fitzpatrick, Pantone, Monk) and color measurement tools (colorimeters, spectrophotometers), and we encourage device validation to include at least one color measurement tool. As investigators become more aware of skin tone and consider it as a variable in future work, we will be able to reduce skin tone biases in medical devices.
receives research funding from Lutris Pharma, and in-kind research support from Kaggle and AWS through the Open Data Program. J.W.G. is a 2022 Robert Wood Johnson Foundation Harold Amos Medical Faculty Development Program. All other authors have no financial or nonfinancial competing interests to disclose.

Table 1 | Descriptions of common skin tone measurement tools. This is a nonexclusive list.

npj Digital Medicine | (2024) 7:191

Table 2 | Considerations for skin tone measurement in prospective research studies

- Race should not be used as a proxy for skin tone
- Datasets should be balanced by skin tone and race to account for disparities from either factor

Considerations for Measurement Tools

Choice of Administered Scale
- Administered scales are low cost, widely available, and useful in prospective and post-hoc analysis, but can be influenced by user perception
- Choice of administered scale will depend on price, size, range, and tool consistency
- Larger scales may have more granularity but decreased inter-rater reliability

Choice of Color Measurements
- Colorimetric and spectrophotometric devices are easily operated and non-invasive
- Color space measurements can be interconverted, allowing for data sharing
- Several colorimeters and spectrophotometers exist, and choice may depend on device size, cost, and clinical setting
- Able to quantify erythema, melanin, pigmentation, and skin color

Choice of Camera
- Camera choice of commercial (iPhone, DSLR) vs specialized (dermatoscope, VECTRA, VISTA) may depend on price, access, and setting
- Camera settings, including aperture, shutter speed, lighting, and image format, should be standardized
- White balance is a critical modifiable factor; thus, one may consider using color calibration cards in the image frame