Crowd annotations can approximate clinical autism impressions from short home videos with privacy protections

Artificial Intelligence (A.I.) solutions are increasingly considered for telemedicine. For these methods to serve children and their families in home settings, it is crucial to ensure the privacy of the child and parent or caregiver. To address this challenge, we explore the potential for global image transformations to provide privacy while preserving the quality of behavioral annotations. Crowd workers have previously been shown to reliably annotate behavioral features in unstructured home videos, allowing machine learning classifiers to detect autism using the annotations as input. We evaluate this method with videos altered via pixelation, dense optical flow, and Gaussian blurring. On a balanced test set of 30 videos of children with autism and 30 neurotypical controls, we find that the visual privacy alterations do not drastically alter any individual behavioral annotation at the item level. The AUROC on the evaluation set was 90.0% ±7.5% for unaltered videos, 85.0% ±9.0% for pixelation, 85.0% ±9.0% for optical flow, and 83.3% ±9.3% for blurring, demonstrating that an aggregation of small changes across behavioral questions can collectively result in increased misdiagnosis rates. We also compare crowd answers against clinicians who provided the same annotations for the same videos as crowd workers, and we find that clinicians have higher sensitivity in their recognition of autism-related symptoms. We also find that there is a linear correlation (r = 0.75, p < 0.0001) between the mean Clinical Global Impression (CGI) score provided by professional clinicians and the corresponding score emitted by a previously validated autism classifier with crowd inputs, indicating that the classifier's output probability is a reliable estimate of the clinical impression of autism. A significant correlation is maintained with privacy alterations, indicating that crowd annotations can approximate clinician-provided autism impression from home videos in a privacy-preserved manner.


Introduction
While artificial intelligence (A.I.) approaches are needed for healthcare to achieve scale and consistency, current A.I.-powered solutions to diagnostics in psychiatry and the behavioral sciences are underperformant due to the complexity of the underling behaviors. Until A.I. has advanced to a point where it can seamlessly pass the Turing Test [40] and understand human social behavior, with all its subtleties and nuances, to a degree that surpasses an average human's social acuity (i. e., to the level of a licensed clinician), it is unlikely that A.I. will replace any behavioral healthcare jobs. In the meantime, A.I. solutions which can augment the capabilities of non-expert humans may allow for scalable and accessible remote diagnostics while easing the burden of professional clinicians and the healthcare system to provide initial consultations in person.
A.I. is increasingly being considered to help healthcare solutions scale. For such solutions to be deployed in clinical settings, the system must garner maximal trust from all stakeholders. Privacy rises to the forefront of patient concern when humans are incorporated into the diagnostic pipeline [57]. Many parents are uncomfortable with sharing raw videos of their children with strangers online even if the purpose of sharing is to help the parents receive affordable and accessible diagnostic services [53]. To help ameliorate this uneasiness, privacy concerns must be directly addressed in any scalable solution where the humans in the loop are strangers (i.e., crowdsourced workers).
Here, we address these issues by studying an A.I.-augmented pipeline for detecting Autism Spectrum Disorder, or autism, from unstructured home videos. Autism is a developmental delay which is currently estimated to affect 1 in 40 children [28]. While access to care requires a formal diagnosis, access to diagnostic services is severely limited for many families, thus limiting potential care. Some evidence suggests that as much as 80% of families in the United States lack access to care [35], and underserved populations are disproportionately affected [17]. A.I. powered telemedical solutions therefore have the possibility to help these families, many of whom may otherwise lack access to traditional health services.
In order for such solutions to truly scale for annotation by a large crowd workforce, the privacy of the patients must be preserved. This is important in general for healthcare applications, but it is especially critical when the patients in question are young children with a developmental delay and observed by a stranger in their home. Our prior work has shown that applying targeted privacy-preserving modifications to videos, such as pitch shifting and covering the child's face with a virtual box, results in minimal degradation of the quality of crowdsourced annotations used for remote detection of autism-related behaviors [53]. However, such lightweight privacy protections may be insufficient to some patients, such as those who do not want the interior of their home exposed to strangers on the Internet.
Here, we explore the effect of standard visual privacy-preserving mechanisms on the annotation quality of videos of children with autism and matched neurotypical controls. We apply pixelation, dense optical flow, and Gaussian blurring, which are either standard methods for protecting the privacy of human subjects in image and video datasets [36] or for representing visual features for activity recognition algorithms [26]. We study the effect of these video transformations on annotation quality. On a balanced test set of 60 videos of children with autism and matched neurotypical controls, we find that no visual privacy condition deviates from the unaltered condition by more than half of a categorical ordinal severity point out of 4 questions corresponding to behavior severity. We compare crowd responses to professional clinicians and find that the probability score emitted by the classifier is consistent with the clinicians' global impression from watching the same video. We also find that, as expected, clinicians are more adept at identifying autism-related symptoms than crowd workers.

Balanced video dataset
We leveraged a balanced video dataset of 30 children with autism and 30 controls without autism. Both groups were gender and aged matched: we posted 30 videos of male children (13 with autism and 17 neurotypical) and 30 videos of female children (17 with autism and 13 neurotypical). The mean age in the videos of children with autism was 3.49 years old (SD = 1.58 years old), and the mean age in the videos of children without autism was 3.41 years old (SD = 1.39 years old).

Logistic regression classifier for predicting autism
We utilized a previously validated [38,39,53] logistic regression classifier for predicting autism vs. not autism from the answers to the multiple-choice questions we asked crowd workers. The classifier [29,31] was derived from the Autism Diagnostic Observation Schedule (ADOS) module 2 [32] and was trained on clinician filled scoresheets from the Boston Autism Consortium (AC), the Simons Simplex Collection v14 (SSC) [10], Autism Genetic Resource Exchange (AGRE) [11], National Database of Autism Research (NDAR) [16] and the Simons Variation in Individuals Project (SVIP) [37].
To generate confidence intervals, we performed 10,000 iterations of a bootstrapping procedure for each video. In each iteration, we sampled with replacement from the 60 videos used for evaluation and computed the accuracy, precision, recall/sensitivity, specificity, Area Under the Receiver Operating Characteristic (AUROC), and Area Under the Precision-Recall Curve (AUPRC) on the resulting video set.

Privacy conditions
We evaluated three privacy modifications applied to each video: pixelation ( Fig. 1), dense optical flow (Fig. 2), and Gaussian blurring (Fig. 3). Pixelation and Gaussian blurring are common methods for protecting the privacy of human subjects in image and video datasets [36]. Dense optical flow is a standard method for representing visual features for activity recognition [26] but also obfuscates much of the visual features of a frame, resulting in privacy preservation. To apply pixelation, we first resized the input frames down to 32 × 32 pixels. We then calculated the final pixelated frame by resizing the smaller frame back to the original frame size using bilinear interpolation. To calculate dense optical flow, we applied Farnebäck's algorithm for two-frame motion estimation based on polynomial expansion [9]. The image was colored through obtaining a 2-channel array with optical flow vectors, where the direction of the vectors corresponds to the hue of the image and the magnitude of the vectors corresponds to the value (lightness) of the image. To apply Gaussian blurring, we used a blurring kernel with 1/4th the width and height of the input image.
We note that pixelation ( Fig. 1) makes the background setting and human more discernible (less private) than Gaussian blurring (Fig. 3), and Gaussian blurring is more discernible (less private) dense optical flow (Fig. 2).

Privacy condition analysis
To understand the effects of the strength of the privacy condition on worker rating ability, we explored the effects of altering both blurring and pixelation intensity on crowd ratings. We did not conduct this analysis for optical flow since there is no clear parameterization of intensity level for this method. We used a separate set of 28 videos balanced by diagnosis, child age, and child gender. Crowd workers were randomly assigned groups. As with the main study, no crowd worker was assigned to two separate conditions (alteration, intensity level) of a video.

Crowd annotation
To recruit crowd workers, we posted a video rating task on the crowdsourcing website Microworkers.com [18]. In this video rating task, crowd workers were asked to answer a series of 13 multiple choice questions, with 4 answer choices each, about the child's behavior exhibited in each of 8 video videos (4 featuring a child with autism and 4 featuring a child without autism). We used the categorical ordinal variables corresponding to each worker's answers as the inputs to a pretrained binary logistic regression classifier for autism (see subsection Logistic regression classifier for predicting autism below for details about classifier training). 1,000 crowd workers completed the task and passed basic quality control checks for answer acceptance. Quality control measures included time spent on the annotation task and deviations in answers between videos [53].
To identify workers who were invited to participate in the final study, we measured the classifier's prediction on all 8 videos for all workers. We then measured the mean probability of the correct class (PCC) for each worker across all 8 videos, where the PCC is the classifier's output probability p when the true class of a video is autism and 1-p otherwise. Out of the 1,000 workers evaluated, exactly 40 workers had a mean PCC at or above 80%. We recruited these 40 crowd workers to rate the primary video set of 60 videos used in this study.
Each of the 40 crowd workers were tasked with rating all 60 videos described in the subsection Balanced video dataset. We randomly split the 40 workers into 4 groups, and each group was assigned to one privacy condition per video. Therefore, no worker saw more than one version of each video. All workers saw exactly 15 unaltered videos, 15 videos with pixelation, 15 videos with dense optical flow, and 15 videos with Gaussian blurring. This ensured that no privacy condition was affected

Clinician annotation
We recruited 19 clinicians to rate the same balanced video dataset and provided the same categorical ordinal video-wide annotations as the crowd workers. All clinicians were licensed professionals who provide diagnoses of autism as part of their job duties. We asked all clinicians an additional question which crowd workers were not asked: "Do you think the child has autism?" The answer choices were: • No, I am confident the child does not have autism (0) • No, but I am unsure (1) • Yes, but I am unsure (2) • Yes, I am confident the child has autism (3) All 60 videos received at least 1 rating by a clinician. Some videos were rated by more than 1 clinician, in which case we recorded the mean of the clinician answers for that video. For the "Do you think the child has autism?" question, we coded the responses from 0 to 3 as shown above.
Clinicians were also asked to provide a Clinical Global Impression (CGI) [12] rating for the children in the videos. The CGI scale measures the "severity of illness" between 1 ("normal, not at all ill") to 7 ("among the most extremely ill patients") and is designed to allow clinicians to provide a global impression without providing a formal diagnosis.

Item-level analysis
For each question that we asked crowd workers, we measured the mean absolute deviation of the mean answer for each privacy condition from the mean answer for the baseline condition. This difference provides a measure of the privacy condition's effect on annotation quality. We hypothesized that some questions would be more susceptible to alteration with certain privacy conditions applied.

Results
All procedures performed in studies involving human participants were approved by the Stanford University Institutional Review Board and are in accordance with the 1964 Helsinki declaration and its later amendments.

Performance of clinicians and crowd workers
Out of the 30 videos of children with autism, clinicians rated 8 videos confidently (with a mean autism rating above 2.5 out of 3.0). By contrast, clinicians rated 18 of the videos of neurotypical children confidently (with a mean autism rating below 0.5 out of 3.0). All 8 confidently rated videos were of children with autism, while 16 of the 18 neurotypical children were correctly identified (only 2 were actually diagnosed with autism). This suggests that clinicians observing remote videos of children are cautious about calling an autism diagnosis, but when they do guess a diagnosis, the child is very likely to actually have autism.
The clinician's classifier correctly identified 28 of the 30 autism cases while only correctly identifying 14 of the 30 neurotypical cases. By contrast, the crowd's classifier correctly identified 25 of the 30 autism cases and 27 of the 30 neurotypical cases. This suggests that clinicians are more sensitive to autism-related symptoms than crowd workers, thus resulting in a higher frequency of autism diagnoses by the binary classifiers.
There is a clear linear correlation (r = 0.75, p < 0.001) between the mean Clinical Global Impression (CGI) score provided by professional clinicians for each video and the corresponding classifier score emitted by the logistic regression classifier with crowd inputs (Fig. 4). This suggests that crowd responses in conjunction with machine learning algorithms can approximate clinician intuition, and the classifier's output can be interpreted as a reliable approximation of clinical global impressions of autism (see Fig. 5).

Effect of privacy conditions
The linear correlation between the mean Clinical Global Impression (CGI) score provided by professional clinicians for each video and the corresponding classifier score emitted by the logistic regression classifier with crowd inputs is maintained with privacy-preserving video modifications. The correlation is weaker for Gaussian blurring (r = 0.64, p = 0.001 for Gaussian blurring) than for dense optical flow and pixelation (r = 0.71, p = 0.0002 for pixelation for both).

Effect of blurring and pixelation intensity
Interestingly, we did not observe a dramatic difference between the privacy alteration intensities depicted in Figs. 1 and 3. Table 1 shows the mean performance metrics on a separate testing set with blurring intensities created using blurring kernel sizes of 1/6th, 1/5th, 1/4th, 1/ 3rd, and ½ the size of the original image as well as a blurring kernel the entire size of the image. Table 2 shows the performance on a third disjoint testing with pixelation intensities created using an intermediate frame size of 96 × 96, 64 × 64, 48 × 48, 32 × 32, 16 × 16, and 8 × 8. These results indicate that the most dramatic privacy-preserving alternations can be applied with minimal to no degradation of performance across the behavioral questions we asked crowd workers. Table 3 displays the mean absolute deviation of the mean answer for each privacy condition from the mean answer for the baseline condition. This difference provides a measure of the privacy condition's effect on annotation.

Item-level analysis
We found that pixelation resulted in smaller deviations from the unmodified video condition compared to dense optical flow and Gaussian blurring. In all but one behavioral annotation (sharing excitement), the mean deviation for pixelation was less than the other two privacy conditions. Dense optical flow, which provides maximal privacy, did not have a discernible difference from Gaussian blurring (private but less so), providing support for the use of dense optical flow in translational settings.
The annotation with the lowest deviation across all conditions was for displaying aggressive behavior. This result matches intuition, as no aggressive behavior was displayed in any of the videos we presented.
None of the 9 behaviors used for the classifiers contained a mean deviation above 0.5; the mean deviation is less than one half of the distance between one categorical ordinal variable representing symptom severity and the variable indicating one severity level higher (all questions contained 4 multiple choice options). We note that while these deviations are consistently small, the aggregation of these deviations results in higher rates of misclassification (see Results: Effect of privacy conditions).

Discussion and conclusion
We explored the potential for global image transformations to provide privacy for video subjects while preserving behavioral annotation quality. While no individual question was drastically degraded when privacy alterations were applied, some behavioral annotations were degraded more than others. Pixelation consistently resulted in less drastic degradations than blurring and optical flow. We also found that the classifier's predictions from the crowd's annotation of the unaltered videos were strongly correlated with clinician global impressions. A slightly weaker correlation persisted even after all privacy modifications we tested, providing evidence that the classifier's output can be considered as an estimation of clinical global autism impression scores even when annotations are provided for a privacy-preserved video.
There are several limitations to the present study. While the configuration of questions we asked workers and clinicians resulted in worse performance by the classifier using the clinicians' annotations, this could have been due to over-sensitivity of the classifier rather than anything the clinicians did incorrectly. We therefore do not make any claims about the performance of clinicians as compared to crowd workers. An interesting limitation is that the definition of autism tends to shift over time with evolving DSM criteria and clinical practices [1,33]. Because clinicians were providing annotations several years after the videos of children were recorded, it is possible that the children who did not qualify for a diagnosis at the time the videos were recorded would qualify for a diagnosis by the time the clinicians reviewed the Fig. 5. The linear correlation between the mean Clinical Global Impression (CGI) score provided by professional clinicians for each video and the corresponding classifier score emitted by the logistic regression classifier with crowd inputs is maintained with privacy-preserving video modifications. The correlation is weaker for Gaussian blurring (r = 0.64, p = 0.001 for Gaussian blurring) than for dense optical flow and pixelation (r = 0.71, p = 0.0002 for both).

Table 1
The effect of increasing levels of blurring intensity on mean performance of a separate testing set from the primary evaluation. videos.
There are several interesting avenues of future work. The annotations provided by crowd workers can potentially be used to train computer vision classifiers detecting behaviors relevant to autism detection such as emotion evocation [15,19,45,50], hand or head stimming [47], and abnormal eye contact. While humans are worse at detecting certain behavioral patterns from videos when privacy mechanisms are applied, it is possible that convolutional neural networks can more easily detect these features by learning subtle and nonlinear feature maps beyond human comprehension. Alternative privacy-preserving video alteration methods using deep learning could also be explored, such as generative adversarial networks (GANs). GANs can create privacy-preserved versions of the input video. Similarly, variational autoencoders can learn privacy-based features in the latent space. As machine learning algorithms continue to improve and relevant databases become more plentiful, the possibility of removing humans from the remote detection pipeline seems increasingly possible.
Future work should ensure that all methods work for all stakeholders. Such methods should therefore be evaluated across races, ethnicities, and other sensitive attributes to ensure fair and unbiased A.I. While some machine learning methods may help account for biased datasets, no technique matches the benefit of using balanced data. Fair and balanced healthcare A.I. initiatives must explicitly recruit participants in equal numbers across all demographics served to enable equitable services.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: DPW is the founder of Cognoa.com. This company is developing digital health solutions for pediatric care. All other authors declare no competing interests.

Table 2
The effect of increasing levels of pixelation intensity on mean performance of a separate testing set from the primary evaluation.  Table 3 The mean absolute deviation for each privacy condition from the baseline condition answers for the behaviors used as inputs to the autism classifier. This difference provides a measure of the privacy condition's effect on annotation quality.