Enabling naturalistic neuroscience through behavior mining: Analysis of long-term human brain and video recordings

Much of our understanding in human neuroscience has been informed by data collected in pre-designed and well-controlled experimental tasks, where the timings of cues, stimuli, and behavioral responses are known precisely. Recent technological advances have enabled us to study longer and increasingly naturalistic brain recordings, giving rise to a new paradigm named “naturalistic neuroscience”, in which the neural computations associated with spontaneous behaviors are studied. Analyzing such unstructured, long-term, and multi-modal data with no a priori experimental design remains very challenging. Here we present an automated approach for analyzing naturalistic datasets using behavior mining. Our analysis pipeline robustly uncovers and annotates instances of human upper-limb movements in long-term naturalistic behavior data (≈18 million video frames per patient) using algorithms from computer vision, time-series segmentation, and string pattern matching. We analyze simultaneously recorded human electrocorticography (ECoG) brain recordings to uncover neural correlates associated with these naturalistic events, and show that they corroborate prior results from traditional controlled experiments. We also demonstrate the efficacy of our approach as a source of training data for brain-computer interface decoders.


Introduction
Human neuroscience experiments are traditionally performed in controlled laboratory settings using carefully designed experiments to isolate the effects of a treatment or perturbation from a baseline control condition. Typically, the timing of cues or stimuli presented to the human subject, as well as the timing and measurements of their behavioral responses, are known precisely. Enabled by advances in brain recording technology, "naturalistic neuroscience" (Huk, Bonnen, & He, 2018) is an emerging paradigm to study neural computations associated with spontaneous behaviors in freely behaving subjects.
Accompanying this trend, a rich set of computational tools are being developed for automated analysis of animal behavior, a field also known as "computational ethology" (Anderson & Perona, 2014; Todd, Kain, & de Bivort, 2017; Berman, 2018; Pereira et al., 2019; Markowitz et al., 2018; Wiltschko et al., 2015). These methods process the video recordings of one or more animals through an extensive analysis pipeline, often including steps such as: segmenting the subject(s) from the background, centering and rotating the subject to a standard pose, imputing missing frames, tracking specific joints across frames, and classifying actions.
There have been some recent attempts at applying such automated analysis in human neuroscience (N. X. R. Wang, Olson, Ojemann, Rao, & Brunton, 2016; N. X. Wang, Farhadi, Rao, & Brunton, 2018; Alasfour et al., 2019; Chen et al., 2018). Though there exists a large body of computer vision research concerned with human action recognition (Ramasamy Ramamurthy & Roy, 2018), most methods are concerned with distinguishing coarse-grained activities such as sitting vs. walking, and rely on techniques requiring large amounts of labeled training data. To answer the kinds of questions that neuroscientists and neuroengineers are focused on, we instead need to be able to produce a fine-grained segmentation with sub-second temporal resolution, ideally using few or no behavior labels ahead of time.
Here we take a behavior-mining approach that robustly uncovers instances of human upper-limb movements in long-term, naturalistic behavior data and connects them to simultaneously recorded brain recordings. In particular, we use pose estimation algorithms to extract human arm pose trajectories and then segment these trajectories in time using unsupervised latent-variable models. Interestingly, having a discrete sequential representation of pose allows us to reduce the problem of detecting behavioral events to that of pattern matching on strings. Our approach also extracts additional behavioral metadata associated with the events, such as movement angle, magnitude, and duration. Lastly, we uncover neural correlates associated with the events by analyzing the simultaneously recorded human intracranial electrocorticography (ECoG) brain recordings; these naturalistic neural correlates further corroborate results from traditional, controlled experiments. Our preliminary investigations suggest that these naturalistic events may be used as training data for brain-computer interface (BCI) decoders.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Dataset
We use simultaneous brain and behavior recordings from consenting epilepsy patients undergoing long-term monitoring (≈7–10 days) at the University of Washington (UW) Harborview Medical Center in Seattle, WA, after obtaining approval from the UW Institutional Review Board's human subjects division.
Intracranial recordings are acquired at a sampling rate of 1000 Hz through ≈80 electrodes on the cortical surface. Our data collection is opportunistic: electrode locations and the duration of recording are determined by clinical needs. Audio and video recordings are obtained using a wall-mounted RGB/infrared camera recording at ≈30 frames per second (≈18 million video frames per patient over 10 days).
Together, these amount to ≈ 250 GB of data per patient.
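As a rough consistency check, these volumes can be reproduced with back-of-envelope arithmetic. The assumptions below (roughly 7 days of usable video, ECoG stored as 2-byte integers) are ours, not specifications of the recording system:

```python
# Back-of-envelope check of the recording volumes quoted above.
# Assumptions (ours): ~7 days of usable video at 30 fps, and ECoG
# stored as 2-byte integers over the ~10-day stay.
VIDEO_DAYS, FPS = 7, 30
frames = VIDEO_DAYS * 24 * 3600 * FPS  # ~18.1 million frames

ECOG_DAYS, CHANNELS, FS_HZ, BYTES_PER_SAMPLE = 10, 80, 1000, 2
ecog_gb = ECOG_DAYS * 24 * 3600 * CHANNELS * FS_HZ * BYTES_PER_SAMPLE / 1e9
# ~138 GB of raw ECoG, leaving the remainder of the ~250 GB total
# for compressed audio/video.
```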

Data Analysis Pipeline
Video recordings are processed through a pipeline of pose estimation, time-series segmentation, event detection, and metadata extraction.

Human Pose Estimation
We train a markerless pose-estimation tool (Mathis et al., 2018) on a small random sample (≈1000 frames) of the video recordings using manual annotations for multiple keypoints on the patient's body, namely the nose and both wrists, elbows, shoulders, and ears (Figure 1). The trained pose estimator is then scaled out using cloud computing to extract pose trajectories for the entire duration of the video recordings.
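The annotation subset can be drawn reproducibly. A minimal sketch of the frame-sampling step, under our own naming (the pose-estimation tool itself, e.g. DeepLabCut (Mathis et al., 2018), handles training and inference):

```python
import random

def sample_annotation_frames(n_total, n_sample=1000, seed=0):
    """Pick a reproducible random subset of frame indices to annotate
    by hand (about 1000 out of ~18 million frames) for training the
    pose estimator. A fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_sample))
```

Sorting the indices makes it cheap to extract the chosen frames in a single sequential pass over the video files.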

Pose Time-series Segmentation
We seek a segmentation of the pose time series that is easily interpretable and temporally precise. With this in mind, we apply a first-order autoregressive hidden Markov model (Wiltschko et al., 2015) with two latent states to the pose time series of a single keypoint at a time (e.g. left wrist, see Figure 2). Segmentation yields a representation of the pose dynamics as discrete rest and move states with high (per-frame) temporal precision. It is also robust to variation in lighting, camera angle, and level of activity in the video.
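To illustrate why a sticky two-state model suppresses frame-level jitter, the sketch below Viterbi-decodes a rest/move sequence using a simplified stand-in for the AR-HMM: per-frame wrist speed provides the emission evidence, and a sticky transition prior penalizes state flips. The thresholds, probabilities, and speed-based emission model are our simplifications, not the model of Wiltschko et al.:

```python
import math

def segment_two_state(speeds, move_thresh=2.0, stay_prob=0.95):
    """Viterbi-decode a two-state (0 = rest, 1 = move) chain over
    per-frame wrist speeds. Emissions come from a soft threshold on
    speed; a sticky transition prior discourages single-frame flips."""
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    def emit(speed):
        # soft evidence: (log P(speed | rest), log P(speed | move))
        p_move = 1.0 / (1.0 + math.exp(-(speed - move_thresh)))
        eps = 1e-12
        return (math.log(1.0 - p_move + eps), math.log(p_move + eps))

    e = emit(speeds[0])
    score = [e[0], e[1]]   # best log-prob of a path ending in each state
    back = []              # backpointers, one pair per subsequent frame
    for speed in speeds[1:]:
        e = emit(speed)
        new, ptr = [0.0, 0.0], [0, 0]
        for j in (0, 1):
            cands = [score[i] + (log_stay if i == j else log_switch)
                     for i in (0, 1)]
            ptr[j] = 0 if cands[0] >= cands[1] else 1
            new[j] = cands[ptr[j]] + e[j]
        score, back = new, back + [ptr]
    # backtrack the most likely state sequence
    j = 0 if score[0] >= score[1] else 1
    states = [j]
    for ptr in reversed(back):
        j = ptr[j]
        states.append(j)
    states.reverse()
    return states
```

Because two state switches cost more log-probability than a single frame's emission evidence can supply, isolated one-frame speed spikes are absorbed into the surrounding rest state, which is what gives the segmentation its per-frame precision without jitter.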

Event Mining
Discretizing the continuous pose time-series reduces the problem of finding behavioral events to a string pattern matching problem, for which there exist several algorithmic tools. We use regular expressions to extract events from this discrete representation. For example, movement-initiation events are found by searching for a pattern of 15 consecutive rest states followed by 15 consecutive move states. Similarly, no-movement events are found by looking for a state sequence of prolonged rest states across both wrists and the nose.
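With the per-frame states written as a string ('R' = rest, 'M' = move, one character per frame, so 15 frames ≈ 0.5 s at 30 fps), the movement-initiation search reduces to a single regular expression. The state string below is synthetic, for illustration only:

```python
import re

# Synthetic per-frame state string from the segmentation step:
# 'R' = rest, 'M' = move, one character per video frame.
states = "R" * 40 + "M" * 25 + "R" * 30

# Movement initiation: 15 consecutive rest frames followed by 15
# consecutive move frames; the capture group marks the first move frame.
initiation = re.compile(r"R{15}(M{15})")

onsets = [m.start(1) for m in initiation.finditer(states)]  # -> [40]
```

Reporting the onset as the index of the capture group (the first move frame) makes it straightforward to align each event with the ECoG timeline.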

Event Metadata Extraction
For each detected event, we extract several pieces of metadata (Figure 4) that can be used for analytics (Figure 5) or for filtering events before further analysis. Metadata extracted include the (x, y) pixel coordinates of a keypoint at the start and end of an event, the duration of an event, the rest durations before and after a movement event, and the average pose-estimation confidence (for filtering out poorly tracked events). When processing wrist keypoints, we define a reach as the wrist's maximum displacement during the event relative to its location at the event's start. In addition, we extract the magnitude, angle, and duration of the reach associated with each movement event.
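A minimal sketch of the reach computation for a single event; the function and field names are ours, and the angle is reported in the standard mathematical convention (image coordinates with y pointing down would flip its sign):

```python
import math

def reach_metadata(xs, ys, fps=30.0):
    """Reach metrics for one movement event: find the frame of maximum
    wrist displacement from the event's starting location, then report
    the magnitude (pixels), angle (degrees), and time to reach it (s)."""
    x0, y0 = xs[0], ys[0]
    disps = [math.hypot(x - x0, y - y0) for x, y in zip(xs, ys)]
    k = max(range(len(disps)), key=disps.__getitem__)  # argmax frame
    dx, dy = xs[k] - x0, ys[k] - y0
    return {
        "magnitude_px": disps[k],
        "angle_deg": math.degrees(math.atan2(dy, dx)),
        "duration_s": k / fps,
    }
```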

Towards Naturalistic Neuroscience
We evaluate the use of these naturalistic wrist movement-initiation events in two ways. First, we examine their neural correlates and, second, we use them to train neural decoders to detect future events.

Neural Correlates
We compute baseline-subtracted spectrograms for individual wrist movement-initiation events and average them to obtain event-averaged spectrograms (Figure 6). The observed activation pattern shows movement-associated increases in a high-frequency band (HFB, 76–100 Hz) and decreases in a low-frequency band (LFB, 8–32 Hz) in motor-cortex electrodes, corroborating previously reported findings (Miller et al., 2007).

Neural Decoding
We evaluate the use of our pipeline as a source of training data for a neural decoder that detects wrist movement-initiation, i.e., one that discriminates movement-initiation from no-movement events. We train a Random Forest classifier using spectral power features (HFB and LFB) extracted from the ECoG signal around event times (400 ms before to 600 ms after each event). The classifier is trained on events from days 3 and 4 of the patient's stay, and tested on events from day 5. Table 1 summarizes a preliminary assessment of performance on this task. The classifier performs well above chance, and its performance is comparable to previously reported decoders (N. X. Wang et al., 2018).

Figure 6: Spectrogram shows movement-associated activation reduction at low frequencies and activation increase at high frequencies, corroborating previously reported findings (Miller et al., 2007). [Inset] Electrode grid position (light blue) shown on a left side view of the patient's brain; the red electrode's spectrogram is shown here.
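A sketch of the event-window slicing that feeds the decoder; the band-power features and the Random Forest itself are omitted, and the function name and edge handling are our own:

```python
def event_windows(signal, event_times_s, fs=1000, pre_s=0.4, post_s=0.6):
    """Cut a fixed window around each event (400 ms before to 600 ms
    after, matching the decoder's feature window) out of a
    single-channel signal sampled at fs Hz. Events whose window would
    run past either end of the recording are skipped."""
    pre, post = int(pre_s * fs), int(post_s * fs)
    windows = []
    for t in event_times_s:
        i = int(round(t * fs))  # event sample index
        if i - pre < 0 or i + post > len(signal):
            continue
        windows.append(signal[i - pre:i + post])
    return windows
```

Each returned window has a fixed length (1000 samples at 1000 Hz), so per-window band-power features stack directly into a feature matrix for the classifier.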

Discussion
In summary, we describe a workflow for analyzing simultaneously recorded long-term naturalistic human brain and behavioral video data that does not involve any a priori experimental design. The events and their metadata derived using this pipeline are amenable to a variety of scientific analyses; here we have examined their neural correlates, and have used them to train neural decoders.
There are several extensions of our work that we are currently exploring. We can investigate multi-keypoint interactions at a fine-grained temporal resolution by combining single-keypoint state sequences. In particular, we are interested in the various possible temporal combinations of wrist movements (e.g., raising the left and right wrists simultaneously vs. one after the other). As a natural extension to predicting movement initiation, we plan to predict entire movement trajectories. Additionally, our pipeline can generate an order-of-magnitude more events than the traditional controlled laboratory experiment paradigm (Schalk & Leuthardt, 2011). This substantial expansion in training data could improve the accuracy of brain-computer interface decoders and make them more robust to variability when deployed in real-world scenarios.