Scratch-AID: A Deep-learning Based System for Automatic Detection of Mouse Scratching Behavior with High Accuracy

Mice are the most commonly used model animals for itch research and for development of antiitch drugs. Most labs manually quantify mouse scratching behavior to assess itch intensity. This process is labor-intensive and limits large-scale genetic or drug screenings. In this study, we developed a new system, Scratch-AID (Automatic Itch Detection), which could automatically identify and quantify mouse scratching behavior with high accuracy. Our system included a custom-designed videotaping box to ensure high-quality and replicable mouse behavior recording and a convolutional recurrent neural network (CRNN) trained with frame-labeled mouse scratching behavior videos, induced by nape injection of chloroquine (CQ). The best trained network achieved 97.6% recall and 96.9% precision on previously unseen test videos. Remarkably, Scratch-AID could reliably identify scratching behavior in other major mouse itch models, including the acute cheek model, the histaminergic model, and a chronic itch model. Moreover, our system detected significant differences in scratching behavior between control and mice treated with an anti-itch drug. Taken together, we have established a novel deep learning-based system that is ready to replace manual quantification for mouse scratching behavior in different itch models and for drug screening.

Mice are the most widely used model animals for studying itch mechanisms and for developing new pre-clinical anti-itch drugs (Han et al., 2013;Liu et al., 2009;Solinski et al., 2019;Sun and Chen, 2007).Since itch is an unpleasant sensation that provokes the desire to scratch (Ikoma et al., 2006), scratching behavior has been assessed as a proxy for itch intensity in mice (Liu et al., 2009;Morita et al., 2015).Till now, this quantification process has been mainly conducted by watching videos and manually counting scratching bouts or the total scratching time, which is tedious and time consuming, unavoidably introduces human errors and bias, and limits the large-scale genetic or drug screenings.
Given the biological importance and the obvious need, several research groups have tried different strategies to automate this process, including an acoustic recording method (Elliott et al., 2017), a method using magnetic field and metal ring to detect paw movement (Mu et al., 2017), and several video analysis-based approaches (Bohnslav et al., 2021;Kobayashi et al., 2021;Park et al., 2019;Sakamoto et al., 2022).Nevertheless, these methods have not been widely adopted by other research labs, due to the uncertain performance of the trained models in different lab environments, the requirement of specialized equipment, and/or inadequate evaluation of these methods in different mouse itch models.
In recent years, with the rapid progress in the field of artificial intelligence, deep learning has been applied in various scientific research areas.For example, convolutional neural networks (CNN) are widely used in computer visual recognition tasks (Gu et al., 2018), whereas recurrent neural networks (RNN) (Graves, 2013) are developed for analyzing temporal dynamic features.Moreover, rapid improvement of computing power, especially in the graphics processing unit (GPU) capacity, together with new open-source deep learning libraries, such as PyTorch (Paszke et al., 2019), Keras (Gulli and Pal, 2017) and Tensorflow (Abadi et al., 2016), have greatly accelerated broad applications of deep learning.
Animal behavior analysis is one of the research areas benefiting from the applications of deep learning.For example, DeepLabCut can track different body parts in freely moving animals for behavior analysis (Mathis et al., 2018).DeepEthogram recognizes and annotates different behaviors of mice and flies (Bohnslav et al., 2021).These examples support the proof-of-principle that deep-learning is a powerful avenue for automating animal behavior analysis.Nevertheless, for a given animal behavior, like mouse scratching, a designated method, which achieves high sensitivity, specificity, and generalization to replace human observers, still needs to be established.
To meet this challenge, we developed a new deep learning-based system, Scratch-AID (Automatic Itch Detection), which achieved automatic quantification of mouse scratching behavior with high accuracy.We first designed a videotaping box to acquire high-quality recording of mouse behavior in a reproducible environment from the bottom.We recorded 40 videos of 10 adult wildtype mice (5 males and 5 females) after nape injection of a non-histamine pruritogen, chloroquine (CQ), and manually labeled all video frames as the reference.We then designed a convolutional recurrent neural network (CRNN) by combining CNN and RNN and trained it with 32 scratching videos from 8 randomly picked mice.We obtained a series of prediction models using different training parameters and evaluated these models with test videos (8 unseen test videos from the 2 remaining mice).The best trained model achieved 97.6% recall and 96.9% precision on test videos, similar to the manual quantification results.Impressively, Scratch-AID could also quantify scratching behavior from other major acute and chronic itch models with high accuracy.Lastly, we applied Scratch-AID in an anti-itch drug screening paradigm and found that it reliably detected the drug effect.In summary, we have established a new system for accurate automatic quantification of mouse scratching behavior.Based on the performances, Scratch-AID is ready to replace manual quantification for various mouse itch models and for drug or genetic screenings.

Results:
The overall workflow Our workflow to develop a new system for detection and quantification of mouse scratching behavior consists of four major steps (Figure 1A): 1) Videotape mouse scratching behavior induced by an acute nape itch model; 2) Manually annotate scratching frames in all videos for training and test datasets; 3) Design a deep learning neural network; train this network with randomly selected training videos and adjust different training parameters; and evaluate the performance of the trained neural networks on test videos; 4) Evaluate the generalization of the trained neural network in additional itch models and a drug screening paradigm.

Design a videotaping box for high-quality and reproducible recording of mouse scratching behavior
High-quality videos recorded from a reproducible environment are critical for stable performance of trained prediction models and for adoption by other research labs.Thus, we designed a mouse videotaping box for such a purpose.It consisted of two boxes with white acrylic walls joined by a transparent acrylic floor (Figure 1B).The top ("mouse") box (Length x Width x Height = 14.68 x 14.68 x 5 cm) had a lid for mouse access.The bottom ("camera") box (Length x Width x Height = 14.68 x 14.68 x 23.6 cm) had a door for access to the camera (Logitech C920e Business Webcam).The walls and the lid of the box were non-transparent to minimize interference from outside visual stimuli.Ambient light penetrated the walls and provided sufficient illumination for behavior recording.A mouse could freely move inside the top box, and a camera recorded mouse behavior from the bottom (30 frames/second).Compared to the top or side views, the bottom view can clearly capture the key body parts involved in scratching behavior, such as the scratching hind paw, mouth, etc., and their movements in great details (Video S1).The magnification, resolution, and brightness of the video can be adjusted by the camera recording software (Logitech C920e Business Webcam driver and software) to achieve consistent video recording.In short, this customized videotaping box allows high-quality video recording of mouse scratching and other behaviors in a stable and reproducible environment.
Spontaneous scratching is a rare event under normal conditions in mice.Itch sensation and scratching are usually induced by different itch models for research.Common mouse itch models are classified as cheek or nape, histaminergic or non-histaminergic, and acute or chronic, based on the body location where itch sensation is evoked, the kind of pruritogens, and the duration of itch sensation (Ikoma et al., 2006;Liu and Dong, 2015;Shimada and LaMotte, 2008;Thurmond et al., 2008).We first used an acute nape itch model induced by a non-histaminergic pruritogen, chloroquine (CQ), because it triggered immediate and robust scratching behavior in mice (Liu et al., 2009).After intradermal injection of CQ (200 μg in 15 μl saline) into left or right nape of the mice, a 20-min video was recorded using the customized videotaping box (Figure 1C).
In response to CQ injection, mice scratched the affected skin region using their ipsilateral hind paw.Evoked mice displayed multiple episodes of scratching, which were separated by nonscratching phases.Each episode of scratching, here defined as a scratching train, usually contained four phases: start (lifting the scratching hind paw toward the affected skin), scratching bout (rhythmic movement of hind paw against the affected skin), paw licking (putting the scratching hind paw into the mouth and licking), and end (putting down the scratching hind paw back to the floor) (Figure 1D; Video S1).The cycle of scratching bout and paw licking might occur once or repeat multiple times, depending on the itch intensity and the internal state of the mouse.The time from the start to the end of a given scratching train is defined as the duration.The total scratching time is the sum of durations of all scratching trains, which is an effective parameter to quantify the scratching behavior and assess itch intensity.

Video annotation
40 scratching videos from 10 adult wildtype C57 mice (5 males and 5 females, 2-3-monthold) were recorded (Table S1).For neural network training and testing and for comparing performances between trained neural networks and manual quantification, two methods were used to annotate mouse scratching behavior in these videos.The first method (manual annotation) was to watch these videos at normal (1X) speed (30 frames/second) and label the start and end time points (converted into frame numbers for subsequent analysis) of each scratching train.This is consistent with the field common practice for manually quantifying mouse scratching behavior.Our manual annotation results were produced by 10 human observers, thus reflecting an averaged precision of the manual quantification process.The second method (reference annotation, or ground-truth annotation) was to accurately determine the start and end of each scratching train by analyzing each video frame-by-frame.The reference annotation of the 40 videos were used as the training and test datasets.The total number of scratching train in each video and the distribution of scratching train durations were quantified (Figures 1E and 1F; Table S2).

Deep learning neural network design and model training
Mouse scratching behavior displayed unique dynamic (temporal) and static (spatial) features, which were highlighted by tracking the key body parts using DeepLabCut (Mathis et al., 2018) (Figure S1; Video S2).One of the most obvious dynamic features was the rhythmic movement of the scratching hind paw (Figures S1A and S1B).Some unique static features included the relative positional relationships between the scratching hind paw and other body parts (Figures S1C-S1F).To fully capture these dynamic and static features, we designed a convolutional recurrent neural network (CRNN) to take advantage of the different strengths of CNN and RNN (Figures 2A and S2A).The CRNN contained a CNN (ResNet-18 (He et al., 2016)) (Figure S2B) that extracts static features, such as the relative position of different body parts, a RNN (2-layer bidirectional gated recurrent unit (GRU) (Dey and Salem, 2017)) (Figure S2C) that extracts dynamic features, such as the rhythmic movement of the scratching hind paw in consecutive frames, and a fully connected layer (the classifier) to combine the features extracted by both CNN and RNN and generate the prediction output.
The 40 videos were randomly split into two parts, 80% of them (32 videos from 8 mice) were assigned to the training dataset and 20% of them (8 videos from 2 mice) to the test dataset (Figure 2C).Each video was converted into individual frames, and each frame was classified as "scratching" (within a scratching train) or "non-scratching" (out of a scratching train) based on the reference annotation.For the training dataset preparation, N consecutive frames (a parameter adjusted for optimal model performance) were selected as one input to capture the dynamic features of scratching (Figure 2B).To avoid large sets of redundancy in the training dataset, two adjacent inputs were apart between 4 and 10 frames (Figure 2B).An input was labeled as "scratching" (class 1) if more than half of frames (N/2) in the input were scratching frames; otherwise labeled as "non-scratching" (class 0) (Figure 2B).
Since the dynamic features of scratching behavior spanned multiple frames, the input length (N frames) would be a critical training parameter.In CQ triggered acute nape itch model, the average duration of one cycle of scratching bout and paw licking was around 30 frames (Figures S3A and S3B).Thus, we tested a range of input length from 3 to 45 frames for model training.During the training process, the loss (discrepancy between model prediction and reference annotation) decreased, and the prediction accuracy (correct prediction of both scratching and non-scratching frames / all frames) increased (Figures 2D and 2E).After 10 epochs (one epoch means the training covers the complete training dataset for one round), the accuracy reached a plateau (Figure 2E).The prediction accuracies were more than 0.98 for all input length, improving slightly with the increase of the input length (Figure 2E).These results demonstrate that the designed CRNN network works very efficiently to capture the scratching features and recognize scratching behavior in the training dataset.

Model evaluation on test datasets
We evaluated the performance of the trained prediction models on 8 unseen test videos.First, similar to what described above, each test video was converted into inputs with "N" frames (the same "N" was used for training and test), except that the two adjacent inputs were only 1 frame apart.Second, the trained neural network predicted each input to be "scratching" or "nonscratching".Third, to convert the prediction from one input (containing N frames) into the prediction for each individual frame, we used the following rule: the prediction of the middle frame of each input would be the same as that of the input.For example, if an input was predicted as "scratching", then the middle frame of this input would be a "scratching" frame.This conversion predicted each frame of tested videos as "scratching" or "non-scratching" expect for the few frames at the beginning or at the end of a video (see method for the missing data interpretation).Fourth, recall (the number of correctly predicted scratching frames / the number of reference scratching frames), precision (the number of correctly predicted scratching frames / the number of all predicted scratching frames), and F1 score (2*recall*precision / (recall + precision)) were calculated.Compared to the overall accuracy (correct prediction of both scratching and nonscratching frames / all frames), the recall, precision, and F1 score give a more precise and in depth evaluation of a model's performance (Powers, 2020), especially when scratching is a relative rare event in a video.
To rule out the possibility that the good performance of our models was due to a specific combination of the training and test datasets, we rotated the training and test videos for cross validation.For all different combination of training and test videos, F1 scores were all above 0.9 (Figures S4A and S4B), supporting the stable and high performance of our prediction models.In addition, the prediction model performed better with the increased input length (Figure S4C).The best model, the one trained with videos 1-32 with the input length of 45 (N=45), was selected for additional analyses and tests.
The average recall and precision of the best model were 97.6% and 96.9%, respectively, for the 8 test videos (Figure 3A), and the recall and precision for individual videos were above 95% in most cases (Figure 3B).This performance was similar to or even slightly better than that of manual annotation, which had an average recall and precision of 95.1% and 94.2%, respectively (Figures 3C and 3D).When comparing the total scratching time to the reference annotation, the average discrepancy of the model prediction was 1.9% whereas that of manual annotation was 2.1% (Figure 3E).The correlation between the model prediction and the reference annotation was 0.98, similar to the manual annotation results (Figure 3F).When examining the probability traces of model prediction and the reference annotation (one example shown in Figure 3G), we found that the model successfully recognized almost all the scratching trains in test videos, and that the prediction of the start and end of each scratching train aligned well with the reference annotation (Figure 3H).Taken together, these results demonstrate the high reliability and accuracy of our model to recognize and quantify mouse scratching behavior of new videos.

The trained neural network model focused on the scratching hind paw to recognize mouse scratching behavior
How did the trained neural network model recognize mouse scratching behavior and distinguish them from other behaviors?Although deep learning neural networks are processed as a black box, saliency maps can give some hints (Selvaraju et al., 2017), because they plot which parts of each frame (pixels) were mainly used during model prediction.The most salient parts were centered around the scratching hind paw in the scratching frames (Figures 4A, 4B, and S5A), suggesting that the prediction model focused on the features of the scratching hind paw.In some scratching frames, other body parts were also highlighted, such as the two front paws (Figure 4B), suggesting the model also utilized the positional relationship of these body parts to recognize scratching behavior.In contrast, for other "non-scratching" mouse behaviors, such as wiping, grooming, rearing and locomotion, the "salient" parts showed no clear association with particular mouse body parts (Figures 4C-4F, and S5B-S5E).Together, these saliency maps indicate that the trained neural network learns to focus on the dynamic and static features of scratching for the prediction of mouse scratching behavior.

Model prediction error analysis
To further understand the performance of the best trained neural network model, we systematically analyzed its prediction errors in 8 test videos (Figure S6) and compared to those from the manual quantification (Figure S7) and other trained models.We classified the prediction errors into 5 categories: Type 1, false positive (non-scratching region was predicted as a scratching train); Type 2, false negative (a real scratching train was not recognized); Type 3, blurred boundary (the prediction of the start or end of a scratching train was shifted); Type 4, missed interval (two or more adjacent scratching trains were predicted as one scratching train); and Type 5 split scratching train (one scratching train was predicted as two or more scratching trains) (Figure S6A).We found that the dominant prediction error of the trained neural network model was Type 3 error, accounting for around 3% of the total scratching frames, followed by Type 2 and Type 5 errors accounting for around 1% (Figure S6B).For manual quantification, the major errors came from Type 3 and Type 4, accounting for 10% and 8% of the total scratching frames, respectively (Figure S7A).
For Type 1 error of the model prediction (Figures S6C1-S6C3), the durations of all false positive scratching trains were shorter than 10 frames (0.3 s) (Figure S6C2) and temporarily close to a real scratching train (within 30 frames, < 1 s) (Figure S6C3).They were not caused by confusion with other behaviors, such as wiping, grooming, rearing, locomotion and resting (Figure S8).Type 1 error was also rare for manual annotation (Figure S7B).
The models might miss short scratching trains, hence caused the Type 2 error.Indeed, all missed scratching trains were shorter than 40 frames (< 1.3 s) (Figures S6D1 and S6D2).For all scratching trains lasting less than 30 frames (< 1 s), 18.5% of them were missed by the model prediction.This number decreased to 2.7% for scratching trains spanning between 30 to 60 frames (1-2 s).No scratching train was missed if they were longer than 60 frames (> 2 s) (Figure S6D3).The Type 2 error positively correlated with the input length (N) of prediction models.It became zero or close to zero when models trained with shorter input lengths (3, 13 and 23 frames) (Figures S9A-S9C).For manual annotation, Type 2 error was not common (Figures S7C1 and  S7C2).
Type 3 error (Figures S6E1-S6E3) was dominant among all five type errors.The average start and end frame shift of the model prediction were 2.2 and 7.0 frames (Figure S6E3), while those of the manual annotation were 11.5 and 12.8 frames (Figures S7D1 and S7D2).The start and end frame shift of the manual quantification was similar (~350 ms), which likely reflected the temporal delay of the real time human visual system processing.For the model prediction, which was an after-event frame by frame process, the temporal shifts were less than the human visual processing.In addition, the start of a scratching train was more accurately recognized by the model than the end of a scratching train (Figure 6E3).This might reflect the feature of scratching trains.It was relatively clear to determine the start of a scratching train when a mouse lifted its hind paw, but more ambiguous to determine when a mouse put its hind paw back onto the floor to complete a scratching train.The start and end shift didn't correlate with the length of a scratching train (Figure S9D).Thus, the relative error (percentage of error frames) would decrease when the duration of a scratching train increases.Indeed, the prediction accuracy (as indicated by the F1 score) positively correlated (R 2 =0.5723) with the average scratching train duration in a video (Figure S9E).Type 4 error was caused when two adjacent scratching trains were too close to each other and were predicted as one scratching train (Figures S6F1-S6F3).All missed intervals were shorter than 30 frames (< 1 s) (Figure S6F2).Conversely, 51.4% of intervals less than 30 frames between the two adjacent scratching trains were not recognized.All intervals longer than 30 frames were recognized (Figure S6F3).Type 4 error was more common in manual annotation than in the model prediction (Figures S7E1 and S7E2).
Type 5 error occurred when one scratching train was predicted as two or more scratching trains, separated by mispredicted intervals.The average lengths of these mispredicted intervals were around 10 frames by model prediction and around 40 frames by manual annotation (Figures S6G1-S6G2, and S7F).When reviewing these intervals, we found that more than 80% of them were within or partially overlapped with a paw licking phase (Figure S6G3), especially when the duration of the paw licking was more than 30 frames (Figure S9F).Thus, it seems likely that the model predicted some long licking frames within a scratching train as "non-scratching".Type 4 and Type 5 errors reflect the intrinsic complexity of scratching behavior.Human definitions and field consensus, such as using "2s" as the threshold between two adjacent scratching trains (Darmani and Pandya, 2000), would help to reduce these types of errors.
In summary, we have established a novel system combining a customized videotaping box and a well-trained CRNN neural network (N45) to automatically identify and quantify mouse scratching behavior with high accuracy.We named it as the Scratch-AID system.

Performance of the Scratch-AID system on other major acute itch models
In addition to the nape, the other commonly used body location to induce mouse itch sensation is the cheek (Shimada and LaMotte, 2008).To test whether the Scratch-AID system trained by the nape CQ model could also recognize and quantify scratching behavior of the cheek model, we injected CQ (200 μg in 15 μl saline) into the cheek of 5 wildtype mice (3 males and 2 females) and recorded 7 videos (Figure 5A).We compared the scratching behavior quantification by the Scratch-AID and manual annotation.The Scratch-AID prediction had 93.4% recall, 94.8% precision, and 0.941 F1 score (Figure 5B), while those of manual quantification were 96.0%, 88.6%, and 0.919 (Figure 5C).The correlation between the Scratch-AID prediction and reference annotation was 0.9926, and that of the manual quantification was 0.9876 (Figures 5D and 5E).The total scratching time in individual videos from both model prediction and manual annotation were close to the reference annotation (Figure 5F).These results demonstrate that the Scratch-AID system can reliably identify and quantify scratching behavior triggered by acute itch sensation in the cheek.
Different pruritogens administrated at the same body location trigger scratching behavior with different dynamic features (Wimalasena et al., 2021).Thus, we tested whether the Scratch-AID system trained by CQ injection could recognize scratching behavior triggered by a different pruritogen, histamine.100 μg histamine (in 15 μl saline) was intradermally injected into the nape and 4 videos were recorded (Figure 5G).The recall, precision, and F1 score from the Scratch-AID prediction were 96.6%, 90.91%, and 0.936 (Figure 5H), while those of the manual annotation were 96.3%, 80.5%, and 0.877 (Figure 5I).The correlation between the Scratch-AID prediction and reference annotation is 0.9707, and that of the manual quantification is 0.9895 (Figures 5J  and 5K).The total scratching time in individual videos from both model prediction and manual annotation were similar to those of reference annotation (Figure 5L).Taken together, these data show that our Scratch-AID system, although only trained with the CQ nape acute itch model, can recognize and quantify scratching behavior of acute itch models induced at different skin locations or triggered by different pruritogens, and that the prediction accuracy of Scratch-AID is comparable to that of the manual annotation.

Performance of the Scratch-AID system on a chronic itch model
Chronic itch is a devastating symptom and severely affects the quality of patients' life (Yosipovitch and Fleischer, 2003;Yu et al., 2021).Investigating the underlying mechanisms using the mouse model is important for developing novel therapies for treating chronic itch in various conditions.To test whether the Scratch-AID system could be used to study mouse chronic itch models, we generated a squaric acid dibutylester (SADBE) induced contact dermatitis chronic itch model (Beattie et al., 2022;Qu et al., 2015) and recorded 9 videos from 3 wildtype mice (Figure 6A).Affected mice displayed spontaneous scratching towards the nape and/or head.Noticeably, dynamic features of spontaneous scratching behavior under this chronic itch condition were different from those exhibited by the CQ acute itch model: the total scratching time was less for the same given period of time (20 min) (Figure S10A), and the average duration of the scratching trains was shorter (53 frames on average) than the acute scratching behavior induced by CQ (280 frames on average) (Figures S10B and 1F).Despite these considerable differences, the recall, precision, and F1 score of the Scratch-AID prediction were 84.1%, 88.5% and 0.862, respectively, compared to 84.1%, 66.3%, and 0.740 of manual annotation (Figures 6B and 6C).A likely reason for the decreased recall and precision of both model prediction and manual annotation was the dominant short scratching trains in this chronic itch model (Figure S10B).The correlation between the model prediction and reference annotation was 0.9845 (Figure 6D), which was comparable to the manual annotation 0.9887 (Figure 6E).The total scratching time quantified by the model was slightly more accurate than the manual annotation (Figure 6F).From the prediction traces (an example shown in Figure 6G), the Scratch-AID system was capable to capture the low-frequency short scratching trains.

Application of the Scratch-AID system in anti-itch drug screening
Finally, we tested whether the Scratch-AID system could be applied for anti-itch drug screening.Here we used the histaminergic acute itch nape model and the pretreatment of Benadryl (Loew et al., 1946), a FDA approved anti-histaminergic itch cream, as a proof-ofprinciple example.Benadryl or control cream was topically applied onto the mouse nape skin 1 hour before the intradermal injection of histamine (200 μg in 15 μl saline).Scratching behavior was then recorded for 20 min from 6 C57 wildtype mice (2 male, 4 female) in the Benardryl treated group and 7 C57 wildtype mice (2 male, 5 female) in the control group (Figure 7A).Quantification of the total scratching time (frames) using the Scratch-AID showed a significant reduction with Benardryl treatment (Figure 7B).Similar results were found by manual annotation (Figure 7C).These results suggest that the Scratch-AID system is sensitive to detect the change of scratching behavior after an anti-itch drug treatment, which highlights the potential use of the Scratch-AID system in high throughput and large-scale anti-itch drug screenings.

Discussion:
Scratching is an itch specific behavior, and the mouse is the major model animal to study itch mechanisms and to develop novel anti-itch drugs.In this study, we developed a new system, Scratch-AID, which combined a customized videotaping box and a well-trained neural network, for automatic quantification of mouse scratching behavior with high accuracy.Its performance is comparable to the manual annotation on major itch models and an anti-itch drug screening paradigm.As far as we are aware, this is the first deep-learning based system that can achieve such high accuracy and broad generalization.
It is remarkable that a model trained with CQ induced acute nape itch videos could reliably recognize scratching behavior in other itch models, even when the body sites (nape vs. cheek) or the dynamics of scratching behavior (e.g., acute vs chronic itch, different pruritogens) were different.The impressive performance and generalization of the Scratch-AID system is likely attributed to the high-quality and reproducible video recording; a large amount of high-quality training datasets with the frame-by-frame annotation; efficient CRNN deep learning neural network design; and training parameter optimizations.
Variable video recording condition is a major barrier that has prevented the adoption of a trained neural network by different labs.The trained models usually don't perform well with videos recorded under different conditions (illumination, field size, image resolution, magnification of mice, image angle, backgrounds, etc.).Thus, we built a customized videotaping box to provide a reproducible and high-quality recording environment (Figure 1B).This helps to standardize the videotaping process, reduce the noise, and facilitate the stable performance and generalization of the trained deep-learning models.Meanwhile, this video recording box is easy to be set up, scaled up, and adopted.Combined with the well-trained prediction model that has a comparable performance as manual quantification, the Scratch-AID system is ready to replace the manual quantification.
The amount and quality of training datasets are important to train a high accurate and generalized model.In general, the performance of a prediction model positively corelates with the size of the training dataset.In this study, we recorded 40 scratching videos in total and chose the CQ induced acute nape itch model, which triggered robust scratching behavior in mice, for model training and testing.The large number of scratching videos and the high amount of incidence of targeted behavior in each video provided a high-quality training dataset.In addition, the clear definition of scratching train and accurate annotation of videos, as conducted frame by frame, were also critical for training, testing, and error analysis.
CRNN is the classical deep learning neural network for analysis of animal behaviors.Here, we used ResNet (He et al., 2016) for the CNN part, which simplifies the neural network model by constructing a residual learning block through a shortcut connection of identity mapping (Figure S2B).GRU was used for the RNN part for extracting dynamic features (Figure S2C).Our study shows that ResNet and GRU make a good combination of deep learning architecture for analyzing animal behaviors.In addition, our network design is highly efficient that the accuracy plateau was achieved after only 10 epochs of training.
There is still some room to improve the performance of our predication models.To increase the capability of Scratch-AID to capture short scratching trains, we could train the CRNN neural network with scratching videos from chronic itch models.In addition, optimization of training parameters could also help with the improvement of prediction accuracy.For example, different error types varied when changing the input length (Figures 6B and S9A-S9C).The Type 2 and Type 4 errors increased when the input length increased, while Type 1, Type 3 and Type 5 errors decreased when the input length increased.Thus, it's a trade-off to optimize and select the best input length for different itch models.Increasing the size of the training dataset may also help to get better prediction models.When checking the videos with relatively low prediction accuracy, for example, the video number 5 (V5) in the chronic itch model (Figure 6F), we found that the missed scratching frames could be due to a rare posture during scratching behavior, in which the scratching hind paw was partially occluded by the tail (Video S3).Thus, with more training videos containing some rare scratching postures, the trained neural network models could be more "knowledgeable" for the diversity of scratching behavior.
Quantification of animal behaviors is critical for understanding the underlying molecular, cellular and circuit mechanisms.Compared to manual analysis, the deep learning based automatic analysis will not only improve efficiency and accuracy, but also reduce human bias and errors.Along this direction, we developed the Scratch-AID system to achieve automatic quantification of mouse scratching behavior.Our study also provides useful insights for developing new deep learning neural network models to achieve automatic analysis of other animal behaviors.

Figure S2. The architecture of deep learning neural network (A)
The cartoon showing the architecture of designed CRNN neural network.The input was first fed into CNN (ResNet-18), then followed by a 2-layer bidirectional gated recurrent unit (GRU).A final FC layer generated the binary prediction results.(B) Details of the modified ResNet-18 network.The final FC layer was modified to output frame-wise embedding instead of classification result.Conv, convolutional layer; Batch norm, batch normalization layer; ReLU: Rectified linear unit; "+", element wise plus for vector.(C) Details of one gated recurrent unit (GRU) of the 2-layer bidirectional GRU.The unit connected to embedding of frame t (x t ) took the last unit's output (h t-    The incidence rate of each type of errors, calculated by the ratio of the total frames in each type error over the total scratching frames of 8 test videos.Error bar, S.E.M. (C1-C3) A real example of false positive prediction (C1), the duration of all false positive scratching trains (C2), and the frequency distribution of distances between false positive scratching trains to the nearest real scratching train (C3).(D1-D3) A real example of false negative prediction (D1), the duration of all false negative scratching trains (D2), and the Type 2 error rate for scratching trains with different duration (D3).(E1-E3) A real example of blurred boundary prediction (E1), the distribution (E2), and the average lengths (E3) of start and end shift.Error bar, S.E.M. (F1-F3) A real example of missed interval (F1), the duration of all missed intervals (F2), and type 4 error rate for intervals with different duration (F3).(G1-G3) A real example of split scratching train (G1), the length of split frames (G2), and distribution of the type 5 error linked to paw licking (G3).

Mouse lines and treatments
Mice were housed in the John Morgan animal facility at the University of Pennsylvania.All animal treatments were conducted in accordance with protocols approved by the Institutional Animal Care and Use Committee and the guidelines of the National Institutes of Health.

Acute itch model
Acute itch models were conducted as previously described (Cui et al., 2022).Mouse cheeks or napes were shaved one to three days prior to experiments, and mice were placed in the videotaping box for acclimation (15 min/day for three days).At the day of experiment, mice were first acclimated to the videotaping box for 5 min, chloroquine (Sigma, C6628) (200 ug in 15 μl saline) or histamine (Sigma, H7250) (100 ug in 15 μl saline) was then intradermally injected into either the cheek or the nape, and scratching behavior was recorded for 20 min.

Behavior recording
The mouse scratching behavior was recorded using a Microsoft laptop (Windows 10 Pro, purchased from Amazon) connecting to a web camera (Logitech C920e Business Webcam, purchased from Amazon).Logitech Capture 2.06.12 Software was used to adjust the following recording parameters: brightness 170, contrast 0, resolution 720X720, frame rate 30 fps.The brightness and contrast could be adjusted according to ambient light to achieve consistent illumination.

Annotation of mouse scratching behavior
The start of a scratching train was defined as when the mouse started to lift up its hind paw and prepared to scratch at the beginning of a scratching train.For the end of a scratching train, there were two cases.If the mouse did not lick its hind paw after last scratching bout, the end frame would be when the mouse put its hind paw back onto the ground; if the mouse licked its hind paw after last scratching bout, the end frame would be when the mouse put its hind paw into the mouth.
For the manual annotation, each video was manually watched at the normal (1X) speed (30 frames/second).The start and end time point of each scratching train was manually annotated, the time point was converted into the frame number (30 frames/second) for downstream analysis.
For the reference annotation, each video was first converted into individual frames using python package OpenCV (Bradski, 2000).Then the start and end of each scratching train was determined frame by frame.Frames within a scratching train were defined as "scratching" frames, otherwise defined as "non-scratching" frames.

Training sample preparation and preprocessing
In the training procedure, 40 videos from 10 mice were first randomly split into 32 training videos from 8 mice (V1-V32), and 8 test videos from 2 mice (V33-V40).For cross validation, training and test videos were rotated.Test videos were V1-V8, V9-V16, V17-V24, V25-V32 or V33-V40, and the rest of videos were used as training videos.For the model training, input was prepared by the following procedure: First, each video was converted into frames; then, consecutive N frames (N = 3, 5, 7, 13, 23 or 45) were selected as one input with an interval of 4-10 frames between two adjacent inputs.Then each input was annotated as "scratching" (class 1) if more than N/2 frames were scratching frames, otherwise labeled as "non-scratching" (class 0).
All frames in each input were first converted into grey scale images and resized to 300X300.Then a random square crop with size ranging from 288X288 to 300X300 was applied and followed by a random horizontal and vertical flip with probability of 0.5.Finally, these frames were resized to 256X256 and fed into the CRNN network.

CRNN architecture
Our CRNN model (Figure 2A and S2) consisted of a CNN part, an RNN part and a full connection (FC) part.The CNN was modified from ResNet-18 (He et al., 2016) by changing the final FC layer into a FC layer with embedding size 256.The RNN consisted of 2 bidirectional GRU layers with hidden vector size 512.The FC part consisted of two FC layers with embedding size 256 and ReLu activation, and embedding size 2 for the final prediction.The prediction results were transformed into maximum value in the final output vector.

Model training and predication
The model was trained by PyTorch (version 1.10.2) (Paszke et al., 2019) .For model training hyperparameters, batch size was set as 16 or 32 depending on the input size.The max epoch was set as 20.ADAM optimizer was used with initial learning rate 10 -4 and the learning rate reduced by multiplying factor 0.3 every 5 epochs.The binary cross entropy was used as loss function.Dropout rate of FC layer was set as 0.2.The model was trained on a customized desktop with Intel i9-10900k CPU (purchased from Newegg), 64 GB RAM (CORSAIR Vengeance LPX 64GB, purchased from Newegg) and NVIDIA GeForce RTX 3090 with 24 GB memory (purchased from Amazon).
For model prediction on new videos, the input preparation was similar to the training dataset, except that the adjacent inputs were only 1 frame apart.Each individual frame was predicted as "scratching" or "non-scratching" based on the following rule: the prediction of the middle frame of each input would be the same as the input prediction.For the few frames at the beginning or at the end of each video (depending on the input length N) that could not be the middle frame of an input, they were predicted as "scratching" or "non-scratching" based on the first input or last input prediction.

Saliency map
To obtain the saliency map, we first calculated gradient value of each pixel and kept the only positive gradients, then rescaled into range 0 to 1 based on a previous published paper (Selvaraju et al., 2017).Then we generated the heatmap based on the gradient values (< 0.1 transparent, 0.1-0.6 light blue to dark blue, > 0.6 dark blue) and stacked the heatmap onto the original frame.

SADBE chronic itch model
Mice were singly housed before starting the behavior experiment.In day 1, mice were individually anesthetized in chamber using isoflurane until they didn't move and showed decreased respiration rates.Mice were continued for anesthesia with a nose cone to allow access to the body of the mouse.After ensuring a mouse was fully anesthetized, the abdominal skin was shaved.In day 4, after anesthesia similar in day 1, 25 uL 1% SADBE in acetone (Sigma, 3339792) was applied to the shaved area of the abdominal skin.After application of SADBE, anesthesia continued for 3 more minutes to ensure SADBE fully absorbed before putting then back in the home cage.In day 9-12, mice were habituated in the videotaping box for 1 hour and 15 minutes.In day 11, the nape was shaved after habituation.In day 13-18, mice were habituated in the videotaping box for 1 hour and 15 minutes, then 25 uL of 1% SADBE was applied to the nape.In day 18, after 1% SADBE application, each mouse was put back into the videotaping box and recorded for three videos.

Figure 1 .
Figure 1.The overall workflow and building a customized videotaping box for mouse scratching behavior recording (A) A diagram showing the workflow to develop a deep learning-based system for automatic detection and quantification of mouse scratching behavior.(B) An image of the designed videotaping box for high-quality video recording of mouse scratching behavior.Scale bar, 5 cm.(C) A cartoon showing the acute itch model induced by the CQ injection in the nape, followed by video recording in the customized videotaping box.(D) Representative images showing different phases (P1-P4) of a scratching train (upper).Red arrows indicate the scratching hind paw.A cartoon showing the dynamic movement of the scratching hind paw in a scratching train (bottom).The cycle of scratching bout (P2) and paw licking (P3) may repeat once or more times in a scratching train.Scale bar, 1 cm.(E) The total number of scratching trains in each video.(F) The distribution of scratching train duration (n= 1135 scratching trains).The inset is the zoom-in of the red rectangle.

Figure 2 .
Figure 2. Deep learning neural network design and training (A) Cartoon showing the architecture of designed deep learning neural network consisting of the combination of CNN, RNN and classifier.(B) Cartoon showing the preparation of inputs for the training dataset.Consecutive N frames were selected as one input for training.The interval between two adjacent inputs in a video were 4-10 frames.(C) The information of a sample training and test datasets.(D, E) The training loss decreased (D) while the accuracy increased (E) during the training process with different input length (N = 3, 5, 7, 13, 23, 45 frames).The inset is the zoom-in of part of the figure.

Figure 3 .
Figure 3. Performance of the best model on test videos (A, B) The recall, precision and F1 score of the best model on average (A) or in individual videos (B).(C, D) The recall, precision and F1 score of manual annotation on average (C) or in individual videos (D).(E) The comparison among model prediction, manual quantification, and the reference annotation.The reference annotation is normalized to 100% shown as the red line.(F) The correlations between model prediction or manual quantification and reference annotation.R 2 , Pearson correlation coefficient.(G) An example scratching probability trace (red curve) predicted by the model and aligned with the reference annotation (green bar).(H) The two zoom-ins from (G) showing the nice alignment between the model prediction and the reference annotation.

Figure 4 .
Figure 4.The prediction model focused on the scratching hind paw for scratching behavior recognition (A, B) Saliency map showing the gradient value of each pixel of scratching frames during mouse scratching behavior prediction by the best model.The model focused on the scratching hind paw (A, B) and other body parts, such as front paws (B).Scale bar, 2 cm.(C-F) Saliency map showing the gradient value of each pixel of wiping (C), grooming (D), rearing (E) and locomotion (F) frames during mouse scratching behavior prediction by the model.

Figure 5 .
Figure 5.The Scratch-AID performance on other acute itch models (A) A cartoon showing an acute itch model induced by CQ injection in the mouse cheek.(B, C) The average recall, precision and F1 score of Scratch-AID (B) or manual annotation (C).Error bar, S.E.M. (D, E) The correlation between model prediction (D) or manual quantification (E) and reference annotation.R 2 , Pearson correlation coefficient.(F) The comparison among model prediction, manual quantification, and reference annotation.The reference annotation is normalized to 100% shown as the red line.(G) A cartoon showing an acute itch model induced by histamine injection in the mouse nape.(H, I) The average recall, precision and F1 score of

Figure 6 .
Figure 6.The Scratch-AID performance on a chronic itch model (A) A cartoon showing a SADBE induced chronic itch model.(B, C) The average recall, precision and F1 score of Scratch-AID (B) or manual annotation (C).Error bar, S.E.M. (D, E) The correlation between model prediction (D) or manual quantification (E) and reference annotation.R 2 , Pearson correlation coefficient.(F) The comparison among model prediction, manual quantification, and reference annotation.The reference annotation is normalized to 100% shown as the red line.(G) An example scratching probability trace (red curve) predicted by the model and aligned with the reference annotation (green bar) (left).Zoom-in (right panel) of the blue square part showing nice alignment of the model prediction and the reference annotation.

Figure 7 .
Figure 7. Application of the Scratch-AID systemin a drug screening paradigm (A) A diagram showing the experimental design of an anti-itch drug test.(B, C) Quantification of scratching behavior in anti-itch cream treated group or control group by Scratch-AID (B) or manual annotation (C).Error bar, S.E.M.

Figure S1 .
Figure S1.Dynamic and static features of scratching behavior in the CQ nape acute itch model (A, B) the scratching hind paw (red arrow in A) but not the contralateral hind paw displayed rhythmic vibration during scratching behavior (B).(C-F) The distance between scratching hind paw and nose (C) or mouth (D), the angle consisting of nose, scratching hind paw, and ipsilateral front paw (E), and angle consisting of scratching hind paw, mouth, and tail (F) in the scratching or non-scratching frames.(N=1756 frames for non-scratching, 2214 frames for scratching).Error bar, S.E.M.

Figure S3 .
Figure S3.Distribution of durations of scratching bout and paw licking in the CQ nape acute itch model (A, B) The frequency distribution of scratching bout (A) or paw licking (B) durations.The inset figure is the zoom-in of the red square part.

Figure S4 .
Figure S4.Cross validation and parameter optimization of the prediction models (A) Cross validation of the trained prediction models by rotating the training and test datasets in 40 videos.The F1 score was calculated on the 8 test videos.(B) The best performance of the N45 models trained with different training and test dataset combos.(C) Model performances with different input lengths.

Figure S5 .
Figure S5.Saliency map of mouse scratching and other behaviors during the predictionAdditional saliency maps showing the gradient value of each pixel for frames during scratching (A), wiping (B), grooming (C), rearing (D) and locomotion (E) during the scratching behavior prediction of the best model.

Figure S6 .
Figure S6.Error analysis of the best prediction model (A) Cartoon showing five types of prediction errors.Red curves indicated scratching probability predicted by the model, and green bars indicated reference scratching trains.(B)The incidence rate of each type of errors, calculated by the ratio of the total frames in each type error over the total scratching frames of 8 test videos.Error bar, S.E.M. (C1-C3) A real example of false positive prediction (C1), the duration of all false positive scratching trains (C2), and the frequency distribution of distances between false positive scratching trains to the nearest real scratching train (C3).(D1-D3) A real example of false negative prediction (D1), the duration of all false negative scratching trains (D2), and the Type 2 error rate for scratching trains with different duration (D3).(E1-E3) A real example of blurred boundary prediction (E1), the distribution (E2), and the average lengths (E3) of start and end shift.Error bar, S.E.M. (F1-F3) A real example of missed interval (F1), the duration of all missed intervals (F2), and type 4 error rate for intervals with different duration (F3).(G1-G3) A real example of split scratching train (G1), the length of split frames (G2), and distribution of the type 5 error linked to paw licking (G3).

Figure S7 .
Figure S7.Error analysis of the manual quantification (A) The incidence rate of each type of errors in manual annotation, calculated by the ratio of the total frames in each type error over the total scratching frames of 8 test videos.Error bar, S.E.M. (B) The duration of all false positive scratching trains.(C1-C2).The duration of all false negative trains (C1), and the Type 2 error rate for scratching trains with different duration (C2).(D1-D2) The distribution (D1) and the average lengths (D2) of start and end shift.(E1-E2) The duration of all missed intervals (E1), the type 4 error rate for intervals with different duration (E2).(F) The length of split frames.

Figure S8 .
Figure S8.Other mouse behaviors were not recognized as scratching behavior (A-E) Examples of scratching behavior prediction probability during wiping (A), grooming (B), rearing (C), locomotion (D) and resting (E).

Figure S9 .
Figure S9.Relationship between prediction errors and the input length or the scratching train duration (A-C) The five type error rates of models trained with different input lengths.Error bar, S.E.M. (D) The correlation between the scratching train duration and the length of start shift, end shift, or start shift plus end shift.(E) The correlation between the average scratching train duration in a video and the prediction accuracy (F1 score).(F) Frequency distribution of all paw licking duration and those linked to Type 5 error.

Figure S10 .
Figure S10.Different dynamic features of chronic and acute itch models (A) The percentage of scratching and non-scratching frames in the SADBE chronic itch model (n = 9 videos) and CQ nape acute itch model (n = 40 videos).(B) Frequency distribution of scratching train duration of SADBE chronic itch model and CQ nape acute itch model.

Fig
Fig. 1: The overall workflow and build a customized videotaping box for mouse scratching behavior recording