Multi-Object Tracking in Heterogeneous environments (MOTHe) for animal video recordings

Aerial imagery and video recordings of animals are used in many areas of research, such as animal behaviour, behavioural neuroscience and field biology. Many automated methods are being developed to extract data from such high-resolution videos. Most of the available tools are developed for videos taken under idealised laboratory conditions; the task of detecting and tracking animals in videos taken in natural settings therefore remains challenging due to heterogeneous environments. Methods that are useful for field conditions are often difficult to implement and thus remain inaccessible to empirical researchers. To address this gap, we present an open-source package called Multi-Object Tracking in Heterogeneous environments (MOTHe), a Python-based application that uses a basic convolutional neural network for object detection. MOTHe offers a graphical interface to automate the various steps related to animal tracking, such as training-data generation, animal detection in complex backgrounds and visual tracking of animals in the videos. Users can also generate training data and train a new model, which can then be used for object detection on a completely new dataset. MOTHe does not require any sophisticated infrastructure and can be run on basic desktop computing units. We demonstrate MOTHe on six video clips with varying background conditions. These videos are from two species in their natural habitat: wasp colonies on their nests (up to 12 individuals per colony) and antelope herds in four different habitats (up to 156 individuals in a herd). Using MOTHe, we are able to detect and track individuals in all these videos. MOTHe is available as an open-source GitHub repository with a detailed user guide and demonstrations at: https://github.com/tee-lab/MOTHe-GUI.

MOTHe is organised into modules dedicated to the following tasks.

1. System configuration: The system configuration is used to set up MOTHe on the user's system. Basic details are specified here, such as the path to the local repository, the path to the video to be processed, the size of the individual to be cropped and the size of the bounding box to be drawn during the detection phase.

2. Dataset generation: Dataset generation is a crucial step towards object detection and tracking, and the manual effort required to generate a sufficient amount of training data is huge. The data-generation class and executable semi-automate the process by allowing the user to crop regions of interest with simple clicks over a GUI; the cropped images are automatically saved in the appropriate folders. For our models, we used roughly 9,000 animal images and 18,000 background images. These images were selected from a subset of frames from all the videos, so as to cover various background contexts and animal appearances. We selected roughly 15 sparsely spaced frames from 45 videos and cropped animal and background regions to generate the training data.

3. Training the neural network: A convolutional neural network is trained on the generated dataset to classify cropped regions as animal or background (the choice of architecture is described below).

4. Object detection: The trained network is used to detect animals in the video frames via the two-stage localisation and classification procedure described below.

5. Object tracking: The tracking module assigns unique IDs to the detected individuals and generates their trajectories. We have separated the detection and tracking modules so that detection alone can be used by someone interested only in count data (e.g. surveys). This modularisation also gives experienced programmers the flexibility to substitute more sophisticated tracking algorithms. For the tracking task we use existing code from https://github.com/ctorney/uavTracker, which uses Kalman filters and the Hungarian algorithm (see the assignment sketch below). The tracking script can be run once the detections have been generated in the previous step. The output is a .csv file containing individual IDs and locations for each frame; a video overlaid with the unique ID of each individual is also generated.

To perform the detection task, we first identify the areas in an image where an object may be found; this step is called localisation or region proposal. We then classify these regions into different categories (e.g. animal or background); this step is called classification. The localisation step is performed using an efficient thresholding approach that restricts the number of individual classifications that need to be performed on the image. As discussed earlier, the colour thresholding approach does not perform well in complex videos, and in most cases there is a trade-off between false positives and missed identifications. To utilise colour thresholding for region proposal, we err on the side of false positives so that we retain all the potential regions where an animal can be found. These identified keypoints are then used to run the classification step (see the detection sketch below).

We selected the network architecture using a trial-and-error approach on several networks having 6-8 convolutional layers and varying parameters for these layers. We also explored different parameters for the activation functions and dropout layers. Of these networks, the best-performing architecture on the validation dataset was chosen.
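As a concrete illustration, a candidate network of this kind can be written compactly with the Keras API. This is a minimal sketch only: the layer counts, filter sizes, dropout rate and the 40x40 input size are illustrative placeholders, not the final architecture chosen for MOTHe.

```python
# A minimal sketch of a small animal-vs-background classifier in Keras.
# All hyperparameters below (layer counts, filter sizes, dropout rate,
# 40x40 crop size) are illustrative placeholders, not MOTHe's final values.
from tensorflow.keras import layers, models

def build_classifier(input_shape=(40, 40, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dropout(0.5),  # dropout was one of the parameters explored
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(animal) for the crop
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Candidate architectures like this one would be trained on the cropped animal and background images and compared on the validation set, as described above.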
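The two-stage detection can then be sketched as follows, assuming OpenCV for the thresholding stage. The function name, threshold value, minimum blob area and crop size are illustrative assumptions rather than MOTHe's actual API; in MOTHe, such values come from the system configuration step.

```python
import cv2
import numpy as np

def detect_animals(frame, model, crop=40, thresh=150, min_area=20, p_cut=0.5):
    """Two-stage detection: a permissive threshold proposes candidate
    regions; the CNN classifies each crop as animal or background."""
    # A simple grayscale threshold stands in for the colour thresholding
    # described above; it is deliberately loose so that false positives
    # are kept rather than animals missed.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    detections = []
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue
        x, y, w, h = cv2.boundingRect(c)
        cx, cy = x + w // 2, y + h // 2  # keypoint for this candidate region
        # Stage 2: classify a fixed-size crop centred on the keypoint,
        # assuming the network was trained on [0, 1]-scaled crops.
        half = crop // 2
        patch = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
        if patch.shape[:2] != (crop, crop):
            continue  # skip crops truncated at the image border
        p_animal = model.predict(patch[None].astype("float32") / 255.0,
                                 verbose=0)[0, 0]
        if p_animal > p_cut:
            detections.append((cx, cy))
    return detections
```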
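Finally, the assignment step at the heart of the tracker can be sketched with SciPy's implementation of the Hungarian algorithm. In the full uavTracker pipeline, a Kalman filter first predicts each track's position before matching; that prediction step is omitted here for brevity, and the function below is illustrative rather than uavTracker's actual code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(track_positions, detections, max_dist=50.0):
    """Hungarian matching of current detections to existing tracks.
    Both arguments are sequences of (x, y) points; returns a list of
    (track_index, detection_index) pairs."""
    tracks = np.asarray(track_positions, dtype=float)
    dets = np.asarray(detections, dtype=float)
    if tracks.size == 0 or dets.size == 0:
        return []
    # Pairwise Euclidean distances between tracks and detections.
    cost = np.linalg.norm(tracks[:, None, :] - dets[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Discard implausibly distant matches; in the full pipeline, unmatched
    # detections start new tracks and unmatched tracks are carried forward
    # by the Kalman prediction.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```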

Having chosen the architecture, we evaluated MOTHe by using it to track blackbuck in aerial footage shot in the wild. We quantify the performance of the classification network using the following metrics:

1. Precision: It measures the tendency of the model to predict a false positive. The higher the precision value, the lower the model's tendency to predict a false positive.

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

2. Recall: It measures the tendency of the model to predict a false negative. The higher the recall value, the lower the model's tendency to predict a false negative.

\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

3. Accuracy: It measures the overall performance of the model. A higher accuracy signifies better performance of the model on the dataset.

\[ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{False Positives} + \text{True Negatives} + \text{False Negatives}} \]

The performance of the CNN can be tuned by changing the threshold probability with which a prediction is made: if the probability of a "positive" (i.e. the model's predicted probability that an animal is present in the cropped region of the frame) is greater than the threshold probability, the prediction of the model is taken to be a "positive"; otherwise it is taken to be a "negative".
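For concreteness, these three metrics can be computed from the network's predicted probabilities at a given threshold as in the following minimal NumPy sketch (the array and function names are placeholders, not MOTHe's code):

```python
import numpy as np

def classification_metrics(y_true, p_animal, threshold=0.5):
    """Precision, recall and accuracy at a given threshold probability.
    `y_true` holds ground-truth labels (1 = animal, 0 = background);
    `p_animal` holds the CNN's predicted probability of an animal."""
    y_pred = (p_animal > threshold).astype(int)  # "positive" iff p > threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy
```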

It can thus be seen that increasing the threshold probability lowers the number of false positives, leading to higher precision, and raises the number of false negatives, leading to lower recall.

Decreasing the threshold probability has the opposite effect: precision falls due to more false positives, and recall rises due to fewer false negatives. This is called the "precision/recall tradeoff". The precision/recall tradeoff is plotted in Fig 1(a), and the accuracy of the model is plotted against the threshold probability in Fig 1(b).
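The tradeoff can be traced numerically by evaluating the metrics at every threshold, for instance with scikit-learn's precision_recall_curve; the labels and probabilities below are placeholder data for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder ground-truth labels and CNN-predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
p_animal = np.array([0.1, 0.8, 0.6, 0.4, 0.9, 0.2, 0.3, 0.7])

# Evaluates precision and recall at every distinct threshold: recall falls
# monotonically as the threshold rises, while precision tends to rise.
precision, recall, thresholds = precision_recall_curve(y_true, p_animal)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```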


The accuracy is calculated as the ratio of correct predictions to the total number of test cases.

Notably, MOTHe is able to detect animals even in the blackbuck videos, which are quite complex in terms of background heterogeneity and colour contrast. Blackbuck videos pose many other challenges, such as the similar colour of animals and background, the movement of background objects such as grass, shrubs and other animals, and the stillness of many blackbuck individuals over long durations.

The similarity between animal and background makes it difficult to apply the colour thresholding approach, while movement in the background and animal stillness make it difficult to obtain detections using image subtraction.

Since MOTHe is aimed at providing a complete, user-friendly setup for the steps related to visual animal tracking in videos, we also quantify the performance of our tracking module. We calculated the track length (measured in seconds) for two videos each from the blackbuck and wasp datasets. Track length was computed for all the individuals in these clips over a duration of 30 seconds, noting the time until the track ID changed for the first time. We also include the track length of the second ID within these 30-second windows. The initial ID of every individual is noted along with the time for which the individual is tracked under a consistent ID. In case of ID reassignment due to a mistrack, the new ID is noted along with the time for which the new ID persists.

A key assumption made in defining these tracking metrics is that one ID change (and hence one mistrack) is allowed.

An individual is considered to have lost its track after a second mistrack/ID change. We calculated the time for which the first and second assigned IDs lasted for all the individuals in these videos, and we present the median values of these durations.
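These durations can be computed from the sequence of track IDs assigned to each (manually verified) individual across the 30-second window. The sketch below is a hypothetical helper, assuming such a per-individual ID sequence and a known frame rate; it is not MOTHe's actual evaluation code.

```python
import numpy as np

def id_persistence(assigned_ids, fps=30):
    """Duration (in seconds) for which the first and second track IDs
    persist. `assigned_ids` is the per-frame sequence of track IDs
    observed for one individual; `fps` is the video frame rate."""
    ids = np.asarray(assigned_ids)
    changes = np.flatnonzero(ids[1:] != ids[:-1]) + 1  # frames where the ID switches
    if changes.size == 0:
        return len(ids) / fps, 0.0  # the first ID lasted the whole window
    first_len = changes[0] / fps
    second_end = changes[1] if changes.size > 1 else len(ids)
    second_len = (second_end - changes[0]) / fps
    return first_len, second_len
```

The medians reported above would then be taken over the first- and second-ID durations of all individuals (e.g. with numpy.median).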