Automated Video Surveillance for the Study of Marine Mammal Behavior and Cognition

Systems for detecting and tracking social marine mammals, including dolphins, can provide data to help explain their social dynamics, predict their behavior, and measure the impact of human interference. Data collected from video surveillance methods can be consistently and systematically sampled for studies of behavior, and frame-by-frame analyses can uncover insights impossible to observe from real-time, freely occurring natural behavior. Advances in boat-based, aerial, and underwater recording platforms provide opportunities to document the behavior of marine mammals and create massive datasets. The use of human experts to detect, track, identify individuals, and recognize activity in video demands significant time and financial investment. This paper examines automated methods designed to analyze large video corpora containing marine mammals. While research is converging on best solutions for some automated tasks, particularly detection and classification, many research domains are ripe for exploration.

Systems that detect and track social marine mammals provide data that can be crucial for understanding their social dynamics. Using these data, researchers can uncover regularities in long-term behavior and build models to predict social interactions. Additionally, recording the position and movements of these mammals during social interactions is essential if researchers wish to establish more nuanced relationships between behavioral contexts and the vocalizations that occur within them. As recording technology becomes cheaper and more portable, researchers are collecting more data than ever before. Just as it was recognized that understanding animal communication systems would require advances in analytic tools for ever-increasing datasets (McCowan, Hanser, & Doyle, 1999), it is now apparent that, in order to adequately address and make effective use of this new wealth of video data, we must develop ways to automate its processing.
One possible non-video solution would be to attach devices, such as Digital Acoustic Recording Tags (DTAGs) (Johnson & Tyack, 2003), to every animal in a group in order to obtain GPS coordinates, depth, and movements for all interactions (Tyson, Friedlaender, Ware, Stimpert, & Nowacek, 2012). Although this technique gives very accurate location data, the limited availability and high price tag of DTAGs make it prohibitively expensive to tag many animals. Even with multiple tags, it would take considerable effort to attach a device to every animal in a study group and ensure that they all remain attached for an overlapping duration. In addition, dolphins will modify their swimming behavior when these tags are attached (van der Hoop et al., 2014), indicating that the tagging process is intrusive and may interfere with group dynamics. Acoustic tags, or the use of hydrophone arrays, can help identify the position of animals over time through triangulation, but unless the animals are constantly vocalizing, they cannot be continually tracked.
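To make the triangulation idea concrete, the sketch below estimates a vocalizing animal's 2-D position from the differences in a call's arrival times at a small hydrophone array, using a coarse grid search over candidate positions. The array geometry, sound speed, and function name are illustrative assumptions for this sketch, not details of any system cited here; real arrays work in three dimensions and use closed-form or iterative solvers rather than a grid.

```python
import math

SPEED_OF_SOUND = 1500.0  # approximate speed of sound in seawater (m/s)

def tdoa_locate(hydrophones, arrival_times, step=1.0, extent=200.0):
    """Coarse 2-D grid search for a sound source from time differences
    of arrival (TDOA) at three or more hydrophones.

    hydrophones:   list of (x, y) positions in meters
    arrival_times: arrival time of one call at each hydrophone (s)
    """
    # Use hydrophone 0 as the reference; absolute emission time cancels out.
    measured = [t - arrival_times[0] for t in arrival_times]
    best, best_err = None, float("inf")
    y = -extent
    while y <= extent:
        x = -extent
        while x <= extent:
            dists = [math.hypot(x - hx, y - hy) for hx, hy in hydrophones]
            # Predicted time differences for a source at (x, y).
            predicted = [(d - dists[0]) / SPEED_OF_SOUND for d in dists]
            err = sum((m - p) ** 2 for m, p in zip(measured, predicted))
            if err < best_err:
                best, best_err = (x, y), err
            x += step
        y += step
    return best
```

With four hydrophones at the corners of a 100 m square and a source at (50, -30), the minimum-error grid point recovers the source position to within the grid spacing.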
Investigating marine mammal behavior and cognition requires a systematic and consistently applied sampling regime in order to produce unbiased data (Nowacek, Tyack, & Wells, 2001). Video surveillance constitutes an inexpensive and viable option for accomplishing this. Video provides the raw data, the physical record of events, from which conclusions can be drawn. Video provides an alternative to deciding on a schema for behavior ahead of time and allows for the re-sampling of behavior, the scoring of new features, and the reevaluation of activity. The ability to sample and retain hundreds or thousands of interactions between animals provides a substrate for understanding animal relationships at multiple levels. Researchers, after studying how relationships develop, can re-investigate how they code for and interpret the interactions underlying those relationships. Having a digital record supports the ability to re-evaluate the same data given new hypotheses (Forster, 2002). Finally, video provides us with the ability to slow the viewing rate to reveal previously undetectable behaviors. Frame-by-frame analyses can provide the minute details that underlie and help explain behavioral sequences and their dynamics (see Herzing & Johnson, 2015; Johnson, 2001, 2010).
Video data collected in these studies require skilled annotators to (1) detect and localize animals, (2) track individual animals, (3) identify individual animals, and (4) classify behaviors/activity. Each of these tasks is incredibly time consuming and prone to error. For example, it can take, on average, one hour to analyze and annotate one minute of video when conducting such micro-analysis, and the task grows even more time consuming as the size of the video collection increases.
The following sections cover the variety of video collection methods currently employed in research, as well as the automated methods that have helped tackle tasks typically delegated to trained coders.

Boat- and Shoreline-Based Methods
Data collection. Traditional boat-based observation enables researchers to apply their expertise to identify species, discriminate individuals, and recognize individual or group behaviors at the ocean's surface. With enough training, researchers can become quite skilled at these classification tasks. Researchers also employ photographs if the ongoing activity is too rapid or there are too many individuals to simultaneously identify. Through repeated surface observations or post-production images, this research has revealed features that greatly improve discriminability. For example, the ventral fluke patterns of humpback whales (Megaptera novaeangliae; Carlson, Mayo, & Whitehead, 1990; Dalla Rosa, Freitas, Secchi, Santos, & Engel, 2004), the lateral body pigmentation of minke whales (Balaenoptera acutorostrata; Arnold, Birtles, Dunstan, Lukoschpk, & Matthews, 2005; Dorsey, Stern, Hoelzel, & Jacobsen, 1990), the scarring of fin whales (Balaenoptera physalus; Agler et al., 1990), the natural markings and saddle patches of pilot whales (Globicephala melas; Auger-Méthé & Whitehead, 2007), and the callosity patterns of right whales (Würsig & Jefferson, 1990) contain enough variability for accurate individual discrimination from boat-based observations. To identify individual bottlenose dolphins, researchers focus on the unique markings and scars present on the dorsal fins and bodies of individuals as well as on the contours of the dorsal fin (Van Hoey, 2013). Maintaining databases of these images allows researchers to reevaluate the features used for identification. For example, in the case of blue whales (Balaenoptera musculus), researchers were able to more efficiently identify individuals by including dorsal fin shape in procedures that previously utilized only body pigmentation (Gendron & Ugalde de la Cruz, 2012).
Identifying individuals. The time-consuming and arduous task of matching photographs of dolphin dorsal fins to known individuals is one of the first tasks that was successfully addressed through automation. Early on, experts systematized their knowledge into a framework for extracting features from the shape of the dorsal fin for comparison against other known dolphin fins to determine identity (Würsig & Würsig, 1977). This process led to software systems, most notably FinScan (Hillman et al., 2003) and DARWIN (Stanley, 1995; Stewman, Stanley, & Allen, 1995), designed to facilitate this comparison process. The process was still time consuming, as it required that users first trace the contour of the fin, but later work has made automatic contour extraction possible (Hale, 2008; Kreho et al., 1999). Additional work has been done to refine and improve this process, including automatically correcting for orientation and pose (Gilman, Dong, Hupman, Stockin, & Pawley, 2013; Stewman, Debure, Hale, & Russell, 2006).
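The contour-comparison step at the heart of such systems can be sketched as follows: two traced fin outlines are resampled to the same number of evenly spaced points, normalized for position and scale, and scored by their mean point-to-point distance. This is a minimal stand-in under stated assumptions, not the matching actually used by FinScan or DARWIN, which also handle rotation, pose, and partial fins; the function names are illustrative.

```python
import math

def resample(contour, n=64):
    """Resample a polyline contour to n points evenly spaced by arc length."""
    lens = [0.0]  # cumulative arc length at each vertex
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        lens.append(lens[-1] + math.hypot(x1 - x0, y1 - y0))
    total, pts, j = lens[-1], [], 0
    for i in range(n):
        target = total * i / (n - 1)
        while j < len(lens) - 2 and lens[j + 1] < target:
            j += 1
        seg = lens[j + 1] - lens[j]
        t = 0.0 if seg == 0 else (target - lens[j]) / seg
        (x0, y0), (x1, y1) = contour[j], contour[j + 1]
        pts.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return pts

def contour_distance(a, b, n=64):
    """Translation- and scale-normalized mean distance between two contours.
    (A real fin matcher would also correct for rotation and pose.)"""
    def normalize(pts):
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        centered = [(x - cx, y - cy) for x, y in pts]
        scale = math.sqrt(sum(x * x + y * y for x, y in centered) / len(centered)) or 1.0
        return [(x / scale, y / scale) for x, y in centered]
    pa, pb = normalize(resample(a, n)), normalize(resample(b, n))
    return sum(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(pa, pb)) / n
```

Under this scheme, a contour and a scaled, shifted copy of itself score near zero, while a differently shaped fin scores substantially higher, which is the basis for ranking candidate identities.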
One of the main difficulties with automating this process is that the background can be noisy and constantly moving. Additionally, depending on the camera or atmospheric conditions, the image may be of poor quality, further reducing the success of identification algorithms. Current methods to identify individual dolphins from their fins operate predominantly in an offline fashion; that is, images of dolphins taken during observation are processed after returning from the field. An opportunity for advancing this technology is to build a real-time system that can process these images during boat-based observations and label the fins present with the most likely individuals. While the labels produced by early systems would need to be checked and rechecked in the lab, eventual top-performing systems with minimal error would provide much-needed real-time information for dolphin researchers observing interactive behavior in the wild.
Detection. One difficulty facing such real-time systems lies in the fact that the fins of the individual animals must first be automatically located within an image before a contour can be extracted. Many automated detection methods rely primarily on motion or color. Because the sea surface is constantly in motion and the colors of the fins and water can often be similar, these methods produce many false detections. The reader is referred to Paiva, Salgado-Kent, Gagnon, Parnum, and McCauley (2015) for active research in this area.
While it is difficult to achieve automated detection of dolphins from surface images or video by visual information alone, it may be easier through infrared video footage, because warm marine mammals stand out from the uniformly cool sea surface and this contrast results in fewer false detections (Graber, 2011). Infrared techniques for marine mammal detection have primarily been used from shore, but have also been used on ships (Zitterbart, Kindermann, Burkhardt, & Boebel, 2013). Work by Graber (2011) demonstrates that infrared video can be used to detect the dorsal fins of mammals from far distances. These methods can also be used to detect blows, as the breath expelled from whales as they surface is warmer than the surrounding water (Graber, 2011; Santhaseelan, Arigela, & Asari, 2012; Santhaseelan & Asari, 2015). Most of this work, however, is focused on larger, migrating whale species.
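A minimal sketch of the thermal-contrast idea: pixels warmer than the cool, fairly uniform sea surface are thresholded and grouped into connected regions, with tiny regions discarded as sensor noise or spray. This illustrates the general approach only, under assumed thresholds, and is not the algorithm of any system cited above.

```python
from collections import deque

def detect_warm_regions(frame, threshold, min_pixels=3):
    """Group pixels warmer than `threshold` into 4-connected regions and
    return a bounding box (row0, col0, row1, col1) for each region.
    `frame` is a 2-D list of temperature-like intensity values."""
    rows, cols = len(frame), len(frame[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] > threshold and not seen[r][c]:
                # Flood-fill one connected component of warm pixels.
                q, comp = deque([(r, c)]), []
                seen[r][c] = True
                while q:
                    cr, cc = q.popleft()
                    comp.append((cr, cc))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and frame[nr][nc] > threshold
                                and not seen[nr][nc]):
                            seen[nr][nc] = True
                            q.append((nr, nc))
                # Discard tiny components (noise, spray, isolated hot pixels).
                if len(comp) >= min_pixels:
                    rs = [p[0] for p in comp]
                    cs = [p[1] for p in comp]
                    boxes.append((min(rs), min(cs), max(rs), max(cs)))
    return boxes
```

On a synthetic infrared frame with a cool background, one warm blob, and one isolated hot pixel, the blob is returned as a single bounding box and the lone pixel is filtered out, which is the property that keeps the false-detection rate manageable.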

Aerial Methods
Although having skilled boat-based observers is useful for real-time understanding of behavior and much can be learned from these observations at the water's surface, it is difficult to track all animals present. Some animals may not be visible, and it can be difficult to determine the activity and arrangement of dolphins. To address these concerns, researchers soon started searching for new vantage points, taking to the skies. In this section, we examine a variety of video collection methods that capture behaviors hidden from surface views and provide avenues to automatically detect and track movement patterns. Although these methods might never provide some of the subtle social behavior present in close-up video footage, they nevertheless can provide researchers with data on behavior and patterns of interaction.
Data collection. Aerial video data collection was originally done via planes (Watkins & Schevill, 1979) and helicopters (Perryman & Lynn, 1994; Scott, Perryman, & Clark, 1985). These approaches, however, were prohibitively expensive and required altitudes that are too high for effective monitoring. Nowacek et al. (2001) introduced an overhead video system consisting of a remotely operated camera attached to the underside of a helium balloon. This system was used for many studies, some of which provided video data on how bottlenose dolphins forage (D. P. Nowacek, 1999) and respond to approaching boats (S. M. Nowacek, 1999). Video feeds were immediately stored, monitored for quality and applicability to the study, and adjusted accordingly (Figure 1).
A similar but smaller system, dubbed the "blimp-cam" (Hodgson, 2007), was employed for extensive data collection on the behavior of dugongs (Dugong dugon). Lewis, Wartzok, and Heithaus (2011) similarly used a controllable video camera secured beneath a Floatograph© airship, a combination balloon-kite system, to investigate leadership within a bottlenose dolphin community. Friedman et al. (2013) improved upon these platforms with the introduction of a low-cost aerial system, consisting of a hybrid parafoil kite and helium balloon, that can be elevated to cover the whole study area and requires no dedicated operator to orient the camera system.
Figure 1. Aerial systems, such as the "blimp-cam" from Hodgson (2007), provide video data on subsurface behavior and group configurations. Smaller balloons combined with kites (Lewis et al., 2011) have replaced older setups that were larger and more expensive (Nowacek et al., 2001).
Drone technology is becoming more mainstream and, as a result, is increasingly being used for the study of marine mammals. The use of unmanned aerial vehicles has several disadvantages, however. One difficulty with drones is that automation typically consists of users producing a set of GPS waypoints, which directs the drone to pre-established locations. Whereas this works well for nature conservation surveys over land (van Gemert et al., 2014), it is not ideal over water, where the location of subjects is much less predictable. To maintain the proper viewpoint, there must be a dedicated individual, on an accompanying boat, to pilot the drone and maneuver the camera, similar to the manual control present in blimp systems (Nowacek et al., 2001). Some groups have constructed their own proprietary drone systems and developed vision-based control mechanisms, which allow drones to follow an object at a distance given visual estimates (Selby, Corke, & Rus, 2011). However, these control systems are not available for cheap commercial models. The biggest drawback to using drones is that they can stay airborne for only roughly 20-25 min. Until battery technology improves, or unless researchers purchase a collection of drones, researchers will be limited to short recording segments. This does not allow for the recording of long behavioral sequences and limits the flexibility of researchers to deploy systems at the appropriate moments.
Detection. Although these aerial setups provide rich datasets to which continuous and consistent sampling can be applied, there is a lack of methods to deal with the incredible amount of data that is now being generated. Additionally, data collected by aerial techniques is typically processed offline, because algorithms that generate many candidate detections, and use sophisticated deep learning models for classification, require huge amounts of processing power. More lightweight algorithms, such as ones that detect objects by thresholding pixel color values, can run on board and have shown promise when tracking large single animals against a sea surface of uniform color (Selby et al., 2011). Other simpler models, which detect objects as collections of parts (deformable parts models) or provide quick classification (support vector machines), still perform poorly when dealing with large groups of animals (van Gemert et al., 2014). Even in a more successful system that detected half the animals present, roughly 95% of the detections were false positives (Mejias, Duclos, Hodgson, & Maire, 2013). Recent methods using deep learning have improved performance to detect over 80% of animals, but most of the detections (over 70%) still remain errors (Maire, Alvarez, & Hodgson, 2015). Although impressive, this replaces the work of manual detection with the manual elimination of false positives, which may be similarly time consuming.
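The detection statistics quoted above are statements about recall (fraction of real animals found) and precision (fraction of detections that are real), and framing them that way makes the trade-off explicit. A small helper computes both from raw counts; the counts below are illustrative, chosen to mirror a system that finds half the animals while producing roughly 95% false positives.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: fraction of detections that are real animals.
    Recall: fraction of real animals that were detected."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Illustrative counts: 50 of 100 animals found, alongside 950 false alarms.
p, r = precision_recall(true_positives=50, false_positives=950, false_negatives=50)
# Recall is 0.5 (half the animals), precision is 0.05 (95% of detections wrong).
```

A detector tuned for high recall at the cost of precision simply shifts human effort from finding animals to deleting false alarms, which is the limitation noted above.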
Individual recognition. Working with aerial photos, recent advances in deep learning have allowed for even more effective automated identification. The National Oceanic and Atmospheric Administration (NOAA), in collaboration with MathWorks and the New England Aquarium, held a competition on Kaggle (Kaggle, 2015) to automate the process of identifying individual right whales (Eubalaena glacialis) from aerial photos, as fewer than 500 still remain and most individuals' characteristics are well documented (Bogucki, 2016). The winners of this competition achieved high performance in this task, and because their algorithm is reusable, it will greatly reduce future manual labor for researchers by automatically identifying right whales in newly acquired photos. This model of providing the questions and data, and then outsourcing the computational work, has become increasingly popular. It leverages the computational resources of many skilled participants and fosters citizen science.

Underwater Methods
Observing marine mammals topside and from above offers a limited viewpoint, and it is difficult for researchers to simultaneously track all of the animals in their study and determine the activity and arrangement of animals within a group. Even while tracking a group of dolphins, for example, their fins are not always visible from the surface, and overhead cameras cannot always penetrate beneath the surface.
Data collection. Many research efforts routinely survey the ocean (Mellody, 2015; Somerton & Glendhill, 2005), and while these are not necessarily focused on marine mammals, there are exciting opportunities for researchers to use underwater remotely operated vehicles (ROVs) in regions densely populated or frequented by marine mammals. Transects with these devices might help to answer questions about density estimation or activity, while targeted observation may be able to contribute meaningfully to studies that require continuous monitoring of ongoing behavior in a specific geographical location. The disadvantages of using ROVs include the underwater noise they create and the possibility that they will disrupt the animals' natural behavior or drive them from the study site.
With the increasing popularity, and decreasing expense, of establishing multi-camera networks, it may become possible to create fixed-position systems that track multiple dolphins through an underwater ecosystem, similar to efforts to detect and track fish in coral reefs (Boom et al., 2012). Current research is investigating the advantages, and testing the limits, of such a system that allows for the continuous monitoring of dolphins to study behavior and cognition. For example, the dolphin pools at the Brookfield Zoo were instrumented with 10 cameras directed at underwater portions of the pools and three overhead cameras. These cameras record eight hours of video a day and provide a massive amount of video data that can be used to spearhead research on underwater detection and tracking of marine mammals (Karnowski, Hutchins, & Johnson, 2015).
Studies that use underwater hand-held video recording devices to capture dolphins in their natural habitat and in natural interactions work well as long as the animals are acclimated to humans in the water. For example, over 25 years of video collected while observing dolphins underwater in the Bahamas has led to a broad knowledge base on this species (Cusick & Herzing, 2014; Herzing & Elliser, 2013; Herzing, 2011; Herzing & Johnson, 1997). Recording underwater video also allows for the option of simultaneously recording sound, which is critical for studying communication. However, because such recordings are restricted to the researcher's focus and point of view, they may lack information on other individuals involved in group behavior and make it difficult to track a single focal animal or group over an extended period.
Automation of detection and localization. Whereas the majority of underwater detection and tracking efforts have been applied to monitoring fish (e.g., Matai, Kastner, Cutter Jr., & Demer, 2010), the same algorithms represent a great baseline to build upon, provided the cameras are placed farther from the subjects. The first step in these detection algorithms is typically to subtract a static reference image, the background, to find the foreground objects (Figure 2). Background images can be found in various ways, including computing a moving average or using Gaussian Mixture Models (GMMs) (Spampinato, Chen-Burger, Nadarajan, & Fisher, 2008). A recent study on animal detection in camera traps on land utilized Robust Principal Component Analysis (RPCA) (Khorrami, Wang, & Huang, 2012) for background subtraction. RPCA decomposes a video with a static background into a background, a foreground, and an amount of noise. Using GMMs and RPCA for detection in captivity has allowed researchers to highlight when dolphins move between pools and to locate dolphins within an underwater image (Karnowski et al., 2015). Because underwater scenes containing animals often contain multiple species, especially in the wild, detection systems typically feed into algorithms that classify by species (Cline & Edgington, 2010; Shafait et al., 2016). This may prove useful in dolphin research when recordings include multiple species (Herzing & Johnson, 1997).

Tracking. Underwater tracking algorithms have centered on the Meanshift and Camshift algorithms (Spampinato et al., 2008), which use the color histogram of previous frames to predict future target locations. More advanced tracking algorithms, however, take advantage of training data, adapt their object template to changing conditions, and are better at remaining with the current tracked object rather than drifting to nearby areas of the video. Real-time compressive tracking (CT) (Zhang, Zhang, & Yang, 2012) achieves high success rates on challenging datasets, so it is an ideal candidate for improving the state of the art in underwater tracking, including dolphin tracking. It assumes, however, that the initial tracking window has been specified through human interaction, so making this fully automated would require algorithms that perform some initial detections.
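The moving-average background model mentioned above can be sketched in a few lines: the background estimate drifts toward each new frame at a fixed rate, and pixels that differ sharply from it are flagged as foreground (a candidate animal). GMMs and RPCA replace this simple per-pixel average with richer models; the learning rate and threshold below are illustrative assumptions.

```python
def update_background(background, frame, alpha=0.05):
    """Exponential moving-average background model: each background pixel
    drifts toward the corresponding pixel of the current frame at rate alpha."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]

def foreground_mask(background, frame, diff_threshold=10.0):
    """Pixels differing from the background model by more than
    diff_threshold are flagged as foreground."""
    return [[abs(f - b) > diff_threshold for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]
```

Because the background is updated continuously, slow changes in lighting are absorbed into the model, while a dolphin moving through the frame stays in the foreground mask; choosing alpha trades adaptation speed against the risk of a stationary animal fading into the background.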

Emerging Research
There have been many advances in automating the analysis of videos for marine mammal research, and there are many exciting avenues available for future studies. As one of the main ambitions for capturing a wealth of video is to understand marine mammal behavior, methods that automatically uncover activity types and detect patterns within group movements are paramount to success. Activity recognition, however, is one of the most underdeveloped lines of research, as it initially requires solid results in detection and tracking to be successful. Nevertheless, this is an active area of research, and there are several labs working on solutions for detecting and classifying behavior in aerial and underwater footage.
Another nascent area of research is individual recognition in underwater video footage. Animals that have prominent markings, such as Atlantic spotted dolphins, can provide salient features from which individuals can be automatically distinguished (T. Starner, personal communication, December 12, 2015). Species with less prominent features, such as bottlenose dolphins, can still be distinguished by trained observers and, thus, the procedure should be amenable to automation. Observational setups, like the one established at the Brookfield Zoo, provide a sufficient number of instances from which algorithms can learn to differentiate individuals. Establishing the proper camera placements and extracting and processing the proper datasets will promote further solutions to this problem.

Some advances to automated video processing may be unexpected, as video capture technologies will likely continue to develop quickly beyond their immediate limits. For example, recent advances have produced prototypes for unmanned vehicles that are equally adept at navigating through the air and underwater (Maia, Soni, & Diez, 2015). With new technology comes the opportunity to re-evaluate, improve, and re-invent the methods built on top of it. There are likely to be new challenges to tackle with automated methods, and future algorithmic performance is likely to exceed expectations.

Conclusion
For those who are interested in marine mammal behavior and cognition, video provides a cheap and viable method for systematic and consistent monitoring that has untold benefits as data is analyzed and reanalyzed. Video can be obtained in a variety of ways, including, but not limited to, boat-based, aerial, and underwater capture methods. With these data, researchers have built automated systems to process and extract information relevant for detecting and tracking freely moving individuals as they engage in social activity. As more sophisticated video capture systems emerge, researchers will continue to tackle issues of how best to mine the wealth of data generated. The future is bright for video surveillance and the development of automated methods to analyze its data, and all those who are interested in furthering the field should seek to tackle open areas of research.

Figure 2. Example system that detects and classifies fish in images (Matai et al., 2010).