Human visual identification of individual Andean bears Tremarctos ornatus

It is often challenging to use invasive methods of individual animal identification for population estimation, demographic analyses, and other ecological and behavioral analyses focused on individual-level processes. Recent improvements in camera traps make it possible to collect many photographic samples yet most investigators either leap from photographic sampling to assignment of individual identity without considering identification errors, or else to avoid those errors they develop computerized methods that produce accurate data with the unintended cost of excluding participation by local citizens. To assess human ability to visually identify Andean bears Tremarctos ornatus from their pelage markings we used surveys and experimental testing of 381 observers viewing photographs of 70 Andean bears of known identity. Neither observer experience nor confidence predicted their initial success rate at identifying individuals. However, after gaining experience observers were able to achieve an average success at identifying adult bears of 73.2%, and brief simple training further improved the ability of observers such that 24.8% of them achieved 100% success. Interestingly, observers who were initially more likely to falsely identify two photos of the same bear as two different bears than vice versa were likely to continue making errors and their bias became stronger, not weaker. Such biases would lead to inaccurate population estimates, invalid assessments of the bears involved in conflict situations, and underestimates of bear movements. We thus illustrate that in some systems accurate data on individual identity can be generated without the use of computerized algorithms, allowing for community engagement and citizen science. 
In addition, we show that when using observers to collect data on animal identity it is important to consider not only the overall frequency of observer error, but also observer biases and error types, which are rarely reported in field studies.

Visual identification of individuals, either from direct sightings or from imagery such as camera trap photos, benefits many lines of research, from studies of development (Swanson et al. 2013) and behavioral ecology (Charpentier et al. 2008) to population estimation (Ngoprasert et al. 2012) and other analyses with direct conservation implications (reviewed by McGregor and Peake 1998). Thus, for decades natural markings have been evaluated for the noninvasive identification of individuals of various species (Pennycuick 1978, Jarman et al. 1989). However, although natural markings never produce perfect individual identification (Pennycuick 1978, Jarman et al. 1989), identification errors are often unreported (but see Stevick et al. 2001, Frasier et al. 2009). There are two types of identification errors: false matches (i.e. incorrectly identifying images from multiple individuals as images from one), and false mismatches (i.e. incorrectly identifying multiple images of the same individual as images from multiple individuals). These two error types may skew subsequent analyses and conclusions differently. For example, false matches may lead to underestimates of population size, while false mismatches can lead to overestimates of population size (Stevick et al. 2001, Hastings et al. 2008, Goswami et al. 2012). Thus, to reach valid conclusions it is critical that researchers quantify and characterize identification errors (Yoshizaki et al. 2009). Error rates may differ with training and experience (Stander et al. 1997, Diefenbach et al. 2003, Schofield et al. 2008), but experience does not automatically produce low error rates (Diefenbach et al. 2003, Patton and Jones 2008, Evans et al. 2009) even if the observer is confident in their own ability to identify the target species (De Angelo et al. 2010). Thus, regardless of observer experience or confidence, observer ability must be quantified. Actual errors of identification cannot be determined solely from non-invasive photos of wild individuals (Ríos-Uzeda et al. 2007, Bashir et al. 2013), but accuracy and precision can be assessed by testing identification assigned blindly to known (i.e. captive) individuals (Higashide et al. 2012).
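The two error types defined above can be made concrete with a minimal sketch. The response labels ('yes', 'no', 'unsure') and the function names are hypothetical choices for illustration, not part of any published protocol; the sketch only assumes that the true identity of each image pair is known, as it is for captive bears.

```python
from collections import Counter

def classify(truth_same: bool, answer: str) -> str:
    """Classify one response to a paired-image comparison.

    truth_same: True if both images really show the same individual.
    answer: 'yes' (same), 'no' (different) or 'unsure' (hypothetical labels).
    """
    if answer == "unsure":
        return "undetermined"
    said_same = answer == "yes"
    if said_same == truth_same:
        return "correct"
    # Saying 'same' for two different animals is a false match;
    # saying 'different' for one animal is a false mismatch.
    return "false_match" if said_same else "false_mismatch"

def error_summary(responses):
    """Count outcomes and return the ratio of false matches to false mismatches."""
    counts = Counter(classify(truth, answer) for truth, answer in responses)
    n_match = counts["false_match"]
    n_mismatch = counts["false_mismatch"]
    ratio = n_match / n_mismatch if n_mismatch else None  # undefined without mismatches
    return counts, ratio

# Invented responses for one observer: (true status, observer answer)
responses = [
    (True, "yes"), (True, "no"), (False, "no"),
    (False, "yes"), (True, "no"), (True, "unsure"),
]
counts, ratio = error_summary(responses)
```

A ratio below 1 indicates a bias toward false mismatches, which would tend to inflate the number of individuals counted.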
Variation in natural markings has been used to identify individuals of some bear species (Noyce et al. 2001, Higashide et al. 2012, Ngoprasert et al. 2012). Because the markings on the face, throat and neck of Andean bears Tremarctos ornatus have been thought to differ among individuals (Thomas 1902, Hornaday 1911), some researchers have begun using them to assign individual identity to bears in photos from camera traps (Ríos-Uzeda et al. 2007, Zug 2009, Jones 2010). However, these methods have not been thoroughly tested. Although Roth (1964) and Eck (1969) described variation in markings among captives, neither examined many bears (n = 19 and n = 5, respectively). In addition, although the markings are present from birth (Saporiti 1949, Roth 1964, Dathe 1967, Eck 1969), their permanence is untested and it is unknown if any changes in the markings would affect individual identification. We suspect that the highest rate of apparent disappearance of 'known' wild Andean bears will occur during subadulthood due to increased mortality as cubs become independent of their mothers and due to primary dispersal; neither of these processes has been studied in this species. If the markings of cubs change enough during maturation to confound individual identification then this will further increase the apparent disappearance of known individuals, inflating estimates of mortality and dispersal. So, although comparing images of cubs and adults may be a less common task for researchers than comparing images of adults, the former task warrants special consideration. Although some researchers have developed project-specific protocols for identifying individual Andean bears (Zug 2009, Jones 2010), methods are not standardized across studies (Garshelis 2011) and there are typically no estimates of error (Goldstein and Márquez 2004, Garshelis 2011), making it difficult to compare results across studies. For example, it is possible that two studies might produce similar estimates of bear density even though one study was conducted in an area with a lower density of bears, simply because there was a higher and unmeasured rate of false mismatches in that data set.
There are numerous methods for computer-assisted or automated identification of individuals of several mammal species (Kelly 2001, Karlsson et al. 2005, Hiby et al. 2009, Goswami et al. 2012). These methods allow for rapid identification, which can otherwise be labor-intensive with many individuals or photos, and they may achieve better accuracy than manual identification when identification is challenging (Kelly 2001). Those two advantages are likely not relevant for Andean bears, whose markings appear to differ among individuals, and which are thought to live at low densities (but see Garshelis 2011). In addition, manual identification harnesses the ability of humans to correct for image variation due to occlusions and shadows, which remains challenging for image processing software (Allen et al. 2011). If manual identification of individuals achieves high accuracy and good precision, this would avoid three key disadvantages of computerized identification in research and conservation of Andean bears: the development of such methods requires expertise and funds not often available to field programs, they preclude field identification of individual bears during direct observations, and they require technicians to be computer literate, excluding most local residents.
Conservation science is concerned not just with knowledge, but also with conservation impact, which may be enhanced through local participation (Danielsen et al. 2007). Local attitudes have important effects on conservation of Andean bears (Velez-Liendo 2005), especially because the bears come into conflict with humans, local communities are active inside many protected areas (Naughton-Treves et al. 2006), and most tropical forests lie outside of protected areas (Chazdon et al. 2009). Engaging local people in research can capitalize on their knowledge and skills (Stander et al. 1997, Zuercher et al. 2003, Sharma et al. 2005) and lead to better communication and better conservation outcomes (Peyton 1989, Byers 1999). We therefore assess the permanence of Andean bear markings and whether observer characteristics, experience, and training affect their performance, to explore whether the use of minimal technology may produce high quality data and enhance their potential conservation impact.

Material and methods
We collected portraits of captive Andean bears of known identity and age from zoo personnel and field researchers in North America, Europe and South America. We discarded images with poor resolution, lighting or clarity, but did not reject images with extreme camera angles. The images we retained illustrated a wide range of facial markings, ranging from no facial markings to broad full circles around both eyes and depigmentation or 'grizzling' across much of the rest of the face. To assess the permanence of facial markings we visually compared across time the markings of the 24 bears for which we had photos as both cubs and adults. We also looked for evidence of grizzling in the photos of all 64 known-age bears in our sample.
To assess humans' ability to identify individual Andean bears we first created online surveys in English and Spanish, the common language across the species' range, using 65 different photographs of 39 known bears. To evaluate participants with a variety of personal and professional backgrounds we solicited participation in the survey by emails to colleagues, peers, and personal contacts, as well as through an announcement in the International Bear News (Paisley et al. 2010). Participants reviewed 21 pairs of images: six pairs of images of adults spanning up to 13 years from the same bear, and 15 pairs that included one photo of a cub and one photo of an adult ('cub-adult' pairs) spanning up to 23 years from the same bear. For each pair, participants responded to the question "Are the photos above of the same bear?" with one of three responses: "yes", "no", and "unable to determine". This task mimics some of the identification tasks faced during field research, such as when a bear under observation in a corn field must be compared to a photo taken earlier during a similar event, or when a recently retrieved camera trap photo must be compared to a camera trap photo taken at another location. To examine whether success was affected by participants' background or personal characteristics we collected information on participant sex, age and experience working with bears, Andean bears, or with visual identification of individuals of any wildlife species. To remove potential biases caused by poorly motivated participants, or by missing data, we only analyzed data from participants who answered at least 15 of the 21 questions. We measured participant performance as the proportion of responses that were correct. The average age of the 120 online participants (50 men, 70 women) who answered at least 15 of the 21 questions was 36.4 ± 12.1 years. Nineteen participants (16%) had experience working with bears but not Tremarctos, 10 participants (8.3%) had experience working with Tremarctos, and 68 participants (56.7%) had experience with visual identification of individual wild animals.
We assessed whether participant performance differed from random and if it differed between adult pairs and cub-adult pairs, and if it differed depending on whether markings changed during maturation, using t-tests and paired t-tests. We then used an information theoretic approach (Burnham and Anderson 2002) to compare all possible models for online participant performance, built with data on participant characteristics (i.e. sex, age, experience with bears, experience with Tremarctos, and experience with visual identification of individuals). We did not include interaction terms in potential models, and we used AICc as a key criterion for model comparison (Burnham and Anderson 2002).
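As a minimal sketch of this kind of model comparison, the block below enumerates every additive model that can be built from five participant characteristics and computes AICc from a least-squares fit, using the standard small-sample form AICc = n ln(RSS/n) + 2k + 2k(k + 1)/(n - k - 1). The predictor names, the residual sums of squares and the parameter counts are invented for illustration; the analyses reported here were run in JMP, not with this code.

```python
import math
from itertools import combinations

def all_additive_models(predictors):
    """Every subset of predictors (no interaction terms), including the null model."""
    return [
        subset
        for size in range(len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]

def aicc_least_squares(rss: float, n: int, k: int) -> float:
    """AICc for a least-squares model.

    rss: residual sum of squares; n: sample size;
    k: number of estimated parameters (including intercept and error variance).
    """
    if n - k - 1 <= 0:
        raise ValueError("sample size too small for the AICc correction")
    aic = n * math.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def delta_aicc(scores):
    """Difference between each model's AICc and the lowest AICc in the set."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}

# Hypothetical predictor labels for the five participant characteristics.
predictors = ["sex", "age", "bear_exp", "tremarctos_exp", "visual_id_exp"]
models = all_additive_models(predictors)  # 2**5 = 32 candidate models

# Invented residual sums of squares for three of those models, n = 120.
scores = {
    "null": aicc_least_squares(rss=4.60, n=120, k=2),
    "age": aicc_least_squares(rss=4.53, n=120, k=3),
    "age+bear_exp": aicc_least_squares(rss=4.52, n=120, k=4),
}
deltas = delta_aicc(scores)
```

Models whose difference from the best model is small (commonly less than about 2) are usually treated as similarly supported, so a low AICc alone does not show that a model explains much variation.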
We also assessed the effects of experience and simple training through experimental testing of staff, volunteers and visitors at San Diego Zoo's Inst. for Conservation Research, using 94 photographs of 55 known bears. We implemented experimental sessions to groups in a pre-post test study design with 'experience' groups (E groups) and 'experience and training' groups (E-T groups; Oppenheim 1992). All sessions were less than 30 min long and began with a 5-min overview of Andean bear ecology and conservation, conservation research, the purpose of the session and instructions on how to enter their responses into our Classroom Performance System (ver. 1.50.063), an interactive system that allows participants to use remote controls to record their answers. A portrait of an Andean bear was shown while it was explained that individual Andean bears might be recognized by their unique markings, including muzzle freckles. We told participants that they would review pairs of images and that for each pair they would have 20 s to respond to the question "Are these the same bears?" with one of three responses: "Yes, these are the same bears", "No, these are not the same bears", and "I am not sure if these are the same bears". We stressed that each possible answer, including uncertainty, was viable. We asked participants to compare as many features of the markings as possible, excluding nose color, and using caution when interpreting photos with extremes of lighting or orientation.
Within each session we displayed 60 pairs of images sequentially, in the same order in both treatments. In both treatments the first 15 comparisons were the pre-test, while the last 15 comparisons were the post-test. Thus, the treatments differed only in the presentation of the middle 30 comparisons. In the E treatment the transition between the three sections was seamless with a 3-min break at question 34. Thus, any change in E participant performance, as measured by a comparison between the first 15 comparisons ('initial' performance) and the last 15 comparisons ('final' performance), would be due to viewing additional images of Andean bears. In the E-T treatment, while viewing the middle 30 pairs of images the participants received simple training. After viewing each pair for 20 s a group discussion was held in which participants shared their answers and reasoning aloud. The instructor then revealed the correct answer and highlighted meaningful comparisons between images. Three points of comparison were illustrated for pairs of images showing the same bear (matches) and 2-3 points of comparison were highlighted for pairs of images showing different bears (mismatches). If it was not possible to determine if a pair of images displayed the same or different bears, the instructor illustrated this (e.g. differences in image angle). We conducted sessions as competitions in which the highest score on pre- and post-tests earned a small nonmonetary prize.
We used the same sequences of the same images in both treatments. Although the pairs of images differed between the pre- and post-tests, each test contained six matches, six mismatches and three comparisons that were impossible to identify as a match or mismatch. Both the pre- and post-tests were composed of six pairs of adult images and nine cub-adult pairs. Among the middle 30 comparisons (14 cub-adult pairs and 16 adult pairs) there were 14 matches, 14 mismatches and two comparisons that were impossible to identify as matches or mismatches.
We collected data on four participant characteristics: sex, age (four categories), self-perceived ability to identify Andean bears (five categories), and the frequency with which the participant observed wildlife (five categories). We ensured confidentiality and anonymity by collecting no personal identifying information. Because we had no prior knowledge of how well participants might succeed, and which participant characteristics might affect initial performance, we compared participant performance to random and then used an information theoretic approach to compare all possible models built with data on participant characteristics. We used a similar approach to investigate changes in measurements of interest (e.g. success identifying individuals), comparing models built on 'treatment' with those also containing 'group' nested within 'treatment'; we did not include interaction terms in potential models and we used AICc as a key criterion for model comparison. To assess whether participant success with cub-adult pairs might respond differently than participant success with adult images, we conducted statistical analyses of identification separately for cub-adult pairs, and for adult pairs.
We could not predict whether false matches or false mismatches would be more common among observers' errors. For some species false matches have been found to be less common than false mismatches (Stevick et al. 2001) while in other systems the reverse was true (Patton and Jones 2008) or the bias was small (Frasier et al. 2009). We therefore assumed that participant errors would be equally divided among false matches and false mismatches, yielding a 1:1 ratio of false matches:mismatches. We analyzed ratios of error types as we did participant success, to investigate participants' initial ratio of false matches:mismatches, and the change in error type ratio across participants and treatments. Three hundred twenty people participated in experimental tests. Technical difficulties with the first two groups of participants (n = 55) and incomplete data from a few participants (n = 4) led us to discard those data, leaving data from 136 E participants (seven groups) and 125 E-T participants (seven groups).
Unless otherwise noted all quantities are expressed as x̄ ± SD; statistical significance refers to two-tailed p = 0.05. Statistical analyses were conducted in JMP ver. 9.0.3.
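Comparisons of average performance against the 50% expected at random can be checked from summary statistics alone. The sketch below is an illustration rather than the JMP procedure used for the analyses; it computes a one-sample t statistic from a mean, SD and sample size, using the values reported for online participants on cub-adult pairs whose markings did not change (58.6 ± 15.7%, n = 120).

```python
import math

def one_sample_t(mean: float, sd: float, n: int, mu0: float):
    """One-sample t statistic and degrees of freedom from summary statistics."""
    if n < 2 or sd <= 0:
        raise ValueError("need n >= 2 and a positive SD")
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1

# Percent correct compared with the 50% expected at random.
t_stat, df = one_sample_t(mean=58.6, sd=15.7, n=120, mu0=50.0)
```

With these rounded inputs the statistic is about 6.0 with 119 degrees of freedom, close to the reported t = 6.019; the small difference is what would be expected from rounding of the summary values.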

Results
The facial markings did not change during maturation in most Andean bears for which we have photos as both cubs and adults (66.7%, n = 24). However, the markings of some cubs became thinner or less obvious between approximately the first and second year of life. This appears to occur symmetrically on both the left and right sides of the face and if a cub's marking was thin or faint it may not be apparent when the bear is an adult (Fig. 1). If an Andean bear survives long enough its appearance may change again through grizzling; based on photos of 25 grizzled Andean bears it appears that grizzling first appears around the eyes and pre-existing markings and can eventually spread across the entire face (Fig. 2). The youngest bear for which we have evidence of grizzling was eight years old; most bears photographed when over 10 years old showed some grizzling (80%, n = 30).

The average success of online participants at identifying cub-adult pairs differed depending on whether the cub's markings thinned during maturation. Participants correctly answered on average 58.6% (± 15.7%) of questions in which the cub's markings did not change (n = 10), which differed from the 50% expected at random (Fig. 3, t = 6.019, DF = 119, p < 0.001). They correctly answered on average 30.1% (± 21.1%) of questions with cub-adult pairs in which the markings did change (n = 5), which was also not random (t = -10.32, DF = 119, p < 0.001) and different from their average success with cub-adult pairs in which the markings did not change (t-ratio = -11.808, DF = 119, p < 0.001). The best model for participant average success with cub-adult pairs in which markings did not change (i.e. 'age') was not well supported by the data (R² = 0.005, DF = 119, F-ratio = 1.124, p = 0.206). Similarly, the best model for average online success with cub-adult pairs in which markings did change (i.e. 'age' and 'experience with bears') was also not well supported by the data (R² = 0.055, DF = 119, F-ratio = 2.244, p = 0.087). Thus, no participant characteristic had a meaningful impact on how well participants in the online survey could identify cub-adult pairs, whether or not cubs' markings changed.

the data (R² = 0.012, DF = 260, F-ratio = 3.19, p = 0.075). Thus, training did not obviously improve the ability of participants to identify cub-adult image pairs more than did simple experience; in the end the E participants could identify 62.7% (± 15.3%) of cub-adult image pairs and the E-T participants could identify 67.2% (± 13.5%) of cub-adult image pairs. Motivated exposure to additional images of bears improved participant success at identifying cub-adult pairs.
The initial performance of experimental participants at identifying adult image pairs was better than expected at random (Fig. 4, 64.9 ± 18.1%, n = 261, t = 13.316, p < 0.001) and it was better than these participants' initial performance at identifying cub-adult pairs (t-ratio = -8.673, DF = 260, p < 0.001). The best model for the initial success of experimental participants in identifying adult image pairs (i.e. 'treatment') was not well supported by the data (R² = 0.014, DF = 260, F-ratio = 3.559, p = 0.06). Thus, there was no indication that any participant characteristic had a meaningful impact on the initial success in identifying adult image pairs. After treatment, participants were better at identifying adult image pairs (i.e. the % change in performance was not 0; Fig. 4, 13.5 ± 21.1%, n = 261, t = 10.35, p < 0.001). The best model for the improvement in identification of adult image pairs (i.e. 'treatment') explained little of the variation in the data (R² = 0.026, DF = 260, F-ratio = 6.843, p = 0.009). Thus, training did not obviously improve the average ability to identify adult image pairs more than did simple experience; in the end, on average E participants could identify 73.2% (± 15.2) of adult image pairs and E-T participants could identify 84.1% (± 11.9) of adult image pairs. In other words, motivated exposure to additional images of bears, but not simple training, improved average participant success at identifying adult pairs. Participants in the E treatment and in the E-T treatment were both able to identify on average more adult image pairs than cub-adult pairs (t-ratio = -6.803, DF = 135, p < 0.001 and t-ratio = -10.276, DF = 124, p < 0.001, respectively). Before treatment, 3.7% of E participants (5 of 136) successfully

Figure 3. Changes in markings affected average online identification success. The average online success of 120 participants was highest for adult pairs and cub-adult pairs in which the markings did not change, and better than expected at random in both cases. However, the average online success of participants was lower for cub-adult pairs in which the markings did change; participants performed worse than expected at random. Asterisks indicate significant differences from random and letters indicate statistically significant groupings of averages.

Figure 4. During experimental testing the initial average participant success at identifying individuals was better than expected at random for both cub-adult pairs and for adult pairs. This average success improved regardless of treatment, indicating that simple exposure to more images improved success. Simple training did provide marginally better average performance than experience alone. Asterisks indicate significant differences from random and letters indicate statistically significant groupings of averages.
The average success of online participants at identifying adult pairs of images (55.9 ± 19.7%) was better than expected at random (Fig. 3, t = 2.852, DF = 119, p = 0.005). Although it was not statistically different than these participants' average success at identifying cub-adult images of cubs whose markings did not change (t-ratio = -1.769, DF = 119, p = 0.08), average success identifying adult pairs was significantly different than participant success with cub-adult images of cubs whose markings did change (t-ratio = -9.431, DF = 119, p < 0.001). The best model for average online success with adult images (i.e. 'age') was not well supported by the data (R² = 0.016, DF = 119, F-ratio = 1.94, p = 0.166). Thus, average participant success online differed between adult pairs and cub-adult pairs only if the markings changed during maturation, and average online success with adult pairs was not explained by any measured participant characteristic. However, averages do not reveal the success of the best participants, which is relevant for assessing potential to produce high-quality data. Although the average online success rate with adult images was far below 100%, five online participants (4.2%) did successfully rate all adult pairs.
In experimental testing, participant initial success in identifying cub-adult image pairs was slightly better than expected at random (Fig. 4, 53.7 ± 12.6%, n = 261, t = 4.734, p < 0.001); none of the illustrated cubs' markings changed during maturation. The best model for initial success in identifying cub-adult image pairs (i.e. 'participant sex') was not well supported by the data (R² = 0.002, DF = 260, F-ratio = 0.617, p = 0.433). Thus, no participant characteristic had a meaningful impact on their initial success in identifying cub-adult image pairs. After treatment, during which six of eight cub-adult pairs illustrated changes in markings during maturation, participants were better at identifying cub-adult image pairs (i.e. the % change in performance was not 0; Fig. 4, 11.2 ± 17.8%, n = 261, t = 10.14, p < 0.001); the cub-adult pairs shown after treatment did not illustrate changes in markings. The best model for the improvement in identification of cub-adult image pairs after treatment (i.e. 'treatment') explained almost none of the variation in

than the overall pool of participants after beginning with a stronger bias for false mismatches.

Discussion
A bear's age affects individual identification. The oldest Andean bears may be quickly identified by grizzling, if wild bears live so long, and the markings of some cubs become thinner during maturation, which made it more difficult for our online participants to identify the cubs as adults. However, changes in the markings may not be the only characteristic that makes it difficult to identify cubs as adults. Although trained participants in the experimental tests were shown how the markings might change during maturation, they then still found it challenging to identify cubs as adults even when the markings had not changed. This suggests that great caution is needed when comparing images of cubs and adults: only observers with proven ability should make such comparisons. This is especially important because this identification challenge arises during a poorly understood life stage when various processes might cause an individual to appear to vanish from a population (e.g. dispersal).
We believe there are three reasons why participants in the online survey were unsuccessful at identifying individual bears, regardless of their prior experience with bears, or with Tremarctos. First, there were only 10 participants with any experience working with Tremarctos, and the type and duration of their experience varied. Second, because each participant with experience of Tremarctos was probably exposed only to bears in their own work, it was unlikely that any participant had seen as many different bears, and as much variation in markings, as in the images we presented. Third, although experience sometimes confers improved ability (Stander et al. 1997, Diefenbach et al. 2003, Schofield et al. 2008), this is not always the case (Diefenbach et al. 2003, Patton and Jones 2008, Evans et al. 2009); experience may have conferred overconfidence. Initial success in the experimental assessment was not only low, it was unrelated to participant confidence. This disconnect between self-perceived and true ability is not novel. Competent individuals may have an accurate perception of their performance, but individuals who are not competent at a task are sometimes unable to evaluate their own performance (Kruger and Dunning 1999, Dunning et al. 2003, De Angelo et al. 2010). Thus, experience and self-perceived ability to identify individuals do not guarantee data of any particular quality. Some people's performance in this somewhat subjective task improves with experience, as illustrated by our experimental data, but other people's performance does not.
Although simple training did not improve people's ability on average to identify individual Andean bears more than did viewing images of Andean bears, simple training yielded more trainees who were able to successfully identify all pairs of images. Many trainees could not, but we believe that additional training of motivated observers would magnify the beneficial effect we observed and allow for collection of accurate data. We suspect there is no one living alongside Andean bears who is an expert at identifying more than a few familiar individual bears, but training followed by assessment could improve the research value of some local residents who

identified all pairs of adult images; after treatment, 7.4% (10 of 136) of E participants did so, illustrating that their experience identifying bears did not significantly increase the proportion of participants that were entirely successful (e.g. χ² = 1.76, DF = 1, p = 0.184). Before treatment, 4.8% of E-T participants (6 of 125) successfully identified all pairs of adult images; after treatment, 24.8% (31 of 125) of the E-T participants did so, revealing that simple training made it more likely that some participants correctly identified all pairs of adult images (χ² = 19.83, DF = 1, p < 0.001). In fact, the proportion of participants responding perfectly was greater after simple training than after motivated exposure to images (χ² = 14.97, DF = 1, p < 0.001).
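The χ² values above can be reproduced from the reported counts. The sketch below assumes a Pearson χ² test on a 2 x 2 table without continuity correction; that choice is an assumption that happens to match the reported statistics, not a documented description of the JMP analysis. The function names are illustrative.

```python
import math

def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator

def p_value_df1(chi2: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Rows are groups; columns are (perfect score, not perfect) on adult image pairs.
e_before_vs_after = chi_square_2x2(5, 131, 10, 126)   # E: 5 of 136 vs 10 of 136
et_before_vs_after = chi_square_2x2(6, 119, 31, 94)   # E-T: 6 of 125 vs 31 of 125
e_vs_et_after = chi_square_2x2(10, 126, 31, 94)       # after: E vs E-T

p_e = p_value_df1(e_before_vs_after)
```

Rounded to two decimals these are 1.76, 19.83 and 14.97, and the first corresponds to p of about 0.184. Because the before and after counts come from the same participants, a paired test such as McNemar's would be a natural alternative where individual-level responses are available.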
Were errors in individual identification of adult image pairs equally divided among false matches and false mismatches? Experimental participants initially had a strong bias for false mismatches: the initial ratio of false matches to false mismatches was 0.44 ± 0.64 (n = 154, DF = 153, t = -11.03, p < 0.001). The best model for this bias (i.e. 'treatment') was not well supported by the data (R² = 0.028, DF = 153, F-ratio = 4.31, p = 0.04). Thus, there was no indication that any characteristic of these participants had a meaningful impact on the ratio of their error types when initially assessing adult images. Among participants who continued to make errors identifying adult image pairs, the final ratio of false matches to false mismatches was 0.22 ± 0.47, still not 1:1 (n = 173, DF = 172, t = -21.65, p < 0.001). This was a change of -0.17 ± 0.81 from the initial ratio, different from the change expected by chance (i.e. 0; n = 112, t = -2.278, p = 0.025): among those who continued to make errors the bias for false mismatches was stronger after treatment than before. The best model for the change in ratio of false matches to false mismatches included only 'treatment' and it explained virtually none of the variation in the data (R² < 0.001, DF = 111, F-ratio = 0.006, p = 0.94). Thus, neither treatment nor group nested within treatment had a meaningful impact on the change in the ratio of false matches to false mismatches, although the bias for false mismatches across pairs of adult images was stronger after either treatment.
Why did the bias for false mismatches increase after experience, or experience and training? Was it because the participants who continued to make errors were a biased subset of the overall pool of participants? Perhaps they had greater difficulty with the task from the beginning, in which case their initial success rate would be lower than that of the people who made no errors after treatment, or perhaps they had a stronger bias for false mismatches from the beginning. An AIC analysis of the full model set (n = 3) including as variables the participant's initial success rate across adult images, and the participant's initial ratio of false matches to false mismatches across adult images, revealed that the second explanation is better supported by the data. The best model for the ratio of false matches to false mismatches across adult images included as a predictor only the participant's original ratio of false matches to false mismatches across adult images (R² = 0.6387, DF = 111, F-ratio = 194.5, p < 0.001, = 0.741), and it was better supported than the other two models (ΔAICc > 2.0 for the other two models). In other words, those participants who continued to misidentify adult images finished with a stronger bias for false mismatches
possible, researchers should estimate intra-observer consistency and inter-observer agreement as indicators of data reliability (Higashide et al. 2012, Ngoprasert et al. 2012). Some inter-observer agreement will arise from chance so we recommend the use of the kappa statistic instead of percent agreement (Forcada and Aguilar 2000, Watkins and Pacheco 2000, Viera and Garrett 2005).
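To illustrate why the kappa statistic is preferable to percent agreement, the sketch below computes Cohen's kappa for two observers who each judge the same image pairs as "same" or "different" bear. The judgments are invented for illustration and the function name is hypothetical; nothing here is taken from the study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty item set")
    n = len(ratings_a)
    # Observed proportion of items on which the observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each observer's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both observers used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on ten image pairs.
observer_1 = ["same"] * 6 + ["different"] * 4
observer_2 = ["same"] * 5 + ["different", "same"] + ["different"] * 3

agreement = sum(a == b for a, b in zip(observer_1, observer_2)) / len(observer_1)
kappa = cohen_kappa(observer_1, observer_2)
print(f"percent agreement = {agreement:.0%}, kappa = {kappa:.2f}")
```

In this invented example the observers agree on 80% of pairs, yet kappa is about 0.58 because roughly half of that agreement would be expected by chance alone.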
We have illustrated that with proper training and assessment it should be possible to engage residents of local communities in the identification of individual wild Andean bears. We believe this will also be true for other species of non-social mammals living at low densities, allowing accurate data on individual identity to be generated through community engagement and citizen science if training and evaluation are sufficiently rigorous and multiple observers are involved. In addition, we have shown that investigators should consider not only the overall frequency of observer error after training, but also observer biases, which are rarely reported in field studies. The benefits, disadvantages, and socio-political dynamics of bear research and conservation vary widely across contexts so the decision to use local people to collect data on individual bear identity will need to be context-specific. Our hope is that by considering observer biases, and engaging local citizens in data collection whenever possible, researchers will not only generate reliable data and replicable results, they will also achieve better conservation outcomes.
already have additional research skills. Others have also seen that minimal training can sometimes circumvent the need for complex technological approaches (Patton and Jones 2008, Schofield et al. 2008, Jones 2010), although the use of image manipulation tools might improve identification success in some circumstances. Due to their original bias, those participants who continued to make errors were twice as likely to make a false mismatch as a false match. False mismatches, which can lead to overestimates of population size (Stevick et al. 2001, Hastings et al. 2008, Goswami et al. 2012), are particularly troublesome given that the Andean bear is vulnerable to extinction (Goldstein et al. 2010).
This suggests that some people will require additional training to achieve acceptable levels of success, or that they should not identify individual bears. To further safeguard against errors, we recommend that whenever possible, the individual identity of any bear be independently assigned by at least three observers of demonstrated ability, whether they are research staff or citizen scientists. We suggest that, as in Mendoza et al. (2011), each observer independently assign individual identity before reviewing the assignments made by the other observers, and finally reaching a consensus assignment of individual identity. The use of multiple observers should also help mitigate differences that may exist among observers in their initial willingness to assign individual identity rather than cautiously withholding judgment as to which animal has been detected (Mendoza et al. 2011). That level of caution may vary not only between observers but may also vary with sample size, such that as the sample size decreases the observer's level of caution may also decrease in a perhaps unconscious attempt to ensure sufficient analytical power. Although it is not obvious to us that variation in caution across observers or sample sizes will create a bias towards matches or mismatches, the impact of such errors will be larger when sample sizes are small, suggesting that evaluation of data quality becomes even more important as data become scarce.
With training, many observers are capable of success over 95%, which we believe should be the minimum level of identification success across studies. As has been suggested by others (Foster and Harmsen 2012), we urge investigators to quantify and report the rates at which they make errors, the types of errors they make, and the potential effects of those errors on their results. In addition, for at least some identification tasks, there is intra-observer variation over time (Bindemann et al. 2012). Thus, assessment of identification success and quantification of error rates should be an ongoing process and not just an evaluation of the efficacy of training. We suggest that images of 'known' identity be inserted amongst the photographs of 'unknown' identity at intervals that cannot be anticipated by the people being assessed; to assess observer performance with sufficient precision at least 20 images of 'known' identity would need to be included in such an evaluation (i.e. 19/20 = 95%). Images of 'known' identity may either be images of captive bears, as we have used, or they may be high quality images of wild bears that can be repeatedly and consistently assigned identity. As data are collected from the field it will likely be impossible to evaluate the true accuracy of individual identification since there may be no independent source of bear identity. However, even when independent measures of accuracy are not