3.1 | Multiangle photography mechanical devices
The multiangle photography mechanical device was developed through three versions (Fig. 2c). In version one, the specimen rotated 360° in one direction on a turntable, and the camera angle was adjusted manually. In version two, the specimen could be rotated in two directions. In version three, the specimen platform and the camera were integrated: the specimen could be rotated automatically and more stably in several directions, and the camera could move in one direction (Video S1). Version three is fast and stable; when capturing multiangle images, the total handling time is about 2 min per specimen.
3.2 | Overview of the Taichi workflow
The overall Taichi workflow can be subdivided into three main steps: species database construction, AI barcoding, and identification of unknown species (Fig. 1b). The first step establishes a database of morphological characteristics for known species through high-throughput multiangle photography of specimens and training of the ML system on the resulting image stacks. The second step generates diagnostic information among species with the ML system, which we name the AI barcode. Aligned multiangle images of specimens from all known species, none of which were used in the first step, are input into the trained ML system. The probability values that the ML system outputs for the aligned images form an array that represents the specimen.
The third step is the detection and confirmation of unknown species status, facilitated by AI barcodes. We input aligned multiangle images of the unidentified specimens, a mix of specimens of known and unknown species, into the trained ML system and obtained their AI barcodes. Based on these AI barcodes, the differences among all specimens were calculated using permutational multivariate analysis of variance (PERMANOVA). The unidentified specimens were thereby classified as belonging to known or unknown species, which is how unknown species were detected. The status (new record or new species) of each unknown species was then confirmed.
3.3 | Taichi Step 1: Database of morphological characteristics of known species
We used two datasets as examples to test the Taichi workflow (Fig. 2a, Table S1). Dataset-1 (Melanopopillia) consists of 4 species and 50 specimens, including all three species known in this genus (M.prae, M.ding, and M.hain) and an undescribed new species (M.sp.n), as an example of finding a previously undescribed species (Fig. 2a). Dataset-2 (Hong Kong beetles) consists of 21 species and 206 specimens, including 4 new record species for Hong Kong, as an example of finding new records for a region (Figs. 2a, 2b). Both datasets included additional specimens of known species as contrast groups (species code plus '-con'). To collect characters from the specimens, we obtained a series of images for each specimen using the multiangle photography mechanical devices (Fig. 2c). A total of 1578 and 150 aligned images were collected per specimen for dataset-1 and dataset-2, respectively, always in the same sequence of photography angles. Within each dataset, images with the same aligned serial number from different specimens therefore show the specimens photographed at the same angle (Fig. S1).
To establish a database of known species, we used randomly selected multiangle images of specimens to train the ML system. After 30 epochs, the best top-1 score of the CNN identification systems reached 97.76% in dataset-1 and 96.22% in dataset-2 (Table S2). The three CNNs other than AlexNet all performed well.
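As a reminder of what the top-1 score measures, the following minimal sketch computes it from per-image class probabilities. The function name `top1_accuracy` and the toy probabilities are illustrative assumptions and are not taken from the trained models in Table S2.

```python
def top1_accuracy(prob_rows, true_labels):
    """Fraction of images whose highest-probability class equals the true label."""
    if not true_labels or len(prob_rows) != len(true_labels):
        raise ValueError("need one probability row per non-empty label list")
    correct = 0
    for probs, label in zip(prob_rows, true_labels):
        # Index of the class with the highest predicted probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(true_labels)


# Toy example (hypothetical values): four images, three candidate species.
probs = [
    [0.90, 0.05, 0.05],  # predicted 0, true 0
    [0.20, 0.70, 0.10],  # predicted 1, true 1
    [0.40, 0.35, 0.25],  # predicted 0, true 1 (a miss)
    [0.10, 0.10, 0.80],  # predicted 2, true 2
]
labels = [0, 1, 1, 2]
score = top1_accuracy(probs, labels)
print(f"top-1 = {score:.2%}")  # prints: top-1 = 75.00%
```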
Any CNN model can be used in this workflow, provided that it achieves good identification performance [38, 46]. Here, the “MobileNetV2” network model was selected as the backbone of the integrated workflow because of its good performance in identifying highly similar species, its high convergence speed, and its high precision [40].
3.4 | Taichi Step 2: AI barcodes as diagnostic information
In Step 2, we selected four or five specimens from each known species, different from the specimens used in Step 1, and input them into the trained ML system; every specimen was represented by 1578 aligned images in dataset-1 or 150 aligned images in dataset-2. After CNN calculation, the probability of each aligned image belonging to each known species was determined (dataset-1: three species; dataset-2: seventeen species). The probabilities of all images of each specimen made up the AI barcode (Figs. 1b, 3a). To visualize the AI barcodes, the t-distributed stochastic neighbor embedding (t-SNE) algorithm was employed to display the similarities and differences between pairs of species and specimens in the reduced-dimensionality feature space [41, 42].
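The construction of an AI barcode can be sketched as follows. This is a minimal illustration that assumes the barcode is stored as a flat array in angle order; the actual storage layout is not specified here, and the function names and toy probabilities are hypothetical. Under this assumption, a dataset-1 barcode would hold 1578 × 3 = 4734 values and a dataset-2 barcode 150 × 17 = 2550 values.

```python
def build_ai_barcode(image_probs, n_species):
    """Concatenate per-image species probabilities, kept in angle order."""
    barcode = []
    for angle_id, probs in enumerate(image_probs):
        if len(probs) != n_species:
            raise ValueError(f"angle {angle_id}: expected {n_species} probabilities")
        barcode.extend(probs)
    return barcode


def mean_species_probability(image_probs):
    """Average probability assigned to each known species across all angles."""
    n_images = len(image_probs)
    n_species = len(image_probs[0])
    return [sum(row[s] for row in image_probs) / n_images for s in range(n_species)]


# Toy specimen (hypothetical values): 4 aligned angles, 3 known species.
specimen = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
]
barcode = build_ai_barcode(specimen, n_species=3)
summary = mean_species_probability(specimen)
print(len(barcode), [round(p, 3) for p in summary])
```

Because the images are aligned, position k in every barcode refers to the same photography angle and the same candidate species, which is what makes barcodes from different specimens directly comparable.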
We observed a noticeable variation in AI barcodes among species (Fig. 3a). More precisely, in dataset-1, the five replicated specimens of M.prae, M.ding, and M.hain all showed a high probability for their own species (Fig. 3b), and the three known species were separated from each other (Fig. 3c). Dataset-2 displayed similar trends (Figs. 3d, 3e), except for a few specimens: specimen 5 of Carab-3, specimen 4 of Elate-2, specimen 1 of Elate-3, and specimen 5 of Scara-3 showed relatively dispersed AI barcodes, and specimen 5 of Carab-3-con showed a high probability of belonging to Carab-2 species (Fig. 3d, Fig. S2). In the t-SNE plot, the 17 known species were still separated from each other (Fig. 3e). Although several specimens showed a range of variations, the feature space of the species, including all specimens, still showed clear separation.
Among all angles of the AI barcodes of dataset-1, the top run of consecutive angles was ID 185 to 224, which mainly show the elytron and pygidium; the latter is considered a key character for distinguishing species within the genus [34, 35].
3.5 | Taichi Step 3: Unknown species detection and status confirmation
In Step 3, image stacks of unidentified specimens (new species or new record species, together with known contrast species) were input into the trained ML system to obtain their AI barcodes. Comparing the resulting AI barcodes with those of the known species determined in Step 2, we found that, whereas the AI barcodes of known species concentrate high probability in one species, those of new species or new record species usually show a relatively dispersed distribution pattern. In the t-SNE plots, new species or new record species were clearly separated from all known species, while contrast specimens of known species overlapped with their own species. In this way, the unknown species could be detected.
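The contrast between a concentrated and a dispersed barcode can be made concrete with a simple summary statistic. The sketch below uses normalized Shannon entropy of a specimen's mean species probabilities; this is only an illustration chosen for this example, not the statistic used in the workflow (which relies on PERMANOVA), and the probability values are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]: 0 = all mass on one species, 1 = uniform."""
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two candidate species")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


# Hypothetical mean species probabilities for two specimens.
known_like = [0.97, 0.02, 0.01]    # concentrated on one known species
unknown_like = [0.40, 0.35, 0.25]  # spread over several known species

h_known = normalized_entropy(known_like)
h_unknown = normalized_entropy(unknown_like)
print(f"known-like: {h_known:.3f}, unknown-like: {h_unknown:.3f}")
```

A higher value indicates that the classifier spreads its probability across several known species, which is the dispersed pattern described above for new species and new record species.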
Specifically, M.prae-con in dataset-1 showed an AI barcode spectrum similar to that of M.prae, whereas M.sp.n showed relatively inconsistent patterns (Fig. 3b). In the t-SNE plot, M.prae and M.prae-con almost overlapped with each other, while M.sp.n was represented by the largest circle, which was very different from the small and concentrated circles of the known species (Fig. 3c). The same pattern appeared in dataset-2: the new record species (Hydro-1-n.r., Hybos-1-n.r., Scara-5-n.r., and Scara-6-n.r.) all displayed dispersed AI barcodes, but each species had its own distribution pattern. In the t-SNE plot, the four new record species were separated from all known species but showed different distribution patterns: Scara-5-n.r. and Scara-6-n.r. were located close to the known species Scara-2; Hybos-1-n.r. and Hydro-1-n.r. were located close to the known species Scara-1 and/or Morde-1 (Fig. 3e). At the same time, the four contrast known species (Scara-3-con, Teneb-1-con, Carab-4-con, and Carab-3-con) all showed high probability for their own species and overlapped with their corresponding species in the t-SNE plot. The results of dataset-2 showed that the new record species have AI barcodes that differ not only from those of all known species but also from those of each other.
To quantify the differences among specimens and species, we analyzed the AI barcodes using PERMANOVA with 1,000 permutations [43–45]. The adjusted p values are shown in Fig. 4a. Overall, p.adj = 0.05 could be used as the significance threshold to distinguish species groups. All known species and new species/new record species were significantly separated from each other, and the contrast known species were correctly grouped with their corresponding species.
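The logic of a two-group PERMANOVA on AI barcodes can be sketched as follows. This is a minimal pure-Python illustration, assuming Euclidean distance between barcodes and omitting the multiple-testing adjustment; the distance measure, the function names, and the toy barcodes are assumptions made for this example rather than details of the analysis reported here.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        pairs = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(samples, labels, n_perm=1000, seed=0):
    """Return (observed pseudo-F, permutation p value) for the given grouping."""
    n = len(samples)
    dist = [[euclidean(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    f_obs = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign specimens to groups at random
        if pseudo_f(dist, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)


# Toy barcodes (hypothetical): 5 specimens of a known species, 5 of a candidate.
known = [[0.90, 0.05, 0.05], [0.88, 0.07, 0.05], [0.92, 0.04, 0.04],
         [0.85, 0.10, 0.05], [0.91, 0.05, 0.04]]
novel = [[0.40, 0.35, 0.25], [0.45, 0.30, 0.25], [0.35, 0.40, 0.25],
         [0.50, 0.30, 0.20], [0.38, 0.32, 0.30]]
f_obs, p_value = permanova(known + novel, ["known"] * 5 + ["novel"] * 5)
print(f"pseudo-F = {f_obs:.1f}, p = {p_value:.4f}")
```

With only five specimens per group, the number of distinct label arrangements is small, so the attainable p values are coarse; this is worth keeping in mind when reading p values close to the threshold.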
For dataset-1, the M.sp.n group was significantly separated (p.adj < 0.05) from all known species of Melanopopillia (M.prae, M.ding, and M.hain), and M.prae-con showed no significant difference (p.adj = 0.15) from M.prae (Fig. 4a). For dataset-2, all known species were separated (p.adj < 0.05) from each other, and the “new” species Hydro-1, Hybos-1, Scara-5, and Scara-6 were separated (p.adj < 0.05) from all known species and from each other. The contrast groups Carab-3-con, Carab-4-con, Scara-3-con, and Teneb-1-con showed no significant difference (p.adj = 0.671, 0.929, 0.087, and 0.218, respectively) from their corresponding species (Fig. 4b).
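The decision rule applied to these values can be written out explicitly. The short sketch below encodes the reported contrast-group p.adj values, assuming they are listed in the same order as the groups, and checks each against the 0.05 threshold; the function name `same_species` is hypothetical.

```python
ALPHA = 0.05  # significance threshold on the adjusted p value

# Reported p.adj for each contrast group versus its corresponding known species
# (order of the dataset-2 values assumed to match the order of the groups).
reported_p_adj = {
    ("M.prae-con", "M.prae"): 0.15,
    ("Carab-3-con", "Carab-3"): 0.671,
    ("Carab-4-con", "Carab-4"): 0.929,
    ("Scara-3-con", "Scara-3"): 0.087,
    ("Teneb-1-con", "Teneb-1"): 0.218,
}


def same_species(p_adj, alpha=ALPHA):
    """True when two groups are not significantly separated at the threshold."""
    return p_adj >= alpha


for (contrast, known), p_adj in reported_p_adj.items():
    verdict = "grouped with" if same_species(p_adj) else "separated from"
    print(f"{contrast} {verdict} {known} (p.adj = {p_adj})")
```

Scara-3-con (p.adj = 0.087) lies closest to the threshold, so it is the contrast group whose assignment would be most sensitive to a stricter cutoff.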
3.6 | Confirmation of new species or new records based on traditional examination
The inferred new species in dataset-1 was later studied by morphological comparison and DNA barcoding. The morphological comparison suggested that the new species is indeed similar to its known congeners but differs in the punctation of the pronotum, the elytral striae, and the basic shape of the aedeagus in males. Among the three known species, M.hain shared the most characters with this new species. The results of DNA barcoding analysis also support this inferred new species (Text S1, Table S3, Fig. S3). The description of this new species and the revision of the genus Melanopopillia will be published in the future.
In dataset-2, the four new record species shared the same order (Hydro-1-n.r.), superfamily, subfamily, or genus with known species; therefore, they probably have different morphological distances from the known species. Specifically, Hybos-1-n.r. belongs to the superfamily Scarabaeoidea, as do Scara-1, Scara-2, Scara-3, and Scara-4; Scara-6-n.r. belongs to the subfamily Rutelinae, as does Scara-3; and Scara-5-n.r. belongs to the genus Sophrops, as does Scara-2. The true classification status and similarities of these new species and records are also reflected, to a certain extent, in the AI barcode results, such as M.sp.n being close to M.hain and Scara-5-n.r. being close to Scara-2.