EvoPose2D: Pushing the Boundaries of 2D Human Pose Estimation using Accelerated Neuroevolution with Weight Transfer

Neural architecture search has proven to be highly effective in the design of efficient convolutional neural networks that are better suited for mobile deployment than hand-designed networks. Hypothesizing that neural architecture search holds great potential for human pose estimation, we explore the application of neuroevolution, a form of neural architecture search inspired by biological evolution, in the design of 2D human pose networks for the first time. Additionally, we propose a new weight transfer scheme that enables us to accelerate neuroevolution in a flexible manner. Our method produces network designs that are more efficient and more accurate than state-of-the-art hand-designed networks. In fact, the generated networks process images at higher resolutions using less computation than previous hand-designed networks at lower resolutions, allowing us to push the boundaries of 2D human pose estimation. Our base network designed via neuroevolution, which we refer to as EvoPose2D-S, achieves comparable accuracy to SimpleBaseline while being 50% faster and 12.7x smaller in terms of file size. Our largest network, EvoPose2D-L, achieves new state-of-the-art accuracy on the Microsoft COCO Keypoints benchmark, is 4.3x smaller than its nearest competitor, and has similar inference speed. The code is publicly available at https://github.com/wmcnally/evopose2d.


I. INTRODUCTION
Two-dimensional human pose estimation is a visual recognition task dealing with the autonomous localization of anatomical human joints, or more broadly, "keypoints," in RGB images and video [1]- [5]. It is widely considered a fundamental problem in computer vision due to its many downstream applications, including action recognition [6]- [11] and human tracking [12]- [14]. In particular, it is a precursor to 3D human pose estimation [15]- [17], which serves as a potential alternative to invasive marker-based motion capture.
In line with other streams of computer vision, the use of deep learning [18], [19], and specifically deep convolutional neural networks [20] (CNNs), has been prevalent in 2D human pose estimation [1], [2], [14], [21]- [24]. The most accurate 2D human pose estimation methods use a two-stage, top-down pipeline, where an off-the-shelf person detector is first used to detect human instances in an image, and the 2D human pose network is run over the person detections to obtain keypoint predictions [14], [23], [24]. This paper focuses on the latter stage of this commonly used top-down pipeline, but we emphasize that our method is applicable to the design of bottom-up human pose estimation networks [22], [25] as well.
Recently, there has been a growing interest in the use of machines to help design CNN architectures through a process known as neural architecture search (NAS) [26]- [29]. NAS removes human bias from the design process and permits the automated exploration of diverse network architectures that often transcend human intuition and provide greater accuracy using less computation. Moreover, networks designed using NAS often have fewer parameters [30], which reduces the need for expensive main memory access on embedded hardware designed with small memory caches [31]. Despite the widespread success of NAS in many areas of computer vision [32]- [38], the design of 2D human pose networks has remained, for the most part, human-principled.
In this study, we explore the application of neuroevolution [39], a realization of NAS inspired by evolution in nature, to 2D human pose estimation for the first time. To run large-scale NAS experiments within a practical timeframe, we propose a new weight transfer scheme that is highly flexible and accelerates neuroevolution. We exploit this weight transfer scheme, along with large-batch training on high-bandwidth Tensor Processing Units (TPUs), to run fast neuroevolutions within a search space geared towards 2D human pose estimation. Our neuroevolution framework produces a 2D human pose network that has a relatively simple design, provides state-of-the-art accuracy when scaled, and uses fewer floating-point operations (FLOPs) and parameters than the best performing networks in the literature (see Fig. 1). The key contributions of this research are summarized as follows:
• We propose a new weight transfer scheme to accelerate neuroevolution and apply neuroevolution to 2D human pose estimation for the first time. In contrast to previous neuroevolution methods that exploit weight transfer, our method is not constrained by complete function preservation [40], [41]. Despite relaxing this constraint, our experiments indicate that the level of functional preservation afforded by our weight transfer scheme is sufficient to provide fitness convergence, thereby simplifying neuroevolution and making it more flexible.
• We present empirical evidence that large-batch training (i.e., a batch size of 2048) can be used in conjunction with the Adam optimizer [42] to accelerate the training of 2D human pose networks with no loss in accuracy. We reap the benefits of large-batch training in our neuroevolution experiments by maximizing training throughput on high-bandwidth TPUs.
• We design a search space conducive to 2D human pose estimation and leverage the above contributions to run a fast full-scale neuroevolution of 2D human pose networks (∼1 day using eight v2-8 TPUs).
As a result, we are able to produce a computationally efficient 2D human pose estimation model that achieves state-ofthe-art accuracy on the most widely used benchmark dataset.

II. RELATED WORK
This work draws upon several areas of deep learning research to engineer a high-performing 2D human pose estimation model. We review the three most relevant areas of the literature in the following sections.

FIGURE 1. A comparison of the accuracy, size, and computational cost of EvoPose2D, SimpleBaseline [14], and HRNet [24] at different scales. The circle size is proportional to the network file size. EvoPose2D-S provides comparable accuracy to SimpleBaseline (ResNet-50), is 12.7x smaller, and uses 4.9x fewer FLOPs. At full scale, EvoPose2D-L obtains state-of-the-art accuracy using 2.0x fewer FLOPs and is 4.3x smaller compared to HRNet-W48. In contrast to the referenced methods, our EvoPose2D results do not make use of model-agnostic enhancements such as ImageNet pretraining, half-body augmentation, or non-maximum suppression during post-processing.

A. LARGE-BATCH TRAINING OF DEEP NEURAL NETWORKS
It has been shown that training deep neural networks using large batch sizes with stochastic gradient descent causes a degradation in the quality of the model as measured by its ability to generalize to unseen data [43], [44]. Recently, Goyal et al. [45] implemented measures for mitigating the training difficulties caused by large batch sizes, including linear scaling of the learning rate and an initial warmup period during which the learning rate is gradually increased. Maximizing training efficiency using large-batch training is critical when the computational demand of training is very high, such as in neural architecture search. However, deep learning methods are often data-dependent and task-dependent, so it remains unclear whether the training measures imposed by Goyal et al. for image classification apply in the general case. It is also unclear whether the learning rate modifications are applicable to optimizers that use adaptive learning rates. Adam [42] is an example of such an optimizer and is widely used in 2D human pose estimation. In this paper, we empirically investigate the use of large batch sizes in conjunction with the Adam optimizer in the training of 2D human pose networks.

B. 2D HUMAN POSE ESTIMATION USING DEEP LEARNING
The first use of deep learning for human pose estimation came in 2014, when Toshev and Szegedy [1] regressed 2D keypoint coordinates directly from RGB images using a cascade of deep CNNs. Arguing that the direct regression of pose vectors from images was a highly non-linear and difficult-to-learn mapping, Tompson et al. [2] introduced the notion of learning a heatmap representation. Mean squared error (MSE) was used to minimize the distance between the predicted and target heatmaps, where the targets were generated using Gaussians with small variance centered on the ground-truth keypoint coordinates.
Several of the methods that followed built upon iterative heatmap refinement in a multi-stage fashion, including intermediate supervision [21], [22], [46]. Remarking on the inefficiencies associated with multi-stage stacking, Chen et al. [23] proposed the Cascaded Pyramid Network (CPN), a holistic network constructed using a ResNet-50 [47] feature pyramid [48]. Xiao et al. [14] presented yet another single-stage architecture called SimpleBaseline, which stacked transpose convolutions on top of ResNet. Sun et al. [24] demonstrated with HRNet that maintaining high-resolution features throughout the entire network could provide greater accuracy. HRNet represents the state of the art in 2D human pose estimation among peer-reviewed works at the time of writing.
An issue surrounding the 2D human pose estimation literature is that it is often difficult to make fair comparisons of model performance due to the heavy use of model-agnostic improvements. Examples include the use of different learning rate schedules [24], [49], more data augmentation [49], [50], loss functions that target more challenging keypoints [23], specialized post-processing steps [51], [52], or more accurate person detectors [49], [52]. These discrepancies in training algorithms can potentially account for the reported differences in accuracy. To directly compare our pose estimation architectures with the state-of-the-art, we re-implement SimpleBaseline [14] and HRNet [24] and train all networks under the same settings using the same hardware.

C. NEUROEVOLUTION
Neuroevolution is a form of neural architecture search that harnesses evolutionary algorithms to search for optimal network architectures [53], [54]. Network morphisms [41] and function-preserving mutations [40] are techniques that reduce the computational cost of neuroevolution. In essence, these methods iteratively mutate networks and perform weight transfer in such a way that the function of the network is completely preserved upon mutation, i.e., the output of the mutated network is identical to that of the parent network. Ergo, the mutated child networks need only be trained for a relatively small number of steps compared to when training from a randomly initialized state. As a result, these techniques are capable of reducing the search time to a matter of GPU days. However, function-preserving mutations can be challenging to implement and also restricting (e.g., complexity cannot be reduced [40]). Our proposed weight transfer scheme serves as a more flexible alternative that addresses these issues, is effective in accelerating neuroevolution, and has a simpler implementation.
We briefly discuss the important distinctions between neuroevolution methods that leverage weight transfer, and alternative NAS approaches that leverage weight sharing, such as ENAS [55] and DARTS [56]. Weight sharing approaches are sometimes referred to as one-shot architecture search [57], because architectures are sampled from a single, over-parameterized supergraph encompassing the entire search space (one-shot model). The search is performed over a single training run of the supergraph, where subgraphs are selected, evaluated using the supergraph weights, and then ranked. The best performing subgraph is finally trained from scratch. One-shot methods are based around the hypothesis that the ranking of the candidate subgraphs correlates with their true ranking following final training. However, Yu et al. observe that this correlation is very weak, and ultimately find that ENAS and DARTS perform no better than a random search [58]. Moreover, some one-shot methods require the entire supergraph to be kept in memory, which inherently limits the size of the search space. These issues are not a concern in neuroevolution because the candidate architectures are trained separately and thus do not share weights. In a recent benchmarking of NAS algorithms, neuroevolution methods were among the top performing algorithms and consistently outperformed random search [59].

NAS algorithms have predominantly been developed and evaluated on small-scale image datasets [28]. The use of NAS in more complex visual recognition tasks remains limited, in large part because the computational demands make it impractical. This is especially true for 2D human pose estimation, where training a single model can take several days [23]. Nevertheless, the use of NAS in the design of 2D human pose networks has been attempted in a few cases [60]- [62].
Although some of the resulting networks provided superior computational efficiency as a result of having fewer parameters and operations, none managed to surpass the best performing hand-crafted networks in terms of accuracy.

III. ACCELERATING NEUROEVOLUTION USING WEIGHT TRANSFER
Suppose that a pretrained "parent" neural network is represented by the function P(x | θ^(P)), where x is the input to the network and θ^(P) are its learned parameters. The foundation of the proposed neuroevolution framework lies in the process by which the unknown parameters θ^(C) in a mutated child network C are inherited from θ^(P) such that C(x | θ^(C)) ≈ P(x | θ^(P)). That is, the output, or "function," of the mutated child network is similar to that of the parent, but not necessarily equal. To enable fast neural architecture search, the degree to which the parent's function is preserved must be sufficient to allow θ^(C) to be trained to convergence in a small fraction of the number of steps normally required when training from a randomly initialized state.
To formalize the proposed weight transfer in the context of 2D convolution, we denote W^(l) ∈ R^(kp1 × kp2 × ip × op) as the weights used by layer l of the parent network, and V^(l) ∈ R^(kc1 × kc2 × ic × oc) as the weights of the corresponding layer in the mutated child network, where k is the kernel size, i is the number of input channels, and o is the number of output channels. For the sake of brevity, we consider the special case when k_p1 = k_p2 = k_p, k_c1 = k_c2 = k_c, and o_p = o_c, but the following definition can easily be extended to the general case. When k_c = k_p and i_c ≤ i_p, the inherited weights are simply the slice V_W = W[:, :, :i_c, :]. V_W is transferred to V and the remaining non-inherited weights in V are randomly initialized. An illustration of the weight transfer between two convolutional layers is shown in Fig. 2. In principle, the proposed weight transfer can be used with convolutions of any dimensionality (e.g., 1D, 2D, or 3D convolutions), and is permitted between convolutional operators with different kernel size, stride, dilation, input channels, and output channels. More generally, it can be applied to any operations with learnable parameters, including batch normalization and dense layers.

FIGURE 2. Two examples (W → V1, W → V2) of the weight transfer used in the proposed neuroevolution framework. The trained weights (shown in blue) in the parent convolutional filter W are transferred, either in part or in full (V_W), to the corresponding filter V in the mutated child network. The weight transfer extends to all output channels in the same manner as depicted here for input channels.
In essence, the proposed weight transfer method relaxes the function-preservation constraint imposed in [40], [41]. In practice, we find that the proposed weight transfer preserves the majority of the function of deep CNNs following mutation. This enables us to perform network mutations in a simple and flexible manner while maintaining good parameter initialization in the mutated network. As a result, the mutated networks can be trained using fewer iterations, which accelerates the neuroevolution.
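As an illustrative sketch of the scheme (not the released implementation), the weight transfer between a parent and child convolutional layer can be realized by copying the overlapping slice of the parent kernel along every axis and randomly initializing the rest. The function name and the He-style initialization below are our own choices:

```python
import numpy as np

def transfer_weights(parent_w, child_shape, rng=None):
    """Copy the overlapping region of a parent conv kernel into a child kernel.

    parent_w: parent weights with shape (kh, kw, in_ch, out_ch).
    child_shape: shape of the mutated child layer's weights.
    Entries of the child with no parent counterpart are randomly initialized.
    """
    rng = rng or np.random.default_rng(0)
    # He-style random init for the non-inherited weights.
    fan_in = child_shape[0] * child_shape[1] * child_shape[2]
    child_w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=child_shape)
    # Overlap along every axis: kernel height/width, input and output channels.
    overlap = tuple(min(p, c) for p, c in zip(parent_w.shape, child_shape))
    slices = tuple(slice(0, o) for o in overlap)
    child_w[slices] = parent_w[slices]  # V_W is transferred to V
    return child_w
```

This handles growing and shrinking mutations symmetrically: whatever sub-tensor the two layers have in common is inherited, and only the new capacity starts from scratch.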

IV. FAST NEUROEVOLUTION OF 2D HUMAN POSE NETWORKS
This section includes the engineering details for our neuroevolution implementation that leverages the proposed weight transfer scheme to accelerate the evolution of a 2D human pose network. While we focus on the application of 2D human pose estimation, we note that our neuroevolution approach is generally applicable to all types of deep networks.

A. SEARCH SPACE
Neural architecture search helps moderate human involvement in the design of deep neural networks. However, neural architecture search is by no means fully automatic. To some extent, our role transitions from a network designer to a search designer. Decisions regarding the search space are particularly important because the search space encompasses all possible solutions to the optimization problem, and its size correlates with the amount of computation required to thoroughly explore the space. As such, it is common to exploit prior knowledge in order to reduce the size of the search space and ensure that the sampled architectures are tailored toward the task at hand [63].
Motivated by the simplicity and elegance of the SimpleBaseline architecture [14], we search for an optimal human pose estimation backbone using a search space inspired by [30], [32]. Specifically, the search space encompasses a single-branch hierarchical structure that includes seven modules stacked in series. Each module is constructed of chain-linked inverted residual blocks [31] that use an expansion ratio of six and squeeze-excitation [64]. For each module, we search for the optimal kernel size, number of inverted residual blocks, and number of output channels. Considering the newfound importance of spatial resolution in the deeper layers of 2D human pose networks [24], we additionally search for the optimal stride of the last three modules. Without going into too much detail, our search space can produce 10^14 unique backbones. To complete the network, an initial convolutional layer with 32 output channels precedes the seven modules, and three transpose convolutions with a kernel size of 3x3, stride of 2, and 128 output channels are used to construct the network head. A diagram of the search space is provided in Fig. 3. Additional search space details are provided in Appendix A-A.
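To make the structure of the search space concrete, a candidate backbone can be encoded as a simple genotype with one gene per module. The value ranges below are hypothetical placeholders for illustration, not the exact choices used in the search:

```python
import random

# Illustrative per-module choices; the real search space ranges may differ.
KERNELS = [3, 5, 7]
BLOCK_COUNTS = [1, 2, 3, 4]
CHANNEL_CHOICES = [16, 24, 32, 48, 64, 96, 128]
STRIDES = [1, 2]

def sample_genotype(rng=None):
    """Sample a random backbone genotype: seven modules, each described by
    kernel size, number of inverted residual blocks, output channels, and
    (for the last three modules only) a searchable stride."""
    rng = rng or random.Random(0)
    genotype = []
    for module in range(7):
        genotype.append({
            "kernel": rng.choice(KERNELS),
            "blocks": rng.choice(BLOCK_COUNTS),
            "channels": rng.choice(CHANNEL_CHOICES),
            # Stride is only searched for modules 5-7; earlier strides fixed
            # (the fixed value here is a placeholder assumption).
            "stride": rng.choice(STRIDES) if module >= 4 else 2,
        })
    return genotype
```

A mutation then amounts to perturbing one or more genes and rebuilding the network, with the weight transfer above recovering the shared parameters.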

B. FITNESS
To strike a balance between computational efficiency and accuracy, we perform a multi-objective optimization that minimizes a fitness function including the validation loss and the number of network parameters. Given a 2D pose network represented by the function N(x | θ^(N)), the loss L_i for a single RGB input image I ∈ R^(h×w×3) and corresponding target heatmap S ∈ R^(h'×w'×K) is given by:

L_i = (1/K) Σ_{k=1}^{K} 1(v_k > 0) ||Ŝ_k − S_k||²  (1)

where K is the number of keypoints, Ŝ is the predicted heatmap, and v represents the keypoint visibility flags. The target heatmaps S are generated by centering 2D Gaussians with a standard deviation of h'/64 pixels on the ground-truth keypoint coordinates and normalizing to a maximum intensity of 255. The overall validation loss is computed as:

L = (1/N) Σ_{i=1}^{N} L_i  (2)

where N is the number of image samples in the validation dataset. Finally, the fitness of a network N is given by:

F(N) = L · (n(θ^(N)) / T)^Γ  (3)

where n(θ^(N)) is the number of parameters in N, T is the target number of parameters, and Γ controls the fitness trade-off between the number of parameters and the validation loss. Minimizing the number of parameters instead of the number of floating-point operations (FLOPs) allows us to indirectly minimize FLOPs while not penalizing mutations that decrease the stride of the network.
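The loss and fitness computations described above can be sketched as follows; the function names are ours, and the masking and normalization details are illustrative assumptions rather than the exact released implementation:

```python
import numpy as np

def keypoint_mse(pred, target, visibility):
    """Masked MSE between predicted and target heatmaps of shape (h', w', K):
    keypoints with visibility flag 0 do not contribute to the loss."""
    mask = (visibility > 0).astype(pred.dtype)           # (K,)
    per_kp = ((pred - target) ** 2).mean(axis=(0, 1))    # MSE per keypoint, (K,)
    return (per_kp * mask).mean()

def fitness(val_loss, num_params, target_params, gamma):
    """Multi-objective fitness (lower is better): validation loss scaled by a
    penalty on the parameter count n relative to the target T."""
    return val_loss * (num_params / target_params) ** gamma
```

With this form, a network exactly at the parameter target is judged purely on validation loss, while larger networks must earn their extra parameters with proportionally lower loss.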

C. EVOLUTIONARY STRATEGY
The evolutionary strategy proceeds as follows. In generation "0", a common ancestor network is manually defined and trained from scratch for e_0 epochs. In generation 1, λ children are generated by mutating the ancestor network. The mutation details are provided in Appendix A-B. The weight transfer outlined in Section III is performed between the ancestor and each child (additional implementation details are provided in Appendix A-C), after which the children's weights are trained for e epochs (e ≪ e_0). At the end of generation 1, the µ networks with the best fitness from the pool of (λ + 1) networks (children + ancestor) become the parents in the next generation. In generation 2 and beyond, the mutation → weight transfer → training process is repeated and the top-µ networks from the pool of (λ + µ) networks (children + parents) become the parents in the next generation. The evolution continues until manual termination, typically after the fitness has converged.
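The (µ + λ) strategy described above can be sketched as follows, where `train`, `evaluate`, and `mutate` are stand-ins for the actual training, fitness evaluation, and mutation-with-weight-transfer routines:

```python
import random

def evolve(ancestor, train, evaluate, mutate, mu=8, lam=32, generations=10):
    """(mu + lambda) evolutionary strategy sketch.

    train(net, epochs) trains the network in place, evaluate(net) returns its
    fitness (lower is better), and mutate(parent) returns a child network whose
    weights were transferred from the parent.
    """
    train(ancestor, epochs=30)              # generation 0: train from scratch
    parents = [ancestor]                    # generation 1 pool is (lambda + 1)
    for gen in range(1, generations + 1):
        children = [mutate(random.choice(parents)) for _ in range(lam)]
        for child in children:
            train(child, epochs=5)          # short fine-tune via weight transfer
        pool = children + parents           # (lambda + mu) in later generations
        parents = sorted(pool, key=evaluate)[:mu]
    return min(parents, key=evaluate)
```

Because surviving parents re-enter the pool each generation, the best fitness found so far is never lost, and termination can simply be triggered once it plateaus.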

D. LARGE-BATCH TRAINING
Even with the computational savings afforded by weight transfer, running a full-scale neuroevolution of 2D human pose networks at a standard input resolution of 256x192 would not be feasible within a practical time-frame using common GPU resources (e.g., 8-GPU server). To reduce the search time to within a practical range, we exploit large batch sizes when training 2D human pose networks on TPUs. In line with [45], we linearly scale the learning rate with the batch size and gradually ramp-up the learning rate during the first few epochs. In Section V-B, we empirically demonstrate that this training regimen can be used in conjunction with the Adam optimizer [42] to train 2D human pose networks up to a batch size of 2048 with no loss in accuracy. To our best knowledge, the largest batch size previously used to train a 2D human pose network was 256, which required 8 GPUs [49].

E. COMPOUND SCALING
It has been shown recently that scaling a network's resolution, width (channels), and depth (layers) together is more efficient than scaling one of these dimensions individually [32]. Motivated by this finding, we scale the base network found through neuroevolution to different input resolutions using the following depth (c_d) and width (c_w) coefficients:

c_d = α^φ,  c_w = β^φ,  φ = log(r / r_s) / log(γ)  (4)

where r_s is the search resolution, r is the desired resolution, and α, β, γ are scaling parameters. For convenience, we use the same scaling parameters as in [32] (α = 1.2, β = 1.1, γ = 1.15) but hypothesize that better results could be obtained if these parameters were tuned.
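Assuming an EfficientNet-style exponent φ = log(r/r_s)/log(γ), the depth and width coefficients for a target resolution can be computed as in the sketch below (the exact formulation in the appendix may differ):

```python
import math

def scaling_coefficients(r, r_s, alpha=1.2, beta=1.1, gamma=1.15):
    """Compound-scaling coefficients for moving the base network from the
    search resolution r_s to a target resolution r.

    Returns (c_d, c_w): multipliers for network depth and width, tied to the
    resolution change through the shared exponent phi."""
    phi = math.log(r / r_s) / math.log(gamma)
    return alpha ** phi, beta ** phi
```

For example, scaling from a 256-pixel search resolution to 384 pixels yields a depth multiplier larger than the width multiplier, since α > β.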

V. EXPERIMENTS

A. DATASETS

1) Microsoft COCO
The 2017 Microsoft COCO Keypoints dataset [65] is the predominant dataset used to evaluate 2D human pose estimation models. It contains over 200k images and 250k person instances labeled with 17 keypoints. We fit our models to the training subset, which contains 57k images and 150k person instances. We evaluate our models on both the validation and test-dev sets, which contain 5k and 20k images, respectively. We report the standard average precision (AP) and average recall (AR) scores based on Object Keypoint Similarity (OKS).

2) PoseTrack
PoseTrack [13] is a large-scale benchmark for 2D human pose estimation and tracking in video. The dataset contains 1,356 video sequences, 46k annotated frames, and 276k person instances. The dataset was converted to COCO format and the COCO evaluation toolbox (pycocotools) was used to evaluate the accuracy in the multi-person human pose estimation task (i.e., using the same accuracy metrics as above). In experiments, we train our models on the 2018 training set (97k person instances), and evaluate on the 2018 validation set (45k person instances) using ground-truth bounding boxes.

B. LARGE-BATCH TRAINING OF 2D HUMAN POSE NETWORKS ON TPUS
To maximize training throughput on TPUs, we run experiments to investigate the training behaviour of 2D human pose networks using larger batch sizes than have been used previously. For these experiments, we re-implement the SimpleBaseline model of Xiao et al. [14] and train it on the Microsoft COCO dataset. The SimpleBaseline network stacks three transpose convolutions with 256 channels and kernel size of 3x3 on top of a ResNet-50 backbone, which is pretrained on ImageNet [66]. We run the experiments at an input resolution of 256 × 192, which yields output heatmap predictions of size 64 × 48. According to the TensorFlow profiler used, this model has 34.1M parameters and 5.21G FLOPs.

1) Implementation Details
The following experimental setup was used to obtain the results for all models trained on COCO in this paper. Additional implementation details for neuroevolution and PoseTrack training are provided in Sections V-C and V-C2, respectively.

Preprocessing. The RGB input images were first normalized to a range of [0, 1], then centered and scaled using the ImageNet pixel means and standard deviations. The images were then transformed and cropped to the input size of the network. During training, random horizontal flipping, scaling, and rotation were used for data augmentation. The exact data augmentation configuration is provided in the linked code.
Training. The networks were trained for 200 epochs using the bfloat16 floating-point format, which consumes half the memory compared to the commonly used float32. The loss represented in Eq. (2) was minimized using the Adam optimizer [42] with a cosine-decay learning rate schedule [67] and L2 regularization with 1e−5 weight decay. The base learning rate l_r was set to 2.5e−4 and was scaled to l_r · n/32, where n is the global batch size. Additionally, a warmup period was implemented by gradually increasing the learning rate from l_r to l_r · n/32 over the first five epochs. The validation loss was evaluated after every epoch using the ground-truth bounding boxes.
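The schedule described above (linear scaling of the base rate with batch size, warmup over the first epochs, then cosine decay) can be sketched as follows; the exact ramp shape in the released code may differ:

```python
import math

def learning_rate(step, steps_per_epoch, batch_size, base_lr=2.5e-4,
                  warmup_epochs=5, total_epochs=200):
    """Learning rate at a given training step: warmup from base_lr to the
    linearly scaled rate (base_lr * n / 32), then cosine decay to zero."""
    scaled_lr = base_lr * batch_size / 32
    epoch = step / steps_per_epoch
    if epoch < warmup_epochs:
        # Linear ramp from the base rate to the scaled rate.
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * scaled_lr * (1 + math.cos(math.pi * progress))
```

At a batch size of 2048 this peaks at 2.5e−4 · 64 = 0.016 after warmup, which is the regime the large-batch experiments below operate in.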
Testing. The common two-stage, top-down pipeline was used during testing [14], [23], [24]. We use the same detections as [14], [24] and follow the standard testing protocol: the predicted heatmaps from the original and horizontally flipped images were averaged and the keypoint predictions were obtained after applying a quarter offset in the direction from the highest response to the second highest response. We do not use non-maximum suppression.
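The quarter-offset refinement described above can be sketched for a single keypoint heatmap as follows (our own minimal implementation, ignoring the flip averaging and bounding-box rescaling steps):

```python
import numpy as np

def decode_heatmap(hm):
    """Return (x, y) keypoint coordinates from one heatmap: take the argmax
    location and shift it by 0.25 px toward the neighbouring pixel with the
    higher response, a standard sub-pixel refinement."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    fx, fy = float(x), float(y)
    if 0 < x < hm.shape[1] - 1:
        fx += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
    if 0 < y < hm.shape[0] - 1:
        fy += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
    return fx, fy
```

The resulting coordinates are then mapped back to the original image through the inverse of the crop transform applied during preprocessing.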

2) Large Batch Training Results
The batch size was doubled from an initial batch size of 256 until the memory of the v3-8 TPU was exceeded. The maximum batch size attained was 2048. The loss curves for the corresponding training runs are shown in Fig. 4. While the final training loss increased marginally with batch size, the validation losses converged in the latter part of training, signifying that the networks provide similar accuracy. The AP values provided in Table 1 confirm that we are able to train up to a batch size of 2048 with no loss in accuracy.

TABLE 1. The bottom two rows highlight the importance of warmup and scaling the learning rate when using large batch sizes.
We hypothesize that the increase of 0.6 AP over the original SimpleBaseline implementation (AP of 70.4) was due to training for longer (200 epochs versus 140). Additionally, we demonstrate the importance of warmup and learning rate scaling. When training at the maximum batch size, removing warmup resulted in a loss of 1.3 AP, and removing learning rate scaling resulted in a loss of 0.7 AP. While preprocessing the data "online" on the TPU host CPU provides flexibility for training using different input resolutions and data augmentation, it ultimately causes a bottleneck in the input pipeline. This is evidenced by the training times in Table 1, which decreased after increasing the batch size to 512, but leveled off at around 5.3 hours using batch sizes of 512 or greater. We expect that the training time could be reduced substantially if preprocessing and augmentation were included in the TFRecord dataset, or if the TPU host CPU had greater processing capabilities. It is also noted that training these models for 140 epochs instead of 200, as in the original implementation [14], reduces the training time to 3.7 hours. Bypassing validation after every epoch speeds up training further. For comparison, training a model of similar size on eight NVIDIA TITAN Xp GPUs takes approximately 1.5 days [23].

C. NEUROEVOLUTION
The neuroevolution described in Section IV was run under various settings on an 8-CPU, 40 GB memory virtual machine that called on eight v2-8 Cloud TPUs to train several generations of 2D human pose networks. The COCO training and validation sets were used for network training and fitness evaluation, respectively. The input resolution used was 256x192, and the target number of parameters T was set to 5M. Other settings, including Γ, λ, and µ, are provided in the legend of Fig. 5 (top). ImageNet pretraining was exploited by seeding the common ancestor network using the same inverted residual blocks as used in EfficientNet-B0 [32]. The ancestor network was trained for 30 epochs, and we utilize the proposed weight transfer scheme to quickly fine-tune the mutated child networks over just 5 epochs. A batch size of 512 was used to provide near-optimal training efficiency (as per results in previous section) and prevent memory exhaustion mid-search. No learning rate warmup was used during neuroevolution, and the only data augmentation used was horizontal flipping. All other training details are the same as in Section V-B1.

Fig. 5 (top) shows the convergence of fitness for three independent neuroevolutions E1, E2, and E3, which had runtimes of 1.5, 0.8, and 1.1 days, respectively. The gap between the fitness (solid line) and validation loss (dashed line) was larger in E2 and E3 compared to E1, indicating that smaller networks were favored more as a result of decreasing Γ. After increasing the number of children from 32 in E2 to 64 in E3, it became apparent that using fewer children may provide faster convergence, but may also cause the fitness to converge to a local minimum. Fig. 5 (bottom) plots the validation loss against the number of parameters for all sampled networks. The prominent Pareto frontier near the bottom-left of the figure provides confidence that the search space was thoroughly explored.
To explicitly demonstrate the benefit of our proposed weight transfer scheme, E3 was run without weight transfer following the same training schedule. As shown in Fig. 5 (top), the fitness never decreased below that of the ancestor network. It stands to reason that the child networks would need to be trained at least as long as the ancestor network (30 epochs in this case) to achieve the same level of convergence without using the proposed weight transfer scheme. As a result, the neuroevolution runtime would increase six-fold. The network with the lowest fitness from neuroevolution E3 was selected as the baseline network, which we refer to as EvoPose2D-S. Its architectural details are provided in Table 2. Notably, the overall stride of EvoPose2D-S is less than what is typically seen in hand-designed 2D human pose networks. The lowest spatial resolution observed in the network is 1/16 the input size, compared to 1/32 in SimpleBaseline [14] and HRNet [24]. As a result, the output heatmap is twice as large.
The baseline network was scaled to various levels of computational expense. A lighter version (EvoPose2D-XS) was created by increasing the stride in Module 6, which halved the number of FLOPs. Using the compound scaling method described in Section IV, EvoPose2D-S was scaled to an input resolution of 384x288 (EvoPose2D-M), which is currently the highest resolution used in top-down 2D human pose estimation. We push the boundaries of 2D human pose estimation by scaling to an input resolution of 512x384 (EvoPose2D-L). Even at this high spatial resolution, EvoPose2D-L has roughly half the FLOPs compared to the largest version of HRNet. The scaling parameters for EvoPose2D-M/L are provided in Appendix A-D.

2) Comparisons with the State-of-the-Art
Microsoft COCO. To directly compare EvoPose2D with the best methods in the literature, we re-implement SimpleBaseline ResNet-50 (SB-R50) and HRNet-W32 as per the implementation described in Section V-B1. In our implementation of HRNet, we use a strided transpose pointwise convolution in place of a pointwise convolution followed by nearest-neighbour upsampling. This modification was required to make the model TPU-compatible and did not change the number of parameters or FLOPs. The accuracy of our implementation is verified against the original in Table 3.
Comparing EvoPose2D-S with our SB-R50 implementation without ImageNet pretraining, we find that EvoPose2D-S provides comparable accuracy on the COCO validation set (see Table 3) but is 50% faster and 12.7x smaller. We also compare EvoPose2D-S to a baseline that stacks the EvoPose2D network head on top of EfficientNet-B0 [32], and find that while EvoPose2D-S is 20% slower due to its decreased stride, its AP is 1.8 points higher and it is 2.2x smaller. Compared to our HRNet-W32 (256x192) implementation, we observe that EvoPose2D-M is more accurate by 1.5 AP while being 23% faster and 3.9x smaller.
Despite not using ImageNet pretraining, EvoPose2D-L achieves state-of-the-art AP on the COCO validation set (with and without PoseFix [51]) while being 4.3x smaller than HRNet-W48. Since EvoPose2D was designed using the COCO validation data, it is especially important to perform evaluation on the COCO test-dev set. We therefore show in Table 3 that EvoPose2D-L also achieves state-of-the-art accuracy on the test-dev set, again without ImageNet pretraining.
PoseTrack. For evaluation on PoseTrack, the networks were initialized with the weights pretrained on COCO and fine-tuned on the PoseTrack 2018 training set. All training details are consistent with Section V-B1, except that fine-tuning was run for 20 epochs with early stopping. As shown in Table 3, the relative performance of EvoPose2D compared to the state-of-the-art is consistent with the COCO results: EvoPose2D-S and EvoPose2D-M provide higher accuracy than SB-R50 and HRNet-W32, respectively, despite having fewer parameters, fewer FLOPs, and faster inference.

VI. CONCLUSION
We propose a simple yet effective weight transfer scheme and use it, in conjunction with large-batch training, to accelerate the neuroevolution of efficient 2D human pose networks. To the best of our knowledge, this is the first application of neuroevolution to 2D human pose estimation. We additionally provide supporting experiments demonstrating that 2D human pose networks can be trained using a batch size of up to 2048 with no loss in accuracy. We exploit large-batch training and the proposed weight transfer scheme to evolve a lightweight 2D human pose network design geared towards mobile deployment. When scaled to higher input resolutions, the EvoPose2D network designed using neuroevolution proved to be more accurate than the best-performing 2D human pose estimation models in the literature while having a lower computational cost.

A. SEARCH SPACE DETAILS
A diagram of the hierarchical backbone search space is shown in Fig. 3. For each module, we search for the optimal number of blocks, kernel size, output channels, and stride (last three modules only). Table 4 shows the configuration of the common ancestor network used in our neuroevolution experiments. The kernel size options were 3x3 and 5x5. The maximum number of blocks was set to four. The maximum number of output channels was set to the values in the common ancestor network (see rightmost column of Table 4). Table 2 shows the architecture of EvoPose2D-S, the network with the best fitness in neuroevolution E3 (see Section V-C). The optimal number of output channels in the first five modules was at the upper bound, so it is possible that better results might be obtained if these limits were increased.
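Under these constraints, the search space can be represented as a small integer array per network, with one row per module. A sketch of this representation follows; the ancestor values shown are illustrative, not the Table 4 configuration.

```python
import numpy as np

# Genotype: a 7x4 integer array with one row per module and genes
# (# blocks, kernel size, output channels / 8, stride). The values
# below are illustrative, not the Table 4 ancestor configuration.
ancestor = np.array([
    # blocks  kernel  channels/8  stride
    [1, 3,  4, 2],
    [2, 3,  6, 2],
    [2, 5, 10, 2],
    [3, 3, 20, 2],
    [3, 5, 28, 1],
    [4, 5, 48, 2],
    [1, 3, 80, 1],
], dtype=int)

def decode(genotype):
    """Expand a genotype into a list of per-module backbone specs."""
    return [dict(blocks=int(b), kernel=int(k), channels=int(c) * 8,
                 stride=int(s))
            for b, k, c, s in genotype]
```

Storing output channels divided by 8 keeps every gene a small integer and makes the +/- 8 channel mutation a unit step in genotype space.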

B. MUTATION DETAILS
The sampled architectures were encoded into 7x4 integer arrays (# blocks, kernel size, output channels / 8, and stride, for each module), which we refer to as the genotype. The mutations used included increasing/decreasing the number of blocks by 1, changing the kernel size, increasing/decreasing the stride by 1, and increasing/decreasing the number of output channels by 8. During neuroevolution, the genotypes were cached to ensure that no genotype was sampled twice. The mutation function is provided in Algorithm 1.

Algorithm 1: Mutation
Input: parent genotype gp, ancestor genotype ga, genotype cache G
Output: mutated child genotype gc
gc ← gp
while gc in G or gc = gp do
    gc ← gp
    i, j ← randint(7)
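A minimal Python sketch of this mutation procedure is given below. The per-gene update rules and bounds are assumptions based on the description above (mutations of +/- 1 block, kernel flip, +/- 8 channels, +/- 1 stride, with caching to avoid resampling), not the released implementation.

```python
import random

def mutate(parent, ancestor, cache, max_blocks=4, kernels=(3, 5)):
    """Perturb one gene of one module until an unseen child genotype
    is produced. Genotypes are 7x4 lists of (# blocks, kernel size,
    channels / 8, stride); the ancestor bounds the channel counts."""
    while True:
        child = [row[:] for row in parent]   # start from the parent
        i = random.randrange(len(child))     # module index
        j = random.randrange(4)              # gene index
        if j == 0:    # number of blocks +/- 1, clipped to [1, max_blocks]
            child[i][0] = min(max_blocks,
                              max(1, child[i][0] + random.choice((-1, 1))))
        elif j == 1:  # flip kernel size between the two allowed options
            child[i][1] = kernels[1] if child[i][1] == kernels[0] else kernels[0]
        elif j == 2:  # output channels +/- 8 (gene stores channels / 8),
                      # upper-bounded by the ancestor's channel count
            child[i][2] = min(ancestor[i][2],
                              max(1, child[i][2] + random.choice((-1, 1))))
        elif i >= len(child) - 3:  # stride +/- 1, last three modules only
            child[i][3] = min(2, max(1, child[i][3] + random.choice((-1, 1))))
        key = tuple(map(tuple, child))
        if child != parent and key not in cache:
            cache.add(key)
            return child
```

The cache membership test mirrors the loop condition in Algorithm 1: a mutation is rejected both when it leaves the genotype unchanged and when it reproduces a previously sampled architecture.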

C. WEIGHT TRANSFER DETAILS
The child network architectures were decoded from the mutated child genotypes, and all weights in the child networks were randomly initialized. Then, the weight transfer scheme described in Section III was used to transfer the trained weights from the parents to the children. For batch normalization layers, the non-transferred weights were initialized with the means of the parent. When a new block was added as a result of a mutation, the weights from the parent's last block were transferred to the new block in the child.
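A minimal numpy sketch of this transfer is shown below, assuming weights are matched by their overlapping index ranges. This is a simplification of the scheme in Section III, not the released code; the initialization scale is an arbitrary placeholder.

```python
import numpy as np

def transfer_conv(parent_w, child_shape, seed=0):
    """Copy the overlapping slice of a parent conv kernel into a child
    kernel of a (possibly) different shape; the remaining entries keep
    their random initialization."""
    child_w = np.random.default_rng(seed).normal(0.0, 0.01, size=child_shape)
    overlap = tuple(slice(0, min(p, c))
                    for p, c in zip(parent_w.shape, child_shape))
    child_w[overlap] = parent_w[overlap]
    return child_w

def transfer_bn(parent_gamma, child_channels):
    """Transferred batch-norm channels copy the parent; non-transferred
    channels are initialized with the mean of the parent's weights."""
    child_gamma = np.full(child_channels, parent_gamma.mean())
    n = min(parent_gamma.size, child_channels)
    child_gamma[:n] = parent_gamma[:n]
    return child_gamma
```

When a child layer is wider than its parent, only the extra channels start from scratch; when it is narrower, the parent's leading channels are kept, so most of the learned representation survives the mutation.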

D. SCALING COEFFICIENTS
The scaling coefficients used for EvoPose2D-M and EvoPose2D-L are provided in Table 5.

He has authored over 560 refereed journal and conference papers and patents in various fields, such as computational imaging, artificial intelligence, computer vision, graphics, image processing, and multimedia systems. His research interests focus on integrative biomedical imaging systems design, operational artificial intelligence, and scalable and explainable deep learning. He has received a number of awards, including two Outstanding Performance Awards and the Distinguished Performance Award.

PROFESSOR JOHN MCPHEE is the Canada Research Chair in System Dynamics at the University of Waterloo, Canada, which he joined in 1992. Prior to that, he held fellowships at Queen's University, Canada, and the Université de Liège, Belgium.
He pioneered the use of linear graph theory and symbolic computing to create real-time models and model-based controllers for multi-domain dynamic systems, with applications ranging from autonomous vehicles to rehabilitation robots and sports engineering. His research algorithms are a core component of the widely-used MapleSim modelling software, and his work appears in more than 160 journal publications.
Prof. McPhee is the past Chair of the International Association for Multibody System Dynamics, a co-founder of 2 international journals and 3 technical committees, a member of the Golf Digest Technical Panel, and an Associate Editor for 5 journals. He is a Fellow of the Canadian Academy of Engineering, the American and Canadian Societies of Mechanical Engineers, and the Engineering Institute of Canada. He has won 8 Best Paper Awards and, in 2014, he received the prestigious NSERC Synergy Award from the Governor-General of Canada.