Center point to pose: Multiple views 3D human pose estimation for multi-person
Fig 1
Our CTP network avoids the complex cross-view matching task.
(a) We estimate 2D keypoint heatmaps from all views. (b) All 2D keypoint heatmaps are projected into a common 3D space, which is voxelized into regular grids. (c) After convolution by the front layers of the backbone, we obtain preliminary 3D feature maps. (d) The 3D feature maps are transformed into 2D feature maps and passed into a 2D CNN, which generates the center point of each person in the top view. (e) The 3D bounding box is regressed. (f) The 3D bounding box is voxelized into finer grids for accurate 3D pose estimation. (g) The network outputs the estimated 3D poses.
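Step (b), projecting per-view 2D heatmaps into a common voxel grid, can be sketched as follows. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation: the function name, nearest-neighbour sampling (instead of bilinear), and simple averaging across views are all illustrative choices, and points are assumed to lie in front of every camera.

```python
import numpy as np

def project_heatmaps_to_voxels(heatmaps, proj_mats, grid_min, grid_max, grid_size):
    """Accumulate 2D heatmap responses into a 3D voxel grid.

    heatmaps:  (V, H, W) one keypoint channel per view (illustrative)
    proj_mats: (V, 3, 4) camera projection matrices
    grid_min/grid_max: world-space corners of the voxelized volume
    grid_size: number of voxels per axis, e.g. (3, 3, 3)
    """
    # Voxel centers on a regular grid in world coordinates.
    xs = np.linspace(grid_min[0], grid_max[0], grid_size[0])
    ys = np.linspace(grid_min[1], grid_max[1], grid_size[1])
    zs = np.linspace(grid_min[2], grid_max[2], grid_size[2])
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    pts = np.stack([gx, gy, gz, np.ones_like(gx)], axis=-1).reshape(-1, 4)

    volume = np.zeros(len(pts))
    for hm, P in zip(heatmaps, proj_mats):
        # Project voxel centers into the image plane of this view.
        uvw = pts @ P.T                 # (N, 3) homogeneous image coords
        u = uvw[:, 0] / uvw[:, 2]
        v = uvw[:, 1] / uvw[:, 2]
        # Nearest-neighbour sampling for brevity; a real pipeline
        # would typically use bilinear interpolation.
        ui = np.round(u).astype(int)
        vi = np.round(v).astype(int)
        inside = (ui >= 0) & (ui < hm.shape[1]) & (vi >= 0) & (vi < hm.shape[0])
        vals = np.zeros(len(pts))
        vals[inside] = hm[vi[inside], ui[inside]]
        volume += vals
    # Average the responses over all views.
    volume /= len(heatmaps)
    return volume.reshape(grid_size)
```

With a single view, every voxel along the ray through a heatmap peak receives the peak value, which is why fusing multiple views (and the later finer voxelization of step (f)) is needed to resolve depth.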