SPA: Sparse Photorealistic Animation Using a Single RGB-D Camera

Photorealistic animation is a desirable technique for computer games and movie production. We propose a new method to synthesize plausible videos of human actors with new motions using a single cheap RGB-D camera. A small database is captured once in a usual office environment and is then reused for synthesizing different motions. We propose a marker-less performance capture method using sparse deformation to obtain the geometry and pose of the actor for each time instance in the database. Then, we synthesize an animation video of the actor performing the new motion that is defined by the user. An adaptive model-guided texture synthesis method based on weighted low-rank matrix completion is proposed to be less sensitive to noise and outliers, which enables us to easily create photorealistic animation videos with new motions that are different from the motions in the database. Experimental results on a public data set and our captured data set verify the effectiveness of the proposed method.


I. INTRODUCTION
PHOTOREALISTIC animation aims to create a plausible photorealistic video of an actor performing a new motion based on a database, which is highly desirable for both computer games and movie production [1]-[3]. On the one hand, rendered results of fully animated human characters are not realistic; on the other hand, it is difficult to synthesize new motions from captured videos. Video texture methods [4], [5] attempt to generate videos with new motions by rearranging subsequences, but it is difficult to create truly new motions in this way. Recently, some methods [6], [7] set up a multicamera system to better synthesize videos with new motions. They achieve promising results with the help of multiview information. However, systems of this kind are expensive, difficult to maintain, and need many manual operations. Moreover, they need the actor to clench their hands. In this paper, we aim to achieve photorealistic animation using a single cheap RGB-D camera by capturing the database in a usual office environment. Our system is nonintrusive and easy to set up.
The input for photorealistic animation contains several videos with various basic motions, while the output is a plausible video corresponding to a new motion defined by users. Creating a new video with new motions from limited existing motions is essentially an ill-posed problem, because the limited available textures of motions do not contain complete information to reconstruct the textures for the new motion. Moreover, the appearance of actors (e.g., folds and wrinkles) continually changes as they perform different motions. The key to generating appealing textures is to impose proper priors to make the problem well posed. Recently, sparse representation has shown great power in regularizing ill-posed estimation problems, such as surface reconstruction [8], [9], 3D shape denoising [10], and depth enhancement [11].
Data-driven photorealistic animation methods can effectively use knowledge from the database. By using sparse priors, we can reduce the dependency on high-quality or complete input, which makes it possible to use a single RGB-D camera for photorealistic animation. In our work, we propose a new method to synthesize photorealistic videos of human actors with user-defined motions based on a small database captured by a single RGB-D camera in a usual office environment, allowing a small change of viewpoint. We present a sparse deformation optimization method to make the marker-less performance capture less affected by noise and outliers, which gives a pivotal constraint for video synthesis. In addition, we address the video synthesis problem by adaptive weighted low-rank matrix completion. Our method offers the following advantages.
1) Cheap Nonintrusive System: Our method achieves photorealistic animation using only a single cheap RGB-D camera.
2) Less Requirement: Our method does not require the actor to wear skin-tight garments, attach markers, or clench the hands.
3) Accuracy: By using sparse priors, our method can recover the textures with high accuracy from a very small data set.
4) Robustness: By using sparse representation, our method is less sensitive to noise and outliers than previous methods.
We demonstrate the power and effectiveness of our method on a public data set and our captured data sets. Our method achieves plausible animation videos for all the new motions.
The main contributions of this paper are as follows.
1) Photorealistic Animation Using a Single Cheap RGB-D Camera in a Usual Environment: We capture the database using a Kinect v2.0 camera in a usual office environment, and generate a plausible photorealistic animation video using the proposed method.
2) Marker-Less Performance Capture Method: To obtain results that are more consistent with the real motion, we propose a two-step marker-less performance capture method. A multipriority inverse kinematics method and the skinning method are first used to get an initial mesh for each frame, and then a sparse deformation method is adopted to generate more accurate meshes.
3) Adaptive Sparse Texture Synthesis Method: To recover the textures from a limited data set with different motions, we propose an adaptive weighted low-rank matrix completion method. Through this method, not only is a frame that has a high percentage of missing data recovered, but the noise and outliers in the initially estimated image are also reduced.
The remainder of this paper is structured as follows. Section II provides a brief review of related work in the field of animation. A system overview of the proposed method is given in Section III. The technical details of database setup, retrieval, and video synthesis are described in Sections IV-VI, respectively. Validation experiments and results are presented in Section VII, and this paper is concluded in Section VIII.

II. RELATED WORK
This section provides a brief review of related work in the field of animation. Animation approaches are mainly divided into three categories: 1) skeleton-based methods; 2) model-based methods; and 3) image-based methods. The most common skeleton-based method is linear blend skinning (LBS) [12], but this method produces poor results around joints under twisting. To overcome this problem, Wang and Phillips [13] and Merry et al. [14] compute different weights from different poses of models. Mohr and Gleicher [15] generate animation by introducing some dummy joints. Kavan et al. [16] replace the classic LBS method by a dual quaternion skinning method. However, the results of these methods look unrealistic.
Model-based methods can generate more realistic animations by building a database that has some 3D model samples with high accuracy and strong sense of reality. De Aguiar et al. [17] use principal component analysis to reduce the dimensions of human body models and clothing models, and then learn their relationship by linear regression to generate new models with target motions. Wang et al. [18] analyze surface vertices for different body parts, and choose different samples to synthesize the 3D model with a new motion. However, these kinds of methods usually demand very high computational complexities, and need to sacrifice some accuracy and authenticity for reasonable computational complexity.
Image-based methods aim at generating realistic textures and have low computational complexity. The video texture method [4] is an early work in this field, which analyzes a video clip to extract its structure and creates a new video by rearranging the frames. As an extension, video sprites [19] are proposed to animate moving objects. This approach finds good frame arrangements based on repeated partial replacements of the sequence, which allows the user to specify animations using a flexible cost function.
It is challenging to use the above methods to generate new videos for moving humans without any knowledge of 3D shapes and poses. Celly and Zordan [20] preliminarily solve the problem by identifying transition regions with human-specific feature extraction and performing an image-based warping afterward. Flagg et al. [5] adopt marker-based motion capture and construct a video-based motion graph to synthesize a new video.
All the above methods generate a new video by rearranging the subsequences with specific motions, and no novel motions can be created. Starck et al. [21] use 3D video sequences by combining image-based reconstruction and video-based animation to allow controlled animation of human body from captured multiview video sequences. The blended video sequences are constructed offline and represented as a motion graph for interactive animation. Similarly, Huang et al. [22] synthesize novel 3D video sequences by finding the optimal path in the motion graph between user-specified key-frames for control of movement, location, and timing. Casas et al. [6] introduce a new representation, 4D video texture, for rendering photorealistic animations from a multiview multimotion database. However, the generated new sequences are restricted to the space of basic motions in the database, and texture synthesis is not considered. Xu et al. [7] set up a multicamera system with 12 cameras and achieve realistic video synthesis of a novel target motion using a model-guided image warping method. However, this method relies on an expensive multicamera system, and requires the actor to clench hands. When the query motion is different from the basic database motions and less view information is available, there are obvious blurring artifacts on the recovered textures. In contrast, with sparse representation, our approach only uses a single cheap camera and achieves plausible animation videos without such artifacts.
The Microsoft Kinect camera has been widely used due to its low cost and multisensing capability [23]-[25]. Version 2.0, sold since last year, offers higher accuracy in color, depth, and skeleton tracking. In this paper, we use a single Kinect v2.0 camera to capture a multimotion database. A multipriority inverse kinematics method is used to ensure the topology consistency of Kinect skeletons and a sparse deformation method is proposed to ensure the accuracy of mesh deformation. For synthesizing textures, instead of using existing projective texturing and blending methods [26], [27], which tend to produce texture ghosting, we propose a new adaptive model-guided texture synthesis method based on weighted low-rank matrix completion.

III. SYSTEM OVERVIEW
Fig. 1 illustrates the workflow of our sparse photorealistic animation approach. The input to our system is a skeleton sequence and a rigged surface mesh of an actor. The output is a synthesized photorealistic animation video with the input motion. To make this possible, we take a data-driven approach. A small database is first set up offline, which contains videos and meshes with skeletons performing various basic motions (Section IV). Then, appropriate images are retrieved from the database according to the input skeleton sequence (Section V). Finally, a photorealistic animation video is synthesized based on adaptive weighted low-rank matrix completion using the retrieved images (Section VI).
Database: We use a Kinect camera to capture RGB-D images of a character performing various basic motions, and obtain a 3D scanned template mesh of the character using the Kinect Fusion technique [28]. Then, we manually embed a skeleton into the template mesh, compute the skinning weights [29], and segment the body of the mesh into 16 parts according to the maximum skinning weight of each vertex. Finally, we compute a 3D skeleton and a 3D surface mesh for each frame of the database. Therefore, the database contains a color image sequence, a skeleton sequence, and a mesh sequence for each motion. The database needs to be built only once for generating new motions of the same character.
Query: A query sequence is a user-defined skeleton sequence that specifies the actor motion in the target video, which uses the same template mesh model and skeleton as the database. Arbitrary animation tools can be used to define the whole animation sequence. Alternatively, it is also possible to use motion retargeting techniques [30] to apply motion capture data from existing databases to the given skeleton.
Retrieval: A candidate image is retrieved from the database according to the spatiotemporal similarity for each frame of target motion. To achieve high performance retrieval, we consider the similarity of motion and the completeness of sequence in our scheme.
Synthesis: We synthesize the target sequence using the retrieved frames based on adaptive weighted low-rank matrix completion, which can recover the matrix that has high percentages of missing data and can also reduce the noise and outliers in the known elements.

IV. DATABASE SETUP
We capture the color images, depth images, and skeletons for a database using a single Kinect v2.0 camera, and then we compute and optimize the skeletons and dynamic meshes using the proposed method.

A. Acquisition
We capture an actor (actress) performing various basic motions using a Kinect v2.0 camera. The basic motions we used are walking with flexed legs, running, marching, waving, stretching, front-kicking, side-kicking, salute, and kungfu. Each depth image is transformed into a depth mesh and segmented: a RANSAC algorithm is used to remove the floor, and a bounding box of the actor is used to remove the background. The skeleton is identified using a pose recognition method [31] at each time instance.
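The floor and background removal described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the depth mesh is given as an (N, 3) point array, that the floor is the dominant plane in the scene, and that the actor's bounding box is known. The function names, the 0.02 inlier threshold, and the iteration count are hypothetical.

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.02, seed=0):
    """Fit a plane n.x + d = 0 to an (N, 3) array with RANSAC.
    Returns (unit normal, offset d, boolean inlier mask)."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    best_plane = (np.array([0.0, 1.0, 0.0]), 0.0)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:          # degenerate (collinear) sample, skip it
            continue
        normal = normal / norm
        d = -normal @ sample[0]
        mask = np.abs(points @ normal + d) < threshold
        if mask.sum() > best_mask.sum():
            best_mask, best_plane = mask, (normal, d)
    return best_plane[0], best_plane[1], best_mask

def remove_floor_and_background(points, box_min, box_max, **kwargs):
    """Drop the dominant plane (assumed to be the floor) and everything
    outside the axis-aligned bounding box of the actor."""
    _, _, floor = ransac_plane(points, **kwargs)
    in_box = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~floor & in_box]
```

In a real capture the inlier threshold would need to match the depth noise of the sensor, and the bounding box could be derived from the tracked skeleton.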
In addition, we obtain an untextured 3D scanned mesh of the actor using the Kinect camera by capturing 3D point clouds in a circle around the actor and merging them using the Kinect Fusion technique [28]. The Kinect Fusion method only works for rigid objects, so the actor being scanned needs to stand still while another person moves the Kinect camera to scan from all directions (360°). The underlying skeleton of the mesh is generated by manually marking the joint positions. We calibrate the coordinate system relationship between the static template mesh and the captured Kinect depths by computing the rigid transformation from four skeleton joints (spine base, spine center, left hip, and right hip) of the skeleton of the template mesh and the skeleton of the first Kinect frame.
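The paper does not state how the rigid transformation is computed from the four joint pairs. One common choice, shown here purely as an assumption, is the least-squares (Kabsch) fit; `rigid_transform` is a hypothetical name.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that
    dst ~= src @ R.T + t, for (n, 3) arrays of matched joints (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Here `src` would hold the four template-skeleton joints and `dst` the corresponding joints of the first Kinect frame. The fit requires the joints not to be collinear, which holds for the spine and hip joints of a standing person.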

B. Marker-Less Performance Capture
With the 3D template mesh and the captured depth and skeleton sequences, we present a skeleton-based marker-less performance capture method to obtain a new skeleton sequence and a dynamic mesh sequence for each motion. Because the skeletons captured by Kinect do not maintain topological consistency, we use a multipriority inverse kinematics method [32] to obtain pose parameters and new skeletons with a consistent topology. The skinning weights, which describe the association of each vertex with each bone, are automatically calculated for each vertex [29]. With these weights, an LBS method is adopted to deform the template mesh using the calculated pose parameters. Then, we optimize the deformed meshes against the depths based on sparse representation to make the deformed mesh consistent with the data captured by Kinect.
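As a reminder of what the LBS step computes, here is a minimal sketch: each vertex is transformed by every bone and the results are blended with the skinning weights. The array layout and function names are assumptions for illustration. The argmax helper mirrors the body-part segmentation by maximum skinning weight mentioned in Section III.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """vertices: (N, 3); weights: (N, B) with rows summing to 1;
    bone_transforms: (B, 4, 4) homogeneous bone matrices.
    Returns the deformed (N, 3) vertices."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])     # (N, 4)
    per_bone = np.einsum('bij,nj->nbi', bone_transforms, homo)    # (N, B, 4)
    blended = np.einsum('nb,nbi->ni', weights, per_bone)          # (N, 4)
    return blended[:, :3]

def part_labels(weights):
    """Assign each vertex to the bone with the largest skinning weight."""
    return weights.argmax(axis=1)
```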
Let us define the deformed mesh by the LBS method as $M_s$ and the vertices of $M_s$ as $P = \{p_1, \ldots, p_N\}$, where $p_i = [x_i, y_i, z_i, 1]^\top$ and $N$ is the number of vertices. Similarly, we denote the vertices of the depth mesh $M_t$ by $Q = \{q_1, \ldots, q_M\}$. We compute the visible vertices of the deformed mesh at the viewpoint of the depth camera and find the closest correspondences on the depth mesh. Specifically, we find the closest point for each vertex of the deformed mesh and calculate the distance between this vertex and its closest point. Then, we choose the ones whose distances are smaller than the median as the correspondences. Denote the correspondence point of $p_i \in P$ by $q_{f(i)} \in Q$, where $f: \{1, \ldots, N\} \rightarrow \{1, \ldots, M\}$ represents the index mapping.
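The closest-point search with median rejection can be sketched as below. The brute-force distance matrix is an assumption made for brevity (a spatial index such as a k-d tree would be the usual choice for full meshes), and the function name is hypothetical. The returned 0/1 vector plays the role of a per-vertex indicator of whether a correspondence was accepted.

```python
import numpy as np

def closest_correspondences(P, Q):
    """P: (N, 3) visible vertices of the deformed mesh; Q: (M, 3) depth-mesh
    vertices. Returns f (index of the closest q for each p) and a 0/1 vector
    that keeps only pairs whose distance is below the median distance."""
    dists = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)   # (N, M)
    f = dists.argmin(axis=1)
    closest = dists[np.arange(len(P)), f]
    accepted = (closest < np.median(closest)).astype(float)
    return f, accepted
```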
Given the correspondence mapping $f$ [33], we compute a $3 \times 4$ transformation matrix $T_i$ for each vertex of $M_s$ by minimizing the energy function (1), in which $N_i$ represents the one-ring neighborhood of vertex $p_i$ (the vertices connected to it by edges) and $\|\cdot\|_1$ represents the $\ell_1$ norm of a matrix. The weight $\pi_{ij} = \exp(-H^2/\sigma^2)$ allows larger nonrigid transformations around curved regions, where $H$ is the mean curvature and $\sigma$ is a constant. The weight $\omega_i$ is set at zero if $p_i$ does not have a corresponding point in $Q$, and set at one otherwise. The first term reflects the fidelity of the estimated transformations and ensures the accuracy of the reconstruction, while the second term, as regularization, ensures the smoothness of the transformations.
We further define a differential matrix $L \in \{-1, 1\}^{G \times N}$, with $G$ representing the number of edges of $M_s$. Each row of $L$ corresponds to an edge of $M_s$, and each column of $L$ corresponds to a vertex of $M_s$. For the $r$th edge that connects vertex $p_i$ and vertex $p_j$, we have $L_{r,i} = 1$ and $L_{r,j} = -1$, and all other entries of that row are zero. Therefore, (1) can be rewritten in the matrix form (2), where $T$ is a $4N \times 3$ matrix formed by stacking the transformations $T_i$, the weight matrix has $\pi_{ij}$ as its elements, "•" represents element-wise multiplication of two matrices, $I_4$ is a $4 \times 4$ identity matrix, and $\otimes$ denotes the Kronecker product. We iteratively find the closest correspondences and solve (2) using the alternating direction method (ADM) [34] until convergence. In our experiments, we use 15 outer iterations for sparse deformation and 25 inner iterations for the ADM algorithm. Fig. 2 demonstrates the effectiveness of the proposed optimization method. The mesh before optimization is not very consistent with the real motion, while the mesh after optimization is very consistent with the real motion, which can be seen from the projection results.
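To make the role of $L$ and the Kronecker product concrete, the sketch below builds the edge difference matrix and shows that $(L \otimes I_4)\,T$ stacks the differences $T_i - T_j$ for every edge. It assumes $T$ stores each transformation as a transposed $4 \times 3$ block, and it uses dense matrices for readability (a sparse format would be used in practice). Function names are hypothetical.

```python
import numpy as np

def edge_difference_matrix(edges, n_vertices):
    """Row r has +1 at column i and -1 at column j for edge r = (i, j)."""
    L = np.zeros((len(edges), n_vertices))
    for r, (i, j) in enumerate(edges):
        L[r, i] = 1.0
        L[r, j] = -1.0
    return L

def stacked_differences(L, T):
    """T: (4N, 3), one 4 x 3 block per vertex. Returns a (4G, 3) matrix whose
    r-th 4 x 3 block is the difference of the blocks of edge r's endpoints."""
    return np.kron(L, np.eye(4)) @ T
```

The sum of absolute values of the returned matrix is then the unweighted $\ell_1$ smoothness penalty over all edges.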

V. RETRIEVAL
Similar to [7] but simpler, we use database retrieval to find the best matched candidate frames for video synthesis based on spatiotemporal consistency. In order to compare the query skeletons with the database skeletons, we register them to the same coordinate system by translating the spine-base joints of the database skeletons to the 3D position of the query's spine-base joint and rotating the database skeletons to face the same direction. We define the front viewpoint in the database as the reference viewpoint. Given a new query viewpoint, we compute the transformation matrix between the new viewpoint and the reference viewpoint, and transform the query skeleton using this matrix for later retrieval.
Then, we compute an energy for each frame of the data set and choose the $T$ frames ($T = 2$ in our experiments) with the lowest energy values as the candidates for video synthesis. Considering the spatiotemporal continuity, the energy function $E(F)$ is defined in terms of the following quantities: $F$ is the unknown candidate sequence, $N(F)$ is the number of frames in the candidate sequence $F$, $I(F)$ is the set of frame indices in the original sequence, and $R$ is the number of frames in the query sequence. $S_d$ is the vector of skeleton joints of a database sequence and $S_q$ is the vector of skeleton joints of the query sequence. The skeletal distance between $S_m$ and $S_n$, where $m$ and $n$ are $d$ or $q$, is defined in terms of $S^j$, the 3D position of the $j$th skeletal joint in the world coordinate system, $J$, the number of joints, and $\sigma_j$, the variance of the position of joint $j$ in the database. In (4), the first term is a spatial constraint, which ensures the similarity between the skeleton in the candidate sequence and that in the query sequence. The second term ensures that the amount of motion is similar between the candidate sequence and the query sequence, and the third term makes the subsequence taken from the same data set as long as possible. The last two terms impose temporal continuity, avoiding the jitter effect.
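The exact form of the skeletal distance is not recoverable from the text above, so the sketch below assumes one plausible form: the mean over joints of the squared Euclidean distance divided by the per-joint variance $\sigma_j$. It also ranks database frames by the spatial term only; the two temporal terms of the energy are omitted. Both function names are hypothetical.

```python
import numpy as np

def skeletal_distance(S_m, S_n, sigma):
    """S_m, S_n: (J, 3) joint positions after registration; sigma: (J,)
    per-joint variance in the database. Assumed variance-normalised form."""
    sq = np.sum((S_m - S_n) ** 2, axis=1)
    return float(np.mean(sq / sigma))

def retrieve_candidates(query, database, sigma, T=2):
    """Return indices and energies of the T database skeletons closest to
    the query (spatial term only, an illustrative simplification)."""
    energies = np.array([skeletal_distance(query, S, sigma) for S in database])
    order = np.argsort(energies)[:T]
    return order, energies[order]
```

Normalising by the variance keeps joints that move a lot in the database (e.g., hands) from dominating the distance.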

VI. VIDEO SYNTHESIS BASED ON ADAPTIVE MATRIX COMPLETION
In this section, we synthesize each target frame based on adaptive weighted low-rank matrix completion. Initial estimations are first obtained by warping the retrieved candidates with the help of 3D models. Then, the frames warped from the candidate that has the lowest energy value are optimized using an adaptive weighted matrix completion method to synthesize a frame of the animation video.
For our problem, no existing image completion method can be directly applied after the initial estimation, because video synthesis in our setting is challenging for the following reasons.
1) A large range of missing data needs to be repaired.
2) Many of the missing pixels are around the boundary (the junction of the foreground and the background), which is difficult to recover correctly.
3) The texture to be restored may be complex.
Therefore, we propose a video synthesis method based on weighted low-rank matrix completion and design an adaptive block size scheme to remove the block artifacts around the boundary.

A. Initial Estimation
With query skeletons, we can obtain query meshes by deforming the static template mesh using the multipriority inverse kinematics method and DQS skinning method mentioned in Section IV. Based on the assumption that the surface meshes in the database and query are generated by deforming the same static 3D model, i.e., the meshes in the database and query have the same connectivity but different poses, we warp the retrieved database frames (source frames) using the vertex correspondences based on moving least squares [35]. Specifically, we first segment the body of the query mesh into 16 parts according to the maximum skinning weight of each vertex, and hence obtain 16 segments in the target frame via projection. Then, we project the meshes of the database and the mesh of the query onto the source and target frames, respectively. Using the vertex correspondences as a guide, moving least squares is adopted to compute the corresponding pixels in the source frames for all the pixels in each body part of the target frame. For the pixels on the boundary of two body parts, we compute a weighted blend of the two warping results, each obtained by assuming that the pixel belongs to one of the two parts. In this way, we obtain an image with some missing regions for each retrieved candidate frame.
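A minimal sketch of a moving-least-squares warp for a single pixel is given below. It assumes the affine variant of moving least squares, which may differ from the variant used in [35]. Control points `p` are projected query-mesh vertices in the target frame, and `q` are the corresponding projected database-mesh vertices in the source frame, so the function maps a target pixel back to a source pixel. The name `mls_affine` and the parameters `alpha` and `eps` are hypothetical.

```python
import numpy as np

def mls_affine(v, p, q, alpha=1.0, eps=1e-8):
    """Affine moving-least-squares map of the 2D point v.
    p: (n, 2) control points in the target frame; q: (n, 2) matching points
    in the source frame. Nearby control points get larger weights."""
    w = 1.0 / (np.sum((p - v) ** 2, axis=1) ** alpha + eps)
    p_star = w @ p / w.sum()                  # weighted centroids
    q_star = w @ q / w.sum()
    p_hat, q_hat = p - p_star, q - q_star
    A = (p_hat * w[:, None]).T @ p_hat        # sum_i w_i p_hat_i^T p_hat_i
    B = (p_hat * w[:, None]).T @ q_hat        # sum_i w_i p_hat_i^T q_hat_i
    M = np.linalg.solve(A, B)                 # local 2 x 2 affine matrix
    return (v - p_star) @ M + q_star
```

Evaluating this for every pixel of a body part in the target frame, using only the control points of that part, gives the source pixel to sample. Pixels whose source location falls outside the visible body remain missing.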

B. Sparse Texture Reconstruction
The initial estimation generated by Section VI-A may contain missing pixels because not all the needed information is included in the retrieved database frame. In this step, we take the initial estimated image I from the candidate with the lowest retrieval energy E(F) from (5) as reference and interpolate the missing regions based on the adaptive weighted matrix completion. Fig. 3 shows the procedure of the proposed sparse texture reconstruction (STR) method.
Fig. 3. Procedure of the proposed STR method. 1 represents the highest priority and 0 is the lowest priority. The block centered at the pixel with the highest priority is illustrated with a pink border and enlarged for closer observation. This block and several searched similar blocks (shown with a cyan border) are arranged into a matrix, and then the matrix is recovered with a weighted low-rank matrix completion method. In this way, we get the reconstructed texture for the current block from the first column of the recovered matrix. The top-right image gives the reconstructed texture for the top-left image.
We generate a silhouette image $I_s$ by projecting the query mesh onto the target frame. Let us denote the known human body region in $I$ by $F$, the unknown human body region by $U$, and the background region by $G$. For each pixel $x_i$ at the boundary of the unknown region $U$, we get a block $B_i$ of size $m \times m$ centered at $x_i$ and compute its priority $P$ from three terms. The first term $w_i$ is set at 0 for the pixels around the background and 1 otherwise, which is judged by examining the difference in pixel values on $I_s$ between left and right pixels or top and bottom pixels. This guarantees that the texture recovery always uses the foreground pixels. The second term gives high priority to the pixels on edges, e.g., folds and wrinkles, which guarantees that the recovered texture has rich details. Here, $\nabla I_i^{\perp}$ is a vector orthogonal to the gradient at pixel $x_i$ with the same magnitude, and $n_i$ is a unit vector orthogonal to the boundary of the unknown region. The third term considers that the pixels close to the known human body region $F$ are more reliable than those far away. Instead of searching for the nearest pixel in $F$, we estimate the distance $d_j$ in a greedy manner over $N_j$, the eight-connected neighborhood of $x_j$. If $x_j$ lies in the original known region, we set $d_j = 0$. The parameter $\xi$ controls the decay rate of the exponential and is set at 3 in our experiments. Specifically, the parameter is tuned by the bisection method. We found that the optimal parameters for different data sets were quite close, and the performance around the optimal values was stable. Therefore, we use the same parameter setting for all the experiments in our paper.
After determining the priorities for all the pixels at the boundaries, we recover the textures beginning from the pixel with the highest priority. Let us denote $B_0$ as the block centered at the pixel with the highest priority. We search $K$ similar blocks within a range of $n \times n$ pixels on $I$ and the same locations on the other candidate images by computing the sum of the zero-mean normalized cross-correlation (ZNCC) on the silhouette image $I_s$ and the RGB channels of $I$. Suppose the pixel $(a, b)$ in image $I$ is the center of block $B_0$ and the pixel $(a + x, b + y)$ is the center of block $B_i$ with disparity $(x, y)$. The ZNCC between two blocks centered at the corresponding pixels is computed by
$$\mathrm{ZNCC}(x, y) = \frac{\sum_{u=-U}^{U} \sum_{v=-V}^{V} (R_{u,v} - \bar{R})(S_{x+u,y+v} - \bar{S}_{x,y})}{\sqrt{\sum_{u=-U}^{U} \sum_{v=-V}^{V} (R_{u,v} - \bar{R})^2 \, \sum_{u=-U}^{U} \sum_{v=-V}^{V} (S_{x+u,y+v} - \bar{S}_{x,y})^2}}$$
where $u$ and $v$ are pixel indices in the two $(2U+1) \times (2V+1)$ blocks. $R_{u,v}$ and $S_{x+u,y+v}$ represent the value of the pixel $(a + u, b + v)$ in block $B_0$ and the value of the pixel $(a + x + u, b + y + v)$ in block $B_i$, respectively. $\bar{R}$ and $\bar{S}_{x,y}$ are the mean values of the two $(2U + 1) \times (2V + 1)$ blocks. Note that we compute ZNCC using only the pixels in the known regions.
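The block similarity restricted to known pixels can be sketched as follows. The function name and the convention of returning 0 for empty or constant blocks are assumptions made for illustration.

```python
import numpy as np

def masked_zncc(R, S, known):
    """Zero-mean normalized cross-correlation between two equally sized
    blocks R and S, using only the pixels where `known` is True."""
    r = R[known].astype(float)
    s = S[known].astype(float)
    if r.size == 0:
        return 0.0
    r = r - r.mean()
    s = s - s.mean()
    denom = np.sqrt(np.sum(r ** 2) * np.sum(s ** 2))
    return float(np.sum(r * s) / denom) if denom > 0 else 0.0
```

Summing this score over the silhouette image and the three color channels, and keeping the $K$ highest-scoring blocks within the search range, gives the set of similar blocks.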
In our experiments, we set $U$ and $V$ at $(m - 1)/2$. We process the image for each channel of the RGB color space separately. Taking one channel for example, we vectorize the current block and all the similar blocks, and stack them into a matrix $D = [\mathrm{vec}(B_0), \mathrm{vec}(B_1), \ldots, \mathrm{vec}(B_K)]$. On the one hand, there is a strong correlation between the columns of $D$, and hence, the rank of $D$ should be very low. On the other hand, the known entries in $D$ are corrupted by a considerable amount of noise and outliers due to inconsistent texture between different candidate images, self-occlusion, and lighting. Therefore, we use an observation model with missing data and noise: $D = A + E$, where $A$ is the latent low-rank matrix to be recovered and $E$ represents the sparse noise and outliers in the known elements. So we formulate the problem as
$$\min_{A,E} \ \mathrm{rank}(A) + \lambda \|E\|_0 \quad \text{s.t.} \quad P_\Omega(D) = P_\Omega(A + E) \qquad (10)$$
where $\|\cdot\|_0$ represents the $\ell_0$ norm of the matrix (number of nonzero entries), $\lambda$ is the weighting factor, $\Omega$ is the index set of known elements, and $P_\Omega$ is a projection operator that projects the matrix onto the domain of $\Omega$. The optimization problem (10) is extremely difficult (NP-hard in general) to solve. So the rank and the $\ell_0$ norm are relaxed into the nuclear norm (sum of the singular values) and the $\ell_1$ norm, respectively. There may be some mismatching in the searched similar blocks. Hence, considering the different contributions of each similar block, we adopt the weighted matrix completion model (11), where $\|A\|_*$ is the nuclear norm of the matrix $A$, • represents element-wise multiplication of two matrices, and $W$ is a weighting matrix that accounts for the similarity of the searched blocks. Concretely, the weighting matrix is built from an all-one vector and $w := [w_0, w_1, w_2, \ldots, w_K] \in \mathbb{R}^{(K+1) \times 1}$, a vector containing the similarity of the searched blocks to the current block $B_0$. In this way, blocks that are more similar to the current block are assigned larger weights, and hence contribute more to the final restoration results.
We solve the weighted matrix completion problem (11) using the augmented Lagrangian method [36]. In the augmented Lagrangian function of minimization (11), $Y$ is the Lagrangian multiplier, $\langle \cdot, \cdot \rangle$ denotes the inner product of two matrices (defined in the same way as the inner product of two vectors), $\|\cdot\|_F$ denotes the matrix Frobenius norm, and $\mu > 0$ is a scalar variable to adjust the consistency of the recovered matrix to the observed values. We minimize the augmented Lagrangian function using the alternating direction method (ADM) [34]. We take the first column of the recovered matrix $A$ and reshape it into an $m \times m$ block as the recovered block of $B_0$.
Adaptive Block Size: The reconstruction algorithm processes the missing pixels progressively from interior pixels toward the boundary. As a result, the anchor block eventually steps across the boundary of the human body and contains background pixels, which would affect the search for similar blocks. We adaptively reduce the block size to handle this issue. The size of the anchor block is recursively reduced until it does not contain background pixels. Then, the missing pixels of the anchor blocks with reduced size are reconstructed in the same way as normal blocks. If the size of the anchor block is reduced to one, it is simply filled by the average of the available pixels in its four-connected neighborhood.
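The block-size adaptation can be sketched as below. The paper does not state the reduction step, so the sketch assumes that the half-width shrinks by one pixel at a time (keeping the block odd-sized and centered). The function names are hypothetical, and `silhouette` is a boolean image that is True on the human body.

```python
import numpy as np

def adaptive_block(silhouette, center, m=11):
    """Shrink an m x m block centered at `center` until it contains no
    background pixel. Returns (half_width, (row_slice, col_slice));
    half_width == 0 means the block has been reduced to a single pixel."""
    r, c = center
    half = (m - 1) // 2
    rows, cols = silhouette.shape
    while half > 0:
        r0, r1, c0, c1 = r - half, r + half + 1, c - half, c + half + 1
        inside = r0 >= 0 and c0 >= 0 and r1 <= rows and c1 <= cols
        if inside and silhouette[r0:r1, c0:c1].all():
            break
        half -= 1
    return half, (slice(r - half, r + half + 1), slice(c - half, c + half + 1))

def fill_single_pixel(image, known, center):
    """Average of the known pixels in the four-connected neighborhood."""
    r, c = center
    vals = [image[rr, cc]
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if 0 <= rr < image.shape[0] and 0 <= cc < image.shape[1]
            and known[rr, cc]]
    return float(np.mean(vals)) if vals else float(image[r, c])
```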
Since we synthesize the textures frame by frame, some jitter artifacts may occur when the recovered textures are temporally inconsistent. This can be easily improved by any temporal blending technique based on optical flow [7], [37].

VII. RESULTS
In this section, we evaluate the performances of the proposed method on a public data set (Section VII-A) and a real Kinect data set (Section VII-B), and some discussions are given in Section VII-C. We set the parameters as follows: block width m = 11, range width n = 100, and number of similar blocks K = 30.

A. Evaluation on Public Data Set
We generate a data set for evaluation using the multiview data set of the multiview video-based character (MVC) method [7] by keeping only one view for each motion. In order to compare with the method in [7], we use the same retrieval method and the same background for synthesizing the new video. Fig. 4 gives the comparison results for walking, punching, and turning motions. The regions highlighted by rectangles are enlarged and shown in the associated images for closer observation. It can be seen that the results of the MVC method have obvious blurring artifacts on the legs, while our method is free of this problem and provides significantly better results than the MVC method. This appealing property is attributed to the proposed adaptive weighted sparse reconstruction model (11). The constraint reflects the fidelity of the reconstructed matrix and ensures the accuracy of the reconstruction.
The objective takes the low-rankness and sparsity into account and ensures the smoothness of the reconstruction. By bridging the two terms with a penalization parameter and then minimizing it, not only are the unknown elements recovered, but the noise and outliers in the known elements are also reduced. Adaptive block-size selection ensures that the recovery is implemented on the necessary resolution: large enough to consider the global consistency and small enough to recover the local details.

B. Evaluation on Kinect Data Sets
Our method is further evaluated on the data sets captured with a Kinect RGB-D camera. We capture an actor (actress) with normal clothing performing nine basic motions using a Kinect v2.0 camera at 30 frames/s with a color image resolution of 1920 × 1080 pixels and a depth image resolution of 512 × 424 pixels. We calibrate the color camera and the depth camera using OpenCV (Open Source Computer Vision) library. All the data sets will be made publicly available at a project webpage.
1) Sparse Texture Reconstruction: We first assess the performance of the STR component (Section VI-B) that plays an important role in synthesizing photorealistic videos. In Fig. 5, we compare the STR method with three other methods: 1) exemplar-based inpainting method (EBIM) [38]; 2) the MVC method [7]; and 3) the STR method without adaptive block sizes (STRwoABS). In EBIM [38], the missing pixels of the current block are filled by the counterparts of the most similar block. EBIM considers only texture information in determining the order of inpainting, which is not applicable to our case with silhouette constraints. For fair comparison, we use the same priority calculation scheme as the proposed method. As shown in Fig. 5(a), the results generated by the EBIM method present obviously wrong textures around the silhouette of the human body. The reason for the artifacts is that the straightforward copy of similar blocks may introduce some wrong pixels, which are further propagated by the recursive inpainting procedure. The result of the MVC method is blurred. On the contrary, the proposed STR method is able to reconstruct correct textures thanks to the powerful recovery capability of the weighted low-rank matrix completion from incomplete observations. The result in Fig. 5(c) contains severe blocking artifacts around the boundaries of the human body while that in Fig. 5(d) is free of this issue, which demonstrates the effectiveness of the proposed block-size adaptation scheme.
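The block matching that both EBIM and a patch-based completion rely on can be sketched as follows. This is a simplified illustration, not the procedure of Section VI-B: it assumes a grayscale image, interprets the range width n as the side of a square search window centred on the reference block, compares blocks only on the known pixels of the reference, and accepts only fully known candidates. The function name `find_similar_blocks` and these matching rules are hypothetical.

```python
import numpy as np


def find_similar_blocks(image, known, top_left, m=11, n=100, K=30):
    """Return top-left corners of the K blocks most similar to a reference.

    image    : 2-D float array (grayscale, for simplicity).
    known    : boolean array of the same shape, True where a pixel is known.
    top_left : (row, col) of the m x m reference block.
    Candidates are fully known m x m blocks whose corner lies within
    n // 2 pixels of the reference corner (assumed meaning of n).
    Distance is the mean squared difference over the reference's
    known pixels.
    """
    height, width = image.shape
    r0, c0 = top_left
    ref = image[r0:r0 + m, c0:c0 + m]
    ref_known = known[r0:r0 + m, c0:c0 + m]
    if not ref_known.any():
        raise ValueError("reference block has no known pixels")

    half = n // 2
    candidates = []
    for r in range(max(0, r0 - half), min(height - m, r0 + half) + 1):
        for c in range(max(0, c0 - half), min(width - m, c0 + half) + 1):
            if (r, c) == (r0, c0):
                continue  # skip the reference block itself
            if not known[r:r + m, c:c + m].all():
                continue  # candidate must contain no missing pixels
            diff = (image[r:r + m, c:c + m] - ref)[ref_known]
            candidates.append((float(np.mean(diff ** 2)), (r, c)))

    # Smallest distance first; keep at most K matches.
    candidates.sort(key=lambda item: item[0])
    return [pos for _, pos in candidates[:K]]
```

An exemplar-based method would copy the missing pixels from the single best match, whereas a low-rank approach would stack the K matched blocks as columns of a matrix and complete that matrix jointly.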
The proposed STR method can also handle extreme cases with large missing regions. For demonstration, we synthesize video frames for a given query bow-pulling motion from only a single candidate image (the leftmost image in Fig. 6). The poses of the actress can deviate significantly from that in the single available image, e.g., the right arm raises and bends in the bow-pulling motion. In such cases, the initially estimated images present large missing areas, which are quite challenging to reconstruct as only a small fraction of the pixels is available. Fig. 6 shows ten frames of the synthesized video for the bow-pulling motion at different poses. The results show that the STR method is able to reconstruct visually appealing results even from a single candidate image.
2) Sparse Photorealistic Animation: Fig. 7 shows ten frames of the synthesized video for the query lateral raising motion. The results show that the poses of the actress in the synthesized frames are consistent with the poses of the query skeletons. The frames are reliably reconstructed according to the new motion, which is demonstrated by the completeness of the reconstructed textures, e.g., the logo texture on the T-shirt, the two white stripes on the sleeves, and the laces on the shoes. Fig. 8 presents five reconstructed frames for the query walking motion. In the candidate videos of the data sets, the actress does not clench her fists, which is different from the data sets in most previous photorealistic animation work, and it is difficult to reconstruct the hands. As observed in Fig. 8, our method is able to provide acceptable results, although it is difficult to reconstruct the fine details of the hands with flexible motions. Fig. 9 shows the results of another actor for the robotic walking motion at a new viewpoint. The synthesized frames are of the same quality as the results for the other data sets, although some small white holes caused by segmentation errors can be seen on the right leg of the actor in the last image. This suggests that our method has stable performance for various actors and query motions. Fig. 10 shows a composite of our synthesized video into a real scene. Four frames of the created video are given. The motion is designed by an animator and the background is captured by a commercial camera. Relighting techniques are not used. The results demonstrate that it is feasible to insert our video-based animations with new motions into real captured videos.
Fig. 9. Five frames of our created video for another actor at a new viewpoint. The inlays show the associated query skeleton used to create the target frame.
Fig. 10. Animation results of an actor that are created with our method from a small database captured by a single RGB-D camera. The motion is designed by an animator and the background is captured by a commercial camera. Relighting techniques are not used.

C. Discussion
Experimental results show that our method achieves plausible animation videos. The visual quality depends on the quality of the database and the similarity between new motion and database motions. Visual artifacts in either segmentation or performance capture will degrade the quality of video synthesis for new motion. Some segmentation errors will bring background color into the foreground region when reconstructing the textures, while errors in performance capture will also affect the accuracy of model-guided image deformation. The reasons for artifacts are as follows.
1) The segmentation of depth is not accurate enough without the use of chroma-key backgrounds. This is the main reason, because the depth boundary directly affects the boundary of the synthesized frame (please see Sections IV-B and VI-A). We use a common segmentation method [39] for Kinect to segment the depth. Even with manual segmentation, it is difficult to ensure boundary consistency for all the motions due to the limited depth accuracy of Kinect, especially for the hair.
2) For database setup, we did not use a special design for uniform lighting. Instead, we capture the data in a normal office environment with several fluorescent lights on the ceiling, which leads to some highlights and inconsistent color on the actor, particularly on the face.
The above analysis can be verified from the experiment using the data set from [7], in which there is no such artifact for our method because the data set in [7] is captured with a chroma-key background and uniform lighting. If a more accurate depth camera is used, the reconstructed mesh and the recovered texture will be better. However, we did not apply post-processing such as temporal blending and did not use an expensive depth camera, because we want to demonstrate that animation videos can be achieved in a simple way, using a single cheap RGB-D camera and capturing the database in a usual office environment.
In our experiments, we use only nine data sets, and usually only three of them are used for video synthesis after retrieval. If the new query motion is very different from the database motions, adding more data sets with similar kinds of motion might be helpful.
The main limitation of our approach is the computational complexity of the STR method. We run the algorithm on a laptop with an Intel(R) Core(TM) i5-4200M 2.5-GHz CPU and 4.0-GB RAM. The running time is about 10 min/frame, which is not suitable for real-time applications.
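To put the reported running time in perspective, the short calculation below estimates the total synthesis time for a clip and the slowdown relative to real time. It is only a back-of-the-envelope sketch: it assumes the synthesized video is produced at the 30 frames/s capture rate, which is not stated for the output, and the function names are hypothetical.

```python
def synthesis_hours(duration_s, fps=30.0, minutes_per_frame=10.0):
    """Hours needed to synthesize a clip of duration_s seconds.

    fps is an assumption (capture rate of the Kinect); minutes_per_frame
    is the approximate per-frame running time reported in the text.
    """
    frames = duration_s * fps
    return frames * minutes_per_frame / 60.0


def realtime_slowdown(fps=30.0, minutes_per_frame=10.0):
    """Ratio of the per-frame running time to the real-time frame budget."""
    seconds_per_frame = minutes_per_frame * 60.0
    return seconds_per_frame * fps
```

Under these assumptions, a 10-s clip corresponds to 300 frames and roughly 50 h of computation on the reported hardware.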
Our method does not include color correction and relighting. We try to minimize the color difference of different data sets by using uniform lighting when capturing the data set images. In future work, multisource co-color-correction methods can be used to adjust the colors of different database images to be consistent, and intrinsic image decomposition methods can be adopted together with relighting techniques to increase the realism of the generated video [40].

VIII. CONCLUSION
In this paper, we present a new photorealistic animation method using a single RGB-D camera. Based on a small database with some basic motions, we create a plausible video of an actor performing a new motion. We obtain more accurate deformed meshes for marker-less performance capture based on sparse representation. We also propose a sparse reconstruction method for textures using adaptive weighted low-rank matrix completion, which is less sensitive to noise and outliers. Our system is cheap and nonintrusive. We demonstrate the effectiveness of the proposed method on a public data set and our captured data sets. The potential of our method is verified on an extreme case in Fig. 6: all frames of the new motion were synthesized from the same frame while the visual quality is still quite good. Our method achieves compelling animations, despite the poor quality of the input database.