Real-Time Large Crowd Rendering with Efficient Character and Instance Management on GPU

Achieving the efficient rendering of a large animated crowd with realistic visual appearance is a challenging task when players interact with a complex game scene. We present a real-time crowd rendering system that efficiently manages multiple types of character data on the GPU and integrates seamlessly with level-of-detail and visibility culling techniques. The character data, including vertices, triangles, vertex normals, texture coordinates, skeletons, and skinning weights, are stored as either buffer objects or textures in accordance with their access requirements at the rendering stage. Our system preserves the view-dependent visual appearance of individual character instances in the crowd and is executed with a fine-grained parallelization scheme. We compare our approach with the existing crowd rendering techniques. The experimental results show that our approach achieves better rendering performance and visual quality. Our approach is able to render a large crowd composed of tens of thousands of animated instances in real time by managing each type of character data in a single buffer object.


Introduction
Crowd rendering is an important form of visual effects. In video games, thousands of computer-articulated polygonal characters with a variety of appearances can be generated to inhabit a virtual scene such as a village, a city, or a forest. Movements of the crowd are usually programmed through a crowd simulator [1][2][3][4] with given goals. To achieve a realistic visual approximation of the crowd, each character is usually tessellated with tessellation algorithms [5], which increases the character's mesh complexity to a sufficient level, so that fine geometric details and smooth mesh deformations can be preserved in the virtual scene. As a result, the virtual scene may end up with a composition of millions, or even hundreds of millions, of vertices and triangles. Rasterizing such a massive number of vertices and triangles into pixels incurs a high computational cost. Also, the amount of memory required to store them may exceed the storage capacity of the graphics hardware. Thus, in the production of video games [6][7][8][9], advanced crowd rendering technologies are needed in order to increase the rendering speed and reduce memory consumption while preserving the crowd's visual fidelity.
To improve the diversity of character appearances in a crowd, a common method is duplicating a character's mesh many times and then assigning each duplicate a different texture and a varied animation. Some advanced methods allow developers to modify the shape proportions of the duplicates and then retarget rigs and animations to the modified meshes [10,11] or synthesize new motions [12,13]. With the support of hardware-accelerated geometry-instancing and pseudo-instancing techniques [9,[14][15][16], multiple data of a character, including vertices, triangles, textures, skeletons, skinning weights, and animations, can be cached in the memory of a graphics processing unit (GPU). Each time the virtual scene needs to be rendered, the renderer alters and assembles those data dynamically without the need to fetch them from CPU main memory. However, storing the duplicates on the GPU consumes a large amount of memory and limits the number of instances that can be rendered. Furthermore, even though the instancing technique reduces the CPU-GPU communication overhead, it may suffer from the lack of dynamic mesh adaptation (e.g., continuous level-of-detail).
In this work, we present a rendering system that achieves a real-time rendering rate for a crowd composed of tens of thousands of animated characters. The system ensures full utilization of GPU memory and computational power through the integration with continuous level-of-detail (LOD) and View-Frustum Culling techniques. The size of memory allocated for each character is adjusted dynamically in response to the change of levels of detail, as the camera's viewing parameters change. The scene of the crowd may end up with more than one hundred million triangles. Different from existing instancing techniques, our approach is capable of rendering all different characters through a single buffer object for each type of data. The system encapsulates multiple data of each unique source character into buffer objects and textures, which can then be accessed quickly by shader programs on the GPU as well as maintained efficiently by a general-purpose GPU programming framework.
The rest of the paper is organized as follows. Section 2 reviews the previous works about crowd simulation and crowd rendering. Section 3 gives an overview of our system's rendering pipeline. In Section 4, we describe fundamentals of continuous LOD and animation techniques and discuss their parallelization on the GPU. Section 5 describes how to process and store the source character's multiple data and how to manage instances on the GPU. Section 6 presents our experimental results and compares our approach with the existing crowd rendering techniques. We conclude our work in Section 7.

Related Work
Simulation and rendering are two primary computing components in a crowd application. They are often tightly integrated as an entity to enable a special form of in situ visualization, which in general means data is rendered and displayed by a renderer in real time while a simulation is running and generating new data [17][18][19]. One example is the work presented by Hernandez et al. [20] that simulated a wandering crowd behavior and visualized it using animated 3D virtual characters on GPU clusters. Another example is the work presented by Perez et al. [21] that simulated and visualized crowds in a virtual city. In this section, we first briefly review some previous work contributing to crowd simulation. Then, more related to our work, we focus on acceleration techniques contributing to crowd rendering, including level-of-detail (LOD), visibility culling, and instancing techniques.
A crowd simulator uses macroscopic algorithms (e.g., continuum crowds [22], aggregate dynamics [23], vector fields [24], and navigation fields [25]) or microscopic algorithms (e.g., morphable crowds [26] and socially plausible behaviors [27]) to create crowd motions and interactions. Outcomes of the simulator are usually a successive sequence of time frames, and each frame contains arrays of positions and orientations in the 3D virtual environment. Each pair of position and orientation information defines the global status of a character at a given time frame. McKenzie et al. [28] developed a crowd simulator to generate noncombatant civilian behaviors which is interoperable with a simulation of modern military operations. Zhou et al. [29] classified the existing crowd modeling and simulation technologies based on the size and time scale of simulated crowds and evaluated them based on their flexibility, extensibility, execution efficiency, and scalability. Zhang et al. [30] presented a unified interaction framework on GPU to simulate the behavior of a crowd at interactive frame rates in a fine-grained parallel fashion. Malinowski et al. [31] were able to perform large-scale simulations that resulted in tens of thousands of simulated agents.
Visualizing a large number of simulated agents using animated characters is a challenging computing task and worth an in-depth study. Beacco et al. [9] surveyed previous approaches for real-time crowd rendering. They reviewed and examined existing acceleration techniques and pointed out that LOD techniques have been used widely in order to achieve high rendering performance, where a far-away character can be represented with a coarse version of the character as the alternative for rendering. A well-known approach is using discrete LOD representations, which are a set of offline simplified versions of a mesh. At the rendering stage, the renderer selects a desired version and renders it without any additional processing cost at runtime. However, discrete LODs require a large amount of memory because all simplified versions of the mesh must be stored. Also, as mentioned by Cleju and Saupe [32], discrete LODs could cause "popping" visual artifacts because of the unsmooth shape transition between simplified versions. Dobbyn et al. [33] introduced a hybrid rendering approach that combines image-based and geometry-based rendering techniques. They evaluated the rendering quality and performance in an urban crowd simulation. In their approach, the characters in the distance were rendered with image-based LOD representations, and they were switched to geometric representations when they were within a closer distance. Although the visual quality seemed better than using discrete LOD representations, popping artifacts also occurred when the renderer switched content between image representations and geometric representations. Ulicny et al. [34] presented an authoring tool to create crowd scenes of thousands of characters. To provide users immediate visual feedback, they used low-poly meshes for source characters. The meshes were kept in OpenGL's display lists on the GPU for fast rendering.
Characters in a crowd are polygonal meshes. The mesh is rigged by a skeleton. Rotations of the skeleton's joints transform surrounding vertices, and subsequently the mesh can be deformed to create animations. While LOD techniques for simplifying general polygonal meshes have been studied extensively (e.g., progressive meshes [35], quadric error metrics [36]), not many existing works studied how to simplify animated characters. Landreneau and Schaefer [37] presented mesh simplification criteria to preserve deforming features of animations on simplified versions of the mesh. Their approach was developed based on quadric error metrics, and they added simplification criteria with the consideration of vertices' skinning weights from the joints of the skeleton and the shape deviation between the character's rest pose and a deformed shape in animations. Their approach produced more accurate animations for dynamically simplified characters than many other LOD-based approaches, but it incurred a higher computational cost, so it may be challenging to integrate their approach into a real-time application. Willmott [38] presented a rapid algorithm to simplify animated characters. The algorithm was developed based on the idea of vertex clustering. The author mentioned the possibility of implementing the algorithm on the GPU. However, in comparison to the algorithm with progressive meshes, it did not produce well-simplified characters that preserve fine features of character appearance. Feng et al. [39] employed triangular-char geometry images to preserve the features of both static and animated characters. Their approach achieved high rendering performance by implementing geometry images with multiresolutions on the GPU. In their experiment, they demonstrated a real-time rendering rate for a crowd composed of 15.3 million triangles. However, there could be a potential LOD adaptation issue if the geometry images become excessively large. Peng et al. [8] proposed a GPU-based LOD-enabled system to render crowds along with a novel texture-preserving algorithm on simplified versions of the character. They employed a continuous LOD technique to refine or reduce mesh details progressively during the runtime. However, their approach was based on the simulation of single virtual humans. Instantiating and rendering multiple types of characters were not possible in their system. Savoy et al. [40] presented a web-based crowd rendering system that employed the discrete LOD and instancing techniques.
Visibility culling is another type of acceleration technique for crowd rendering. With visibility culling, a renderer is able to reject a character from the rendering pipeline if it is outside the view frustum or blocked by other characters or objects. Visibility culling techniques do not cause any loss of visual fidelity on visible characters. Tecchia et al. [41] performed efficient occlusion culling for a highly populated scene. They subdivided the virtual environmental map into a 2D grid and used it to build a KD-tree of the virtual environment. The large and static objects in the virtual environment, such as buildings, were used as occluders. Then, an occlusion tree was built at each frame and merged with the KD-tree. Barczak et al. [42] integrated GPU-accelerated View-Frustum Culling and occlusion culling techniques into a crowd rendering system. The system used a vertex shader to test whether or not the bounding sphere of a character intersects with the view frustum. A hierarchical Z buffer image was built dynamically in a vertex shader in order to perform occlusion culling. Hernandez and Isaac Rudomin [43] combined View-Frustum Culling and LOD Selection. Desired detail levels were assigned only to the characters inside the view frustum.
Instancing techniques have been commonly used for crowd rendering. Their execution is accelerated by GPUs with graphics APIs such as DirectX and OpenGL. Zelsnack [44] presented coding details of the pseudo-instancing technique using the OpenGL Shading Language (GLSL). The pseudo-instancing technique requires per-instance calls sent to and executed on the GPU. Carucci [45] introduced the geometry-instancing technique, which renders all vertices and triangles of a crowd scene through a geometry shader using one call. Millan and Rudomin [14] used the pseudo-instancing technique for rendering full-detail characters which were closer to the camera. The far-away characters with low details were rendered using impostors (an image-based approach). Ashraf and Zhou [46] used a hardware-accelerated method through programmable shaders to animate crowds. Klein et al. [47] presented an approach to render configurable instances of 3D characters for the web. They improved XML3D to store 3D content in a more efficient way in order to support an instancing-based rendering mechanism. However, their approach lacked support for multiple character assets.

System Overview
Our crowd rendering system first preprocesses source characters and then performs runtime tasks on the GPU. Figure 1 illustrates an overview of our system. Our system integrates View-Frustum Culling and continuous LOD techniques.
At the preprocessing stage, a fine-to-coarse progressive mesh simplification algorithm is applied to every source character. In accordance with the edge-collapsing criteria [8,35], the simplification algorithm iteratively selects edges, collapses them by merging adjacent vertices, and removes the triangles containing the collapsed edges. The edge-collapsing operations are stored as data arrays on the GPU. Vertices and triangles can be recovered by splitting the collapsed edges and are restored with respect to the order of applying coarse-to-fine splitting operations. Vertex normal vectors are used in our system to determine a proper shading effect for the crowd. A bounding sphere is computed for each source character. It tightly encloses all vertices in all frames of the character's animation. The bounding sphere will be used during the runtime to test an instance against the view frustum. Note that bounding spheres may differ in size because the sizes of source characters may differ. Other data including textures, UVs, skeletons, skinning weights, and animations are packed into textures. They can be accessed quickly by shader programs and the general-purpose GPU programming framework during the runtime.
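The bounding-sphere step and the runtime test against the view frustum can be illustrated with a minimal CPU-side sketch. It assumes a simple construction (centre of the axis-aligned box, radius to the farthest vertex), which encloses every vertex of every frame but is not guaranteed to be the tightest sphere, and it assumes the frustum is given as inward-facing unit planes. The function names and the plane representation are hypothetical and not taken from the system described here.

```python
import math

def bounding_sphere(frames):
    """Sphere enclosing every vertex of every animation frame.

    `frames` is a list of frames; each frame is a list of (x, y, z) tuples.
    Assumption: box centre plus farthest vertex, not the minimal sphere.
    """
    points = [p for frame in frames for p in frame]
    lo = [min(p[k] for p in points) for k in range(3)]
    hi = [max(p[k] for p in points) for k in range(3)]
    center = tuple((lo[k] + hi[k]) / 2.0 for k in range(3))
    radius = max(math.dist(center, p) for p in points)
    return center, radius

def sphere_visible(center, radius, planes):
    """True if the sphere is inside or intersects the frustum.

    `planes` holds (a, b, c, d) tuples with unit normals pointing into the
    frustum, so a point is on the inner side when a*x + b*y + c*z + d >= 0.
    The test is conservative: it never rejects a visible sphere.
    """
    for a, b, c, d in planes:
        signed = a * center[0] + b * center[1] + c * center[2] + d
        if signed < -radius:
            return False  # the whole sphere lies behind this plane
    return True
```

In a GPU implementation the same per-instance test would run in parallel, one thread or shader invocation per instance, after the sphere centre has been moved by the instance's global transformation.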
The runtime pipeline of our system is executed on the GPU through five parallel processing components. We use an instance ID in shader programs to track the index of each instance, which corresponds to the occurrence of a source character at a global location and orientation in the virtual scene. A unique source character ID is assigned to each source character, which is used by an instance to index back to the source character that it is instantiated from. We assume that the desired number of instances is provided by users as a parameter in the system configuration. The global positions and orientations of instances simulated from a crowd simulator are passed into our system as input. They determine where the instances should occur in the virtual scene. The component of View-Frustum Culling determines the visibility of instances. An instance will be considered to be visible if its bounding sphere is inside or intersects with the

Fundamentals of LOD Selection and Character Animation
During the preprocessing step, the mesh of each source character is simplified by collapsing edges. As in existing work, the collapsing criteria in our approach preserve features at high-curvature regions [39] and avoid collapsing the edges on or crossing texture seams [8]. Edges are collapsed one-by-one. We utilized the same method presented in [48], which saved collapsing operations into an array structure suitable for the GPU architecture. The index of each array element represents the source vertex and the value of the element represents the target vertex it merges to. By using the array of edge collapsing, the repositories of vertices and triangles are rearranged in an increasing order, so that, at runtime, the desired complexity of a mesh can be generated by selecting a successive sequence of vertices and triangles from the repositories. Then, the skeleton-based animations are applied to deform the simplified meshes. Figure 2 shows the different levels of detail of several source characters that are used in our work. In this section, we briefly describe the techniques of LOD Selection and character animation.

LOD Selection.
Let us denote $n$ as the total number of instances in the virtual scene. A desired level of detail for an instance can be represented as the pair $\{v, t\}$, where $v$ is the desired number of vertices and $t$ is the desired number of triangles. Given a value of $v$, the value of $t$ can be retrieved from the prerecorded edge-collapsing information [48]. Thus, the goal of LOD Selection is to determine an appropriate value of $v$ for each instance, with consideration of the available GPU memory size and the instance's spatial relationship to the camera. If an instance is outside the view frustum, $v$ is set to zero. For the instances inside the view frustum, we used the LOD Selection metric in [48] to compute $v$, as shown in Equation (1). Equation (1) is the same as the first-pass algorithm presented by Peng and Cao [48]. It originates from the model perception method presented by Funkhouser et al. [49] and is improved by Peng and Cao [48,50] to accelerate the rendering of large CAD models. We found that (1) is also a suitable metric for large crowd rendering. In the equation, $N$ refers to the total number of vertices that can be retained on the GPU, which is a user-specified value computed based on the available size of GPU memory. The value of $N$ can be tuned to balance the rendering performance and visual quality. $w_i$ is the weight computed with the projected area of the bounding sphere of the $i$th instance on the screen ($A_i$) and the distance to the camera ($D_i$). $\alpha$ is the perception parameter introduced by Funkhouser et al. [49]. In our work, the value of $\alpha$ is set to 3.
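To illustrate how a vertex budget can be shared among visible instances, the following minimal sketch assumes that each instance receives a share of the budget proportional to its weight raised to the power of one over the perception parameter, capped at the vertex count of its source character. The proportional form, the cap, and all function and parameter names are assumptions made for this example and are not claimed to reproduce the paper's Equation (1); the per-instance weights are taken as given inputs rather than derived from the projected area and camera distance.

```python
def allocate_vertex_budget(weights, max_vertices, budget, alpha=3.0):
    """Split `budget` vertices across instances (illustrative only).

    weights[i]      -- importance of instance i; 0 for culled instances
    max_vertices[i] -- vertex count of the instance's full-detail mesh
    budget          -- total number of vertices retained on the GPU
    alpha           -- perception parameter (the text uses 3)
    """
    # Assumed form: share proportional to w ** (1 / alpha).
    shares = [w ** (1.0 / alpha) if w > 0 else 0.0 for w in weights]
    total = sum(shares)
    if total == 0:
        return [0] * len(weights)  # nothing is visible
    # Truncate to whole vertices and never exceed the full-detail mesh.
    return [min(int(budget * s / total), m)
            for s, m in zip(shares, max_vertices)]
```

Because culled instances carry a zero weight, they receive zero vertices and the whole budget goes to the instances inside the view frustum.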
With $v$ and $t$, the successive sequences of vertices and triangles are retrieved from the vertex and triangle repositories of the source character. By applying the parallel triangle reformation algorithm [48,50], the desired shape of the simplified mesh is generated using the selected vertices and triangles.
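The selection and reformation step can be sketched sequentially as follows. The sketch assumes that the repositories are ordered so that the first selected vertices and triangles are exactly the ones that survive at the chosen level, and that every collapse target has a smaller index than the vertex merged into it, so following the collapse array always terminates. The function and parameter names are hypothetical; the cited parallel algorithm would process one triangle per GPU thread instead of looping.

```python
def reform_triangles(triangles, collapse_to, v, t):
    """Rebuild the index list of a simplified mesh (illustrative only).

    triangles   -- full-detail triangles, ordered so the first t survive
    collapse_to -- collapse_to[i] is the vertex that vertex i merges into
                   (assumed: collapse_to[i] < i for every collapsed vertex)
    v, t        -- numbers of vertices and triangles chosen by LOD Selection
    """
    def resolve(i):
        # Follow the collapse chain until a retained vertex is reached.
        while i >= v:
            i = collapse_to[i]
        return i

    return [tuple(resolve(i) for i in tri) for tri in triangles[:t]]

# Four triangles around centre vertex 4. Collapsing vertex 4 into vertex 0
# removes the two triangles that contain edge (0, 4); they are stored last.
tris = [(1, 2, 4), (2, 3, 4), (0, 1, 4), (3, 0, 4)]
collapse = [0, 0, 1, 2, 0]
coarse = reform_triangles(tris, collapse, v=4, t=2)
```

With all five vertices and four triangles selected the function returns the original index list unchanged, while the coarser level above yields two triangles that reference only the four retained vertices.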

Animation.
In order to create character animations, each LOD mesh has to be bound to a skeleton along with skinning weights added to influence the movement of the mesh's vertices. As a result, the mesh will be deformed by rotating joints of the skeleton. As we mentioned earlier, each vertex may be influenced by a maximum of four joints. We want to note that the vertices forming the LOD mesh are a subset of the original vertices of the source character. No new vertex is introduced during the preprocessing step of mesh simplification. Because of this, we were able to use the original skinning weights to influence the LOD mesh. When transformations defined in an animation frame are loaded on the joints, the final vertex position is computed by summing the weighted transformations of the skinning joints. Let us denote each of the four joints influencing a vertex $v$ as $J_i$, where $i \in [0, 3]$. The weight of $J_i$ on the vertex $v$ is denoted as $W_{J_i}$. Thus, the final position of the vertex $v$, denoted as $P'_v$, can be computed by using Equation (2). In (2), $P_v$ is the vertex position at the time when the mesh is bound to the skeleton. When the mesh is first loaded without the use of animation data, the mesh is placed in the initial binding pose. When using an animation, the inverse of the binding pose needs to be multiplied by an animated pose. This is reflected in the equation, where

Source Character and Instance Management
Geometry-instancing and pseudo-instancing techniques are the primary solutions for rendering a large number of instances, while allowing the instances to have different global transformations. The pseudo-instancing technique is used in OpenGL and calls instances' drawing functions one-by-one. The geometry-instancing technique has been included in Direct3D since version 9 and in OpenGL since version 3.3. It advances the pseudo-instancing technique in terms of reducing the number of drawing calls. It supports the use of a single drawing call for instances of a mesh and therefore reduces the communication cost of sending call requests from CPU to GPU and subsequently increases the performance. As regards data storage on the GPU, buffer objects are used for shader programs to access and update data quickly. A buffer object is a contiguous memory block on the GPU and allows the renderer to rasterize data in a retained mode. In particular, a vertex buffer object (VBO) stores vertices. An index buffer object (IBO) stores indices of vertices that form triangles or other polygonal types used in our system.

International Journal of Computer Games Technology
In particular, the geometry-instancing technique requires a single copy of vertex data maintained in the VBO, a single copy of triangle data maintained in the IBO, and a single copy of distinct world transformations of all instances. However, if the source character has a high geometric complexity and there are many instances, the geometry-instancing technique may make the uniform data type in shaders hit the size limit, due to the large number of vertices and triangles sent to the GPU. In such a case, the drawing call has to be broken into multiple calls.
There are two types of implementations for instancing techniques: static batching and dynamic batching [45]. The single-call method in the geometry-instancing technique is implemented with static batching, while the multicall method in both the pseudo-instancing and geometry-instancing techniques is implemented with dynamic batching. In static batching, all vertices and triangles of the instances are saved into a VBO and IBO. In dynamic batching, the vertices and triangles are maintained in different buffer objects and drawn separately. The implementation with static batching has the potential to fully utilize the GPU memory, while dynamic batching would underutilize the memory. The major limitation of static batching is the lack of LOD and skinning support. This limitation makes static batching unsuitable for rendering animated instances, though it has been shown to be faster than dynamic batching in terms of the performance of rasterizing meshes.
In our work, the storage of instances is managed similarly to the implementation of static batching, while individual instances can still be accessed similarly to the implementation of dynamic batching. Therefore, our approach can be seamlessly integrated with LOD and skinning techniques, while making use of a single VBO and IBO for fast rendering. This section describes the details of our contributions for character and instance management, including texture packing, UV-guided mesh rebuilding, and instance indexing.

Packing Skeleton, Animation, and Skinning Weights into Textures.
Smooth deformation of a 3D mesh is a computationally expensive process because each vertex of the mesh needs to be repositioned by the joints that influence it. We packed the skeleton, animations, and skinning weights into 2D textures on the GPU, so that shader programs can access them quickly. The skeleton is the binding pose of the character. As explained in (2), the inverse of the binding pose's transformation is used during the runtime. In our approach, we stored this inverse into the texture as the skeletal information. For each joint of the skeleton, instead of storing individual translation, rotation, and scale values, we stored their composed transformation in the form of a 4 × 4 matrix. Each joint's binding pose transformation matrix takes four RGBA texels for storage. Each RGBA texel stores a row of the matrix. Each channel stores a single element of the matrix. By using OpenGL, matrices are stored in the GL_RGBA32F format in the texture, which is a 32-bit floating-point type for each channel in one texel. Let us denote the total number of joints in a skeleton as $n_j$. Then, the total number of texels to store the entire skeleton is $4n_j$. We used the same format for storing an animation as for storing the skeleton. Each animation frame needs $4n_j$ texels to store the joints' transformation matrices. Let us denote the total number of frames in an animation as $n_f$. Then, the total number of texels for storing the entire animation is $4n_jn_f$. For each animation frame, the matrix elements are saved into successive texels in the row order. Here we want to note that each animation frame starts from a new row in the texture.
The skinning weights of each vertex are four values in the range of [0, 1], where each value represents the influencing percentage of a skinning joint.For each vertex, the skinning weights require eight data elements, where the first four data elements are joint indices, and the last four are the corresponding weights.In other words, each vertex requires two RGBA texels to store the skinning weights.The first texel is used to store joint indices, and the second texel is used to store weights.
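The packing layout for joint matrices can be sketched on the CPU with plain lists standing in for a floating-point RGBA texture. The sketch assumes a row-major texture of a chosen width, with texels addressed through a linear index so that a matrix may wrap from one texture row to the next; it does not model the rule that each animation frame starts on a new row. All function names are hypothetical.

```python
def texel_coords(index, width, texels_per_item, k=0):
    """(column, row) of the k-th texel of item `index` in a row-major texture."""
    linear = index * texels_per_item + k
    return linear % width, linear // width

def pack_matrices(matrices, width):
    """Pack 4x4 matrices into rows of RGBA texels, one matrix row per texel."""
    texels = [tuple(row) for m in matrices for row in m]
    while len(texels) % width:               # pad the last texture row
        texels.append((0.0, 0.0, 0.0, 0.0))
    return [texels[r:r + width] for r in range(0, len(texels), width)]

def fetch_matrix(texture, index, width):
    """Read matrix `index` back: four successive texels, four channels each."""
    rows = []
    for k in range(4):
        col, row = texel_coords(index, width, 4, k)
        rows.append(list(texture[row][col]))
    return rows
```

The same addressing covers skinning weights by passing 2 instead of 4 as the number of texels per item: the first texel of a vertex holds the four joint indices and the second holds the four weights.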

UV-Guided Mesh Rebuilding.
A 3D mesh is usually a seamless surface without boundary edges. The mesh has to be cut and unfolded into flattened 2D patches before a texture image can be mapped onto it. To do this, some edges have to be selected properly as boundary edges, along which the mesh can be cut. The relationship between the vertices of a 3D mesh and 2D texture coordinates can be described as a texture mapping function $F(x, y, z) \rightarrow \{(s_i, t_i)\}$. Inner vertices (those not on boundary edges) have a one-to-one texture mapping. In other words, each inner vertex is mapped to a single pair of texture coordinates. For the vertices on boundary edges, since boundary edges are the cutting seams, a boundary vertex needs to be mapped to multiple pairs of texture coordinates. Figure 3 shows an example that unfolds a cube mesh and maps it into a flattened patch in 2D texture space. In the figure, $T_i$ stands for a point in the 2D texture space. Each vertex on the boundary edges is mapped to more than one point: $F(v_0) = \{T_1, T_3\}$, $F(v_3) = \{T_{10}, T_{12}\}$, $F(v_4) = \{T_0, T_4, T_6\}$, and $F(v_7) = \{T_7, T_9, T_{13}\}$.
In a hardware-accelerated renderer, texture coordinates are indexed from a buffer object, and each vertex should be associated with a single pair of texture coordinates. Since the texture mapping function produces more than one pair of texture coordinates for boundary vertices, we conducted a mesh rebuilding process to duplicate boundary vertices and mapped each duplicate to a unique texture point. By doing this, although the number of vertices increases due to the cuts along boundary edges, the number of triangles is the same as in the original mesh. In our approach, we initialized two arrays to store UV information. One array stores texture coordinates; the other stores the indices of texture points with respect to the order of triangle storage. Algorithm 1 shows the algorithmic process to duplicate boundary vertices by looping through all triangles. In the algorithm, $V$ is the array of original vertices storing 3D coordinates $(x, y, z)$. $T$ is the array of original triangles storing the sequence of vertex indices. Similar to $T$, $T_{uv}$ is the array of indices of texture points in 2D texture space and represents the same triangular topology as the mesh. Note that the order of triangle storage for the mesh is the same as the order of triangle storage for the 2D texture patches. $V_n$ is the array of vertex normal vectors. $n_t$ is the total number of original triangles, and $n_{uv}$ is the number of texture points in 2D texture space.
Algorithm 1: UV-Guided mesh rebuilding algorithm.
After rebuilding the mesh, the number of vertices in $V'$ will be identical to the number of texture points $n_{uv}$, and the array of triangles ($T'$) is replaced by the array of indices of the texture points ($T_{uv}$).
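The outcome described above (one vertex per texture point, triangles re-indexed by texture-point indices) can be reproduced with the following minimal sketch. It is written from the prose description only and is not a transcription of Algorithm 1; the function and parameter names are hypothetical, and it assumes that the mesh triangles and the texture-space triangles are stored in the same order with matching corner order.

```python
def rebuild_mesh(vertices, normals, triangles, uv_triangles, num_uv_points):
    """Duplicate seam vertices so each vertex owns exactly one texture point.

    vertices, normals -- per-vertex data of the original mesh
    triangles         -- vertex indices per triangle
    uv_triangles      -- texture-point indices per triangle (same order)
    num_uv_points     -- number of points in 2D texture space
    """
    new_vertices = [None] * num_uv_points
    new_normals = [None] * num_uv_points
    for tri, uv_tri in zip(triangles, uv_triangles):
        for v_idx, uv_idx in zip(tri, uv_tri):
            # A boundary vertex is copied once per texture point it maps to.
            new_vertices[uv_idx] = vertices[v_idx]
            new_normals[uv_idx] = normals[v_idx]
    # The texture-space index list becomes the triangle list of the new mesh.
    return new_vertices, new_normals, [tuple(t) for t in uv_triangles]
```

For a quad made of two triangles that are cut apart along their shared edge in texture space, the four original vertices become six, while the triangle count stays at two.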

Source Character and Instance Indexing.
After applying the data packing and mesh rebuilding methods presented in Sections 5.1 and 5.2, the multiple data of a source character are organized into GPU-friendly data structures. The character's skeleton, skinning weights, and animations are packed into textures and are read-only in shader programs on the GPU. The vertices, triangles, texture coordinates, and vertex normal vectors are stored in arrays and retained on the GPU. During the runtime, based on the LOD Selection result (see Section 4.1), a successive subsequence of vertices, triangles, texture coordinates, and vertex normal vectors is selected for each instance and maintained as single buffer objects. As mentioned in Section 4.1, the simplified instances are constructed in a parallel fashion through a general-purpose GPU programming framework. Then, the framework interoperates with the GPU's shader programs and allows shaders to perform rendering tasks for the instances. Because continuous LOD and animated instancing techniques are assumed to be used in our approach, instances have to be rendered one-by-one, which is the same as the way of rendering animated instances in geometry-instancing and pseudo-instancing techniques. However, our approach needs to construct the data within one execution call, rather than dealing with per-instance data.
Figure 4: An example illustrating the data structures for storing vertices and triangles of three instances in VBO and IBO, respectively. Those data are stored on the GPU and all data operations are executed in parallel on the GPU. The VBO and IBO store data for all instances that are selected from the array of original vertices and triangles of the source characters. The $v$ and $t$ arrays are the LOD Selection result.
Figure 4 illustrates the data structures of storing VBO and IBO on the GPU. Based on the result of LOD Selection, each instance is associated with a $v$ and a $t$ (see Section 4.1) that represent the numbers of vertices and triangles selected based on the current view setting. We employed CUDA Thrust [51] to process the arrays of $v$ and $t$ using the prefix sum algorithm in a parallel fashion. As a result, for example, each $v[i]$ represents the offset of vertex count prior to the $i$th instance, and the number of vertices for the $i$th instance is $v[i+1] - v[i]$. Algorithm 2 describes the vertex transformation process, which runs in parallel in the vertex shader. It transforms the instance's vertices to their destination positions while the instance is being animated. In the algorithm, charNum represents the total number of source characters. The inverses of the binding pose skeletons, the skinning weights, and the animations are each stored as a texture array with charNum entries; we used a walk animation for each source character. The global 4 × 4 transformation matrix of the instance in the virtual scene is also an input. This algorithm is developed based on the data packing formats described in Section 5.1. Each source character is assigned a unique source character ID. The drawing calls are issued per instance, so this ID is passed into the shader as an input parameter. A lookup function computes the coordinates in the texture space to locate which texel to fetch. Its input includes the current vertex or joint index that needs to be mapped, the width and height of the texture, and the number of texels associated with the vertex or joint. For example, to retrieve a vertex's skinning weights, the number of texels is set to 2; to retrieve a joint's transformation matrix, it is set to 4.
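The offset computation from per-instance counts can be illustrated with a sequential exclusive prefix sum. This is a CPU stand-in for the parallel scan performed on the GPU, and the function names are hypothetical. The sketch appends the grand total as one extra trailing entry, an assumption made here so that the count of the last instance can also be read as a difference of neighbouring offsets.

```python
from itertools import accumulate

def exclusive_offsets(counts):
    """Exclusive prefix sum with one extra trailing entry holding the total."""
    return [0] + list(accumulate(counts))

def instance_range(offsets, i):
    """(start, count) of instance i inside the shared buffer."""
    return offsets[i], offsets[i + 1] - offsets[i]

# Vertex counts chosen by LOD Selection; the second instance was culled.
vertex_counts = [120, 0, 45]
vertex_offsets = exclusive_offsets(vertex_counts)
```

Here the third instance starts at element 120 of the shared vertex buffer and occupies 45 elements; the same scan applied to the triangle counts gives the ranges in the shared index buffer.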
In the transformation function, vertices of the instance are transformed in parallel by a composed matrix computed from a weighted sum of the matrices of the skinning joints. The texel-fetch function takes a texture and coordinates in texture space as input and returns the values encoded in the texels; such a function is usually provided by the shader programming framework. Unlike static models, animated characters change their geometric shapes over time due to continuous pose changes in the animation. The current frame index of the instance's animation is updated in the main loop during the execution of the program.
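The two steps described above, locating an item's texels and blending the joint matrices, can be sketched in plain Python as a CPU reference. All names here (`texel_coord`, `skin_vertex`, and the helpers) are hypothetical. The sketch assumes a row-major texel layout, sampling at texel centers, column-vector matrices, and the common linear-blend form in which each joint contributes its current-frame transform multiplied by its inverse binding transform; the actual Algorithm 2 may differ in these details.

```python
def texel_coord(index, width, height, texels_per_item):
    """Normalized coordinates of the first texel of item `index`.

    Assumption: items are stored row-major, each occupying
    `texels_per_item` consecutive texels (e.g. 2 for skinning weights,
    4 for a 4x4 joint matrix), and texels are sampled at their centers.
    """
    texel = index * texels_per_item
    x, y = texel % width, texel // width
    return ((x + 0.5) / width, (y + 0.5) / height)


def identity():
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def translation(tx, ty, tz):
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m


def mat_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def mat_vec(m, v):
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]


def skin_vertex(position, joint_ids, weights, inv_bind, frame_pose, instance_matrix):
    """Transform one vertex by a weighted sum of joint matrices.

    inv_bind[j] and frame_pose[j] are the inverse binding transform and
    the current-frame transform of joint j; weights should sum to 1.
    """
    blended = [[0.0] * 4 for _ in range(4)]
    for j, w in zip(joint_ids, weights):
        joint_matrix = mat_mul(frame_pose[j], inv_bind[j])
        for r in range(4):
            for c in range(4):
                blended[r][c] += w * joint_matrix[r][c]
    x, y, z = position
    skinned = mat_vec(blended, [x, y, z, 1.0])
    # Finally place the animated vertex in the scene.
    return mat_vec(instance_matrix, skinned)[:3]
```

In a shader, the per-joint matrices and weights would be read from the packed textures at the coordinates returned by a function like `texel_coord`, and one invocation would run per vertex.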

Experiment and Analysis
We implemented our crowd rendering system on a workstation with an Intel i7-5930K 3.50 GHz CPU, 64 GB of RAM, and an Nvidia GeForce GTX 980 Ti graphics card with 6 GB of memory. The rendering system is built with Nvidia CUDA Toolkit 9.2 and OpenGL. We employed 14 unique source characters. Table 1 shows the data configurations of those source characters. Each source character is articulated with a skeleton, fully skinned, and animated with a walk animation. Each one contains an unfolded texture UV set along with a texture image at a resolution of 2048 × 2048. We stored those source characters on the GPU. All source characters were also preprocessed by the mesh simplification and animation algorithms described in Section 4. We stored the edge-collapsing information and the character bounding spheres in arrays on the GPU. In total, the source characters require 184.50 MB of memory for storage. The mesh data are much smaller than the texture images: they require 16.50 MB of memory, which is only 8.94% of the total memory consumed. At initialization of the system, we randomly assigned a source character to each instance.

Visual Fidelity and Performance Evaluations.
As defined in Section 4.1, N is a memory budget parameter that determines the geometric complexity and the visual quality of the entire crowd. For each instance in the crowd, the corresponding bounding sphere is tested against the view frustum to determine its visibility. The value of N is distributed only across visible instances.
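The bounding-sphere visibility test mentioned above can be sketched as follows. This is a standard conservative sphere-versus-plane test written as a CPU illustration; the function names and the plane representation (unit normals pointing into the frustum) are assumptions of this sketch, and how N is then divided among the visible instances is defined in Section 4.1 and not reproduced here.

```python
def sphere_in_frustum(center, radius, planes):
    """Return True if the sphere may intersect the frustum.

    planes: iterable of (a, b, c, d) with unit normal (a, b, c) pointing
    into the frustum, so a point p is inside a plane when
    a*p.x + b*p.y + c*p.z + d >= 0. The test is conservative: spheres
    near frustum corners can be reported visible although they are not.
    """
    cx, cy, cz = center
    for a, b, c, d in planes:
        if a * cx + b * cy + c * cz + d < -radius:
            return False  # entirely behind this plane
    return True


def visible_instances(spheres, planes):
    """Indices of instances whose bounding spheres pass the test.

    spheres: list of (center, radius). Serial stand-in for a test that
    would run with one thread per instance on the GPU.
    """
    return [i for i, (center, radius) in enumerate(spheres)
            if sphere_in_frustum(center, radius, planes)]
```

Only the indices returned by `visible_instances` would take part in the subsequent budget distribution.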
We created a walkthrough camera path for the rendering of the crowd. The camera path emulates a gaming navigation behavior and produces a total of 1,000 frames. The entire crowd contains 30,000 instances spread out in the virtual scene with predefined movements and moving trajectories.
Figure 5 shows a rendered frame of the crowd with the value of N set to 1, 6, and 10 million, respectively. The main camera moves on the walkthrough path. The reference camera is aimed at the instances far away from the main camera and shows a close-up view of those far-away instances. Our LOD-based instancing method ensures that the total number of selected vertices and triangles stays within the specified memory budget while preserving the fine details of instances that are closer to the main camera. Although the far-away instances are simplified significantly, their long distances to the main camera mean that their appearance in the main camera does not cause a loss of visual fidelity. Figure 5(a) shows the visual appearance of the crowd rendered from the viewpoint of the main camera (top images), in which far-away instances are rendered using the simplified versions (bottom images).

If all instances were rendered at the full level of detail, the total number of triangles would be 169.07 million. Over the walkthrough camera path, an average of 15,771 instances were inside the view frustum. The maximum and minimum numbers of instances inside the view frustum are 29,967 and 5,038, respectively. We specified different values for N. Table 2 shows the performance breakdown with regard to the runtime processing components in our system. In the table, the "# of Rendered Triangles" column includes the minimum, maximum, and average numbers of triangles selected at runtime. As we can see, the higher the value of N, the more triangles are selected to generate simplified instances, and the better the visual quality of the crowd. Our approach is memory efficient. Even when N is set to a large value, such as 20 million, the crowd needs only 26.23 million triangles on average, which is only 15.51% of the original number of triangles. When the value of N is small, the difference between the average and the maximum number of triangles is significant. For example, when N is equal to 5 million, the ratio of the average to the maximum is 73.96%. This indicates that the number of triangles in the crowd varies significantly with the change of instance-camera relationships (including the instances' distances to the camera and their visibility). This is because a small

Comparison and Discussion
We analyzed two rendering techniques and compared them against our approach in terms of performance and visual quality. The pseudo-instancing technique minimizes the amount of data duplication by sharing vertices and triangles among all instances, but it does not support LOD on a per-instance level [44,52].
The point-based technique renders a complex geometry by using a dense set of sampled points in order to reduce the computational cost of rendering [53,54]. The pseudo-instancing technique does not support View-Frustum Culling. For a fair comparison, we therefore ensured that all instances were inside the view frustum of the camera by fixing the position of the camera and the positions of all instances, so that all instances are processed and rendered by our approach. The complexity of each instance rendered by the point-based technique is selected based on its distance to the camera, which is similar to our LOD method. When an instance is near the camera, the original mesh is used for rendering; when the instance is located far away from the camera, a set of points is approximated as sample points to represent the instance. In this comparison, the pseudo-instancing technique always renders the original meshes of instances. We chose two different values of N (5 million and 10 million) for rendering with our approach.

As shown in Figure 7, our approach results in better performance than the pseudo-instancing technique. This is because the number of triangles rendered by the pseudo-instancing technique is much larger than the number of triangles determined by the LOD Selection component of our approach. The performance of our approach becomes better than that of the point-based technique as the number of  increases. Figure 8 shows the comparison of visual quality among our approach, the pseudo-instancing technique, and the point-based technique. The image generated by the pseudo-instancing technique represents the original quality. Our approach achieves better visual quality than the point-based technique. As we can see in the top images of the figure, the instances far away from the camera rendered by the point-based technique appear to have "holes" due to the disconnection between vertices. In addition, the popping artifact appears when using the point-based technique. This is because the technique uses a limited number of detail levels, as in discrete LOD. Our approach avoids the popping artifact because continuous LOD representations of the instances are applied during rendering.
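The difference between a limited number of detail levels and a continuous representation can be illustrated with a toy calculation. This sketch is purely hypothetical: the distance thresholds, triangle counts, and the linear falloff are invented for illustration, and the continuous LOD in our system is driven by the budget-based selection of Section 4.1 rather than by this formula. It only shows why a small camera movement can cause a large jump in detail (popping) in the discrete case.

```python
def discrete_lod(distance, levels):
    """Pick a triangle count from a few fixed levels.

    levels: list of (max_distance, triangle_count), sorted by distance.
    The last level is used for anything farther away.
    """
    for max_distance, triangles in levels:
        if distance <= max_distance:
            return triangles
    return levels[-1][1]


def continuous_lod(distance, full_triangles, min_triangles, near, far):
    """Triangle count that changes gradually with distance.

    Assumption for illustration only: a linear falloff between `near`
    and `far`, clamped outside that range.
    """
    t = min(max((distance - near) / (far - near), 0.0), 1.0)
    return round(full_triangles + t * (min_triangles - full_triangles))
```

With levels such as `[(50.0, 10000), (100.0, 2000)]`, moving an instance from a distance of 49.9 to 50.1 drops the discrete count from 10,000 to 2,000 triangles at once, whereas the continuous count changes by only a few triangles over the same movement.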

Conclusion
In this work, we presented a crowd rendering system that takes advantage of GPU power for both general-purpose and graphics computations. We rebuilt the meshes of source characters based on the flattened pieces of texture UV sets. We organized the source characters and instances into buffer objects and textures on the GPU. Our approach is integrated seamlessly with continuous LOD and View-Frustum Culling techniques. Our system maintains the visual appearance by assigning each instance an appropriate level of detail. We evaluated our approach with a crowd composed of 30,000 instances and achieved real-time performance. In comparison with existing crowd rendering techniques, our approach better utilizes GPU memory and reaches a higher rendering frame rate.
In the future, we would like to integrate our approach with occlusion culling techniques to further reduce the number of vertices and triangles at runtime and improve the visual quality. We also plan to integrate our crowd rendering system with a simulation in a real game application. Currently, we use only a single walk animation in the crowd rendering system. In a real game application, more animation types should be added, and a motion graph should be created so that animations transition smoothly from one to another. We would also like to explore the possibility of porting our approach to a multi-GPU platform, where a more complex crowd could be rendered in real time with the support of the larger memory capacity and greater computational power provided by multiple GPUs.

Figure 1: The overview of our system.

For each skinning joint, the inverse of the joint's binding transformation, the joint's transformation in the current frame of the animation, and the transformation representing the instance's global position and orientation are each represented as a 4 × 4 matrix. Each skinning weight is a single float value, and the four weight values must sum to 1.

Figure 3: An example showing the process of unfolding a cube mesh and mapping it into a flattened patch in 2D texture space. (a) is the 3D cube mesh formed by 8 vertices and 6 triangles. (b) is the unfolded texture map formed by 14 pairs of texture coordinates and 6 triangles. Bold lines are the boundary edges (seams) along which the cube is cut, and the vertices in red are boundary vertices that are mapped into multiple pairs of texture coordinates. In (b), each label stands for a point in 2D texture space, and the vertex in parentheses is the corresponding vertex in the cube.

Figure 5: An example of the rendering result produced by our system using different N values. (a) shows the captured images with N = 1, 6, and 10 million. The top images are the frames rendered from the main camera. The bottom images are rendered based on the setting of the reference camera, which aims at the instances that are far away from the main camera. (b) shows the entire crowd including the view frustums of the two cameras. The yellow dots in (b) represent the instances outside the view frustum of the main camera.

Figure 7: The performance comparison of our approach, the pseudo-instancing technique, and the point-based technique over different numbers of instances. Two values of N are chosen for our approach (N = 5 million and N = 10 million). The FPS is averaged over the total of 1,000 frames.

Table 2: Performance breakdown for the system with a precreated camera path and a total of 30,000 instances. The FPS value and the component execution times are averaged over 1,000 frames. The total number of triangles (before the use of LOD) is 169.07 million.

Figure 6: The change of FPS over different values of N by using our approach. The FPS is averaged over the total of 1,000 frames.

value of N limits the level of detail that an instance can reach. Even if an instance is close to the camera, it may not obtain a sufficient share of N to satisfy a desired detail level. As we can see in the table, when the value of N becomes larger than 10 million, the ratio increases to 94%.

The View-Frustum Culling and LOD Selection components are implemented together, and both are executed in parallel at the instance level. Thus, the execution time of this component does not change as the value of N increases. The LOD Mesh Generation component is executed in parallel at the triangle level. Its execution time increases as the value of N increases. The Animating Meshes and Rendering components are executed with the acceleration of OpenGL's buffer objects and shader programs. They are time-consuming, and their execution time increases as more triangles need to be rendered.

Figure 6 shows the change of FPS over different values of N. As we can see in the figure, the FPS decreases as the value of N increases. When N is smaller than 4 million, the decreasing slope of FPS is small. This is because the change in the number of triangles over the frames of the camera path is small. When N is small, many close-up instances end up at the lowest level of detail due to the insufficient memory budget from N. When N increases from 4 to 17 million, the decreasing slope of FPS becomes larger. This is because the number of triangles over the frames of the camera path varies considerably with different values of N. As N increases beyond 17 million, the decreasing slope becomes smaller again, as many instances, including far-away ones, reach the full level of detail.