Nanomatrix: Scalable Construction of Crowded Biological Environments

We present a novel method for the interactive construction and rendering of extremely large molecular scenes, capable of representing multiple biological cells in atomistic detail. Our method is tailored for scenes that are procedurally constructed based on a given set of building rules. Rendering large scenes normally requires the entire scene to be available in-core, or alternatively out-of-core management that loads data into the memory hierarchy as part of the rendering loop. Instead of out-of-core memory management, we propose to procedurally generate the scene on demand, on the fly. The key idea is a positional- and view-dependent procedural scene-construction strategy, where only a fraction of the atomistic scene around the camera is available in GPU memory at any given time. The atomistic detail is populated into a uniform space partitioning using a grid that covers the entire scene. Most of the grid cells are not filled with geometry; only those that are potentially seen by the camera are populated. The atomistic detail is generated in a compute shader, and its representation is connected with acceleration data structures for hardware ray tracing on modern GPUs. Objects that are far away, where atomistic detail is not perceivable from the given viewpoint, are represented by a triangle mesh mapped with a seamless texture generated by rendering the atomistic geometry. The algorithm consists of two pipelines, the construction-compute pipeline and the rendering pipeline, which work together to render molecular scenes at an atomistic resolution far beyond the limit of GPU memory, containing trillions of atoms. We demonstrate our technique on multiple models of SARS-CoV-2 and the red blood cell.


Introduction
The cellular mesoscale describes the scales that bridge the atomistic nanoscale and the cellular microscale [1]. It starts from atoms that form larger molecules like proteins, up to the size where molecules further compose to form viruses, bacteria, or complex multi-compartmental cells. Visualizing biological structures at the mesoscale helps biologists to analyze and understand the architecture and functionality of life forms. Mesoscale models of enveloped viruses reach several tens of millions of atoms in size, and these structures are just around 100 nm in diameter. Such a number only describes their macromolecular composition. If we included water and small molecules, the number of atoms would at least double. Larger models of bacteria or complex cells reach the range of billions or trillions of atoms. E. coli, for example, contains roughly 15 billion C, N, and O atoms (excluding hydrogen) that form its macromolecular composition. Larger structures, such as the red blood cell (RBC), consist of trillions of atoms. Hemoglobin makes up about two-thirds of an RBC, amounting to circa 250 million hemoglobin molecules per cell. Just the position (3× float) and rotation (4× float) information of all hemoglobin instances in a single RBC amounts to 8 GB of data in storage. Storing atomistic information about a complete RBC, including its lipid-bilayer membrane, reaches or exceeds the memory capacity of current consumer GPUs. When attempting to store atomistic-detail information about a larger cell of 50 µm in diameter, the required memory rises to petabytes. The huge amount of data involved in mesoscale systems poses a challenge for visualization due to hardware limitations.

Figure 1: Red blood cell model of 8 µm diameter containing approx. 250 million hemoglobin molecules, with lipid bilayer and membrane-bound proteins (approx. 1.2 trillion atoms in total), constructed and rendered with our view-guided two-level Nanomatrix approach. To the right of the cell there is a model of SARS-CoV-2. The rendering exploits hardware ray tracing, maintaining highly interactive framerates.
Current technologies are able to render biological structures with up to billions of atoms. Our goal is to push this limit and effectively visualize models that contain trillions of atoms. Out-of-core approaches usually render massive scenes by keeping a fraction of the data in core memory and loading data from disk when needed. Instead, we propose a view-guided procedural approach that generates the scene on demand, on the fly. The enormous scene is never completely stored in memory, only the fraction of it that is close to the viewer. We use a regular grid that uniformly divides the scene into cells. Only cells that are close to the viewer are populated with atomistic/nanoscale geometry, while the remaining cells use an image-based approach to depict the detail, which provides the user with a cellular/microscale depiction of the far structures. The main contributions include:
• For the nanoscale level, we present new algorithms for the rapid construction of biological structures which can be directly applied to triangular meshes with arbitrary face sizes.
• For the cellular level, we propose image-based Wang tiles, derived from geometric Wang tiles, to represent structures that are far away, using image-to-geometry alignment.
• We propose an algorithm for dynamically managing the memory to switch between the atomistic and cellular representations.
• We propose a parallel rendering scheme that utilizes hardware-accelerated ray tracing for molecular visualization.
• We demonstrate and evaluate the performance and capabilities of the Nanomatrix framework compared to existing visualization methods.
In comparison to previous work, the construction and visualization of our approach are scalable toward larger structures. Previous work [2,3] uses specific packing algorithms that require the scene to be completely constructed before it can be visualized, while in our method the scene elements are constructed and placed on demand in real time. The visualization follows the same principle and offers geometric detail on demand for closer features and abstraction, in the form of textures, for distant structures. These new features allow users, for the first time, to interactively generate and visualize biological mesoscale landscapes of the size of a red blood cell in atomistic detail. Our view-guided construction approach facilitates the interactive exploration of large molecular landscapes. We expect our approach to be useful for communicating scientific discoveries and disseminating scientific knowledge to a broad audience. We see its applicability in setups like science centers, where the viewer is taken on an interactive exploration of the molecular universe. Our approach is primarily targeted at biological systems, but it is applicable to any system that exhibits a multiscale, multi-instance, dense 3D nature.

Related Work
Rendering scenes that exceed the available memory is traditionally handled with out-of-core techniques [4,5] that store the scene in slow external memory and transmit only the currently required data into fast internal memory. The input/output communication between the internal and external memory is the bottleneck of such out-of-core approaches. In this paper, we use an exclusively in-core approach and trade memory for computation. Instead of fetching the data from external memory, we construct it on the fly when it is needed. On-the-fly construction requires very efficient techniques to compute and visualize procedurally generated scenes in real time.
In this section, we review related work from procedural modeling, texture synthesis, molecular visualization, and parallel rendering.
Procedural Modeling: Generating 3D digital content is a very time-consuming and tedious task, while many environments contain self-similar and repetitive structures [6].
Procedural modeling techniques offer a way to create 3D models from sets of rules or algorithms. Very early approaches like L-systems use formal grammars to describe natural patterns as they appear in trees, for instance. Other phenomena like fire, water, gases, and clouds have also been procedurally generated for decades. In recent times, computer graphics has experienced significant advances in the procedural modeling of natural as well as man-made structures. Computer games and movies utilize procedural modeling techniques to create large worlds with varying shapes and styles. A fundamental part of such worlds is the automatic modeling of architecture [7] with different building designs. Such techniques are capable of generating infinite cities [8] or rich forest scenes [9] in real time. In order to generate detailed content interactively, the research focus has shifted towards parallelization [10,11], such that computation and memory can be efficiently mapped onto graphics hardware. While most procedural modeling techniques target open worlds, in this work we focus on a very constrained space where the content is defined through scientific measurements. Procedural modeling in a mesoscale environment is a challenging task. This environment is highly dense, with molecular structures that are heterogeneous in size and shape. There are several modeling techniques that target molecular landscapes, one of which is CellPaint [12]. CellPaint allows users to create dynamic molecular illustrations in a 2D style, showing a cross-section of a biological mesoscale scene. MesoCraft [13] allows the user to interactively specify a set of rules that define the spatial relations of the model's molecules and propagate these rules through the model, which results in rapid modeling. MesoCraft and CellPaint are semi-automatic modeling techniques where the user is expected to participate in the building process of the biological structure, for example, by changing the position of a molecule or rotating it. In contrast, our construction approach is fully automatic. Once the user provides the algorithm with the required input, the model is generated without any user interaction.
Atomistic modeling in mesoscale environments is typically based on packing algorithms, as demonstrated with cellPACK [3]. Packing molecules is a computationally demanding task. Assembling a 3D mesoscale model from scratch via cellPACK can take from minutes to hours, depending on the size and complexity of the model. Klein et al. [2,14] propose the instant construction approach to rapidly create mesoscale models. It consists of a set of GPU-based population algorithms which generate different types of biological structures. The approach is limited to structures that fit into GPU memory and focuses on experimenting with different versions and ensembles of mesoscale structures. Our view-guided construction approach overcomes this limitation by populating structures on the fly once they become visible. This opens the door to visualizing larger structures like a whole RBC or even infinite molecular landscapes. Inspired by the work of Klein et al. [2], we use the Wang tiles concept to populate the biological structures.
While their tile mapping approach is limited to equally-sized quad-based meshes, we propose a new algorithm that can map tiles onto a triangle mesh with varying face sizes. Moreover, we propose a novel algorithm for constructing the inner part of a biological compartment by filling the space with collision-free 3D tiles, which eliminates the need for real-time collision handling.
Texture Synthesis: Our approach of generating patches of geometry is closely related to the synthesis of textures, especially in the context of Wang tiles. Originally, the Wang tiling concept [15] was proposed as a formal system to cover an infinite plane with non-periodic patterns from a small set of tiles. The concept is based on tiles with color-encoded edges, where each tile is arranged in a way that its edge color matches the adjacent neighbor. Later adaptations of this approach have been used to map 2D textures onto 3D geometry. Fu and Leung [16] have extended the concept to apply Wang tiles to arbitrary topological surfaces. Wei [17] avoids storing large textures on graphics processors by presenting a Wang-tile-based texture mapping algorithm that generates large virtual textures directly on the GPU. Culík and Kari [18] introduce Wang cubes with color-encoded faces, a generalization of Wang tiles to 3D. Doškář et al. [19] use Wang cubes to generate compressed representations of complex microstructural geometries. Considering that such approaches are constrained to mapping color information to tiles, in this work we map geometry to tiles. The major challenge in the application of Wang tiles lies in the generation of the tiles, as well as in handling the issue of overlapping geometry in the 3D case. Comparable approaches with geometry lack performance or use regular patterns [20].

Large-scale Molecular Visualization:
There is a variety of tools for molecular visualization, such as VMD [21] or PyMOL [22]. These tools have been designed for molecules with up to thousands of atoms and are not suited when the data size exceeds tens of millions of atoms [23].
Similarly, generic visualization tools like Amira [24] prioritize generality over scalability and thus reach their limits with increasing dataset sizes. Waltemate et al. [25] map lipid bilayers onto a mesh geometry with interactive rendering performance for up to 10 million atoms on an NVIDIA GeForce GTX 770. Mol* Viewer [26] presents a powerful web-based application for visualizing molecular data. It is able to smoothly render an HIV model that contains 67 million atoms using an NVIDIA RTX 2060 graphics card. MegaMol [27,28] is a visualization framework designed to address interactive visualization of large particle-based datasets. The system is able to render up to 100 million atoms at interactive framerates, which at biological scale corresponds to a virus or a small bacterium.
Recently, Ibrahim et al. [29] introduce a probabilistic occlusion culling architecture using meshlets for acceleration.
The algorithm was able to render 232 million particles at 41 frames per second (FPS) on an NVIDIA RTX Titan with 24 GB of memory. Lindow et al. [30] were the first to present interactive visualization of large-scale biological data consisting of several billions of atoms. As biological models often consist of a large number of recurring instances of a few proteins, they create for each protein a 3D grid structure containing all its atoms, store the grid on the GPU as a 3D volume, and then utilize instancing to repeat the proteins in the scene. In the rendering stage, they draw the instances' volumes and perform ray casting in the fragment shader. Their method was able to render 4,025 microtubules with approximately 10 billion atoms at a minimum of 3 FPS on an NVIDIA GTX 285. Falk et al. [31] extend this work by optimizing the depth culling and the rendering method, and they add an implicit level-of-detail (LoD) approach. They were able to render 25 billion atoms at 3.6 FPS using an NVIDIA GTX 580.
Le Muzic et al. [32] introduced an optimized approach using a straightforward LoD scheme that does not require grid-based supporting structures. Instead, it relies on the tessellation shader to dynamically inject sphere primitives into the rasterization pipeline for each molecular instance. Later, they extended their work and introduced cellVIEW [33], where they reduce the number of sphere primitives injected into the rendering pipeline. The cellVIEW system has set a benchmark with its ability to render 250 copies of an HIV virus model in blood plasma at 60 FPS on an NVIDIA GTX Titan. This scene contains 16 billion atoms (each replica contains around 64 million atoms). The goal of our proposed approach is to push this limit and visualize large biological models that contain trillions of atoms. All of the above-mentioned techniques use procedural impostors for representing atoms to simplify the geometry and accelerate the rendering [34,35].
The evolution in computational capabilities has reshaped the traditional perception of ray tracing, transforming it into a viable alternative to GPU rasterization approaches as the default method for visualization. The availability of highly optimized and extensible ray tracing engines, like Intel OSPRay [36] and GPU-accelerated ray tracing, has significantly influenced recent research. Many visualization tools, like ParaView, VMD, VisIt, and MegaMol, have adopted Intel OSPRay as a rendering backend. OSPRay [36] is a scalable, CPU-based ray tracing rendering library built on Embree [37] and can visualize massive data as long as it fits into the available CPU memory. Recently, MegaMol implemented RTXPkD [38], a GPU-based particle k-d tree (PkD) method using the OptiX GPU ray tracing framework. RTXPkD was able to render a dataset containing 1.56 billion spheres at 75 FPS on an NVIDIA Quadro RTX 8000 with 48 GB of VRAM. Existing approaches can only visualize models that fit into CPU and/or GPU memory; we overcome this limitation by interweaving the construction and the rendering. We propose a view-guided scene construction technique that constructs the fraction of the scene visible from a given viewpoint so that it fits into the dedicated video memory. We adopt an RTX ray-tracing pipeline, which results in a rendering performance speed-up of 2-3 orders of magnitude compared to previous rasterization approaches.
Parallel Rendering: Parallel rendering aims to improve the frame rate by dividing the workload between multiple renderers. Generally, two of the three rendering architectures classified by Molnar et al. [39] have received the most attention. The first category, so-called sort-first, involves dividing the screen into disjoint regions, where each renderer is responsible for all computations related to its region. Each renderer accesses its own copy of the scene, making memory management inefficient. The other rendering architecture category, so-called sort-last, divides the scene's primitives rather than the screen. Renderers compute a full-screen image of their portion of the primitives and then submit these pixels to compositing processors that sort the rendered images in visibility order to produce the final image. This method can sustain a very high data rate, as the renderers operate independently until the compositing step. The sort-last strategy is suitable for applications that require interactive, high-quality rendering and efficient memory management. For more details, we refer the reader to [39][40][41]. Our rendering algorithm uses the sort-last approach. Zellmann et al. [42] investigated how data can be efficiently partitioned across parallel GPUs using a ray tracing-based renderer. They define the properties of good primitive partitioning: first, even data distribution, to make good use of the GPU memory and compute resources; second, even spatial distribution, because irregular spatial distribution may cause load imbalance and subsequently inadequate rendering performance; and third, minimal object overlap and replication, because in both of these cases the ray tracer would intersect the primitive more than once. Our method uses a spatial partitioning scheme that distributes space equally among cells. Each primitive in the scene belongs to only one cell, so no primitive is replicated. Parallel rendering usually utilizes multiple GPUs to accelerate the rendering; however, our rendering algorithm is executed on a single GPU that runs multiple rendering threads in parallel through a compute shader.

Technical Overview
[Figure caption fragment: In real-time rendering, these tiles populate the scene with the appropriate level of detail depending on the distance from the camera. The scene is then rendered using RTX ray tracing so that the population and rendering fully utilize separate computational units on the graphics hardware.]

None of the currently available visualization methods is able to visualize large macro-molecular structures, such as the red blood cell (RBC) in full atomistic detail, or multiple instances of viral and bacterial ultrastructures. These structures are simply too large to fit into CPU and/or GPU memory along with the associated acceleration structures. At the same time, at any moment, only a small fraction of such a huge dataset can be seen. Therefore, a natural memory-saving and acceleration strategy is to store only those parts of the model in memory that are visible from the current viewpoint. One solution is to use an out-of-core approach and stream the model from disk to core memory. Instead, we use an exclusively in-core approach and generate the model on the fly, where the camera triggers the generation process.
Our approach operates under the following condition: biological structures that are far away are enclosed by their molecular envelope, such as a lipid bilayer. Far-away structures are never cut open or clipped in the middle; this happens only when the camera is close to a particular structure, and only then can the envelope open to reveal the atomistic detail inside.
Before we dive into the technical detail, we clarify the terminology of the individual components involved in our approach (Figure 2). Our Nanomatrix approach generates geometry at atomistic detail for the potentially visible portions of the view. By potentially visible we mean structures that are either (1) close to the camera, so that the atomistic detail becomes discernible, or (2) in clear view, unobstructed by other densely packed molecules. Our approach also generates an image-texture representation for those structures that are potentially visible but so far away from the camera that their atomistic detail is no longer perceivable. To realize this, four elements are expected as input. First, the structures of all types of molecules that will be populated in the scene, stored in PDB files. These files are available at the Protein Data Bank (wwpdb.org). The second input type is the 3D patch, which is a small, collision-free 3D biological model constructed based on biological rules [13] that define the concentration of various molecules in the patch and the principles that characterize their spatial relations. These can be created in multiple ways, e.g., using the MesoCraft tool [13] or some other alternative. Our approach uses two types of patches, the rectangle-based patch and the box-based patch, both illustrated in Figure 3. The third input is the 3D mesh, which defines the geometry of a given biological compartment. The scene may contain several copies of various biological structures. The information needed to instantiate the given meshes is given in the scene skeleton, which is defined as an external file.
Once the input files are given, the pre-processing phase starts preparing the necessary components for the construction and rendering phases. As we aim to visualize models that do not fit into memory, we need to partition the space and keep only potentially visible parts of the scene in memory. The scene partitioning divides the 3D space into small, non-overlapping cells of identical but customizable size along each axis, which all together form the regular scene grid. Unless explicitly stated otherwise in the remainder of the paper, we use the term cell for an element of the scene grid. Our visualization is concerned with biological cells as well, but whenever the term occurs in that context, it will be explicitly denoted as a biological entity. Also, a cell is an element of the global scene partitioning, in contrast to the box or rectangle tiles, which are elements forming the detail of a particular biological entity. Each cell is associated with an index which defines the cell location (i, j, k) within the grid. At any moment, only a small group of cells is populated. We call this group of cells the active cells. To identify them, the camera's location within the grid is obtained first, which determines the central cell. This cell, together with its neighboring cells, represents the active cells. The number of active cells depends on the size of the activation window, which indicates how many of the central cell's neighbors should be considered active. Each active cell points to a cell cache, which is a GPU storage buffer that is readily available to be filled with geometric instances forming the structural information of a biological entity.
Once a cell is selected to be active, Nanomatrix generates the cell contents on the fly. The proposed construction approach uses the Wang tiles concept [43] for populating the cells with geometries from a set of geometric tiles. While the original Wang tiles algorithm covers an infinite plane with a virtual texture generated from a small set of image tiles, we introduce a novel concept of aperiodic geometric tiles (a geometric texture), denoted as Geometric Wang tiles or GW-tiles for short. While an image tile consists of an array of pixels, a geometric tile consists of an array of instances, each associated with a transformation matrix. Every Wang tile is defined as a square with color-encoded edges. These colors restrict how the tiles can be placed during the tiling process to form a seamless tiling. In the pre-processing step, a description of such an arrangement is created and denoted as the seamless tiling recipe (TR). This recipe is described as a 2D table of pointers that refer to one of the generated GW-tiles.
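To make the data layout concrete, the following is a minimal C++ sketch of how a GW-tile and the tiling recipe could be stored; the struct and field names are illustrative assumptions, not the actual Nanomatrix implementation.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One molecular instance inside a geometric Wang tile (GW-tile):
// a reference to a molecule type plus its placement within the tile.
struct TileInstance {
    uint32_t moleculeId;              // index into the list of PDB-derived molecule models
    std::array<float, 3> position;    // position in tile coordinates, each component in [0, 1]
    std::array<float, 4> rotation;    // orientation as a quaternion
};

// A GW-tile is simply an array of instances; the edge colors constrain
// which tiles may be placed next to each other during tiling.
struct GWTile {
    std::array<uint8_t, 4> edgeColors; // north, east, south, west
    std::vector<TileInstance> instances;
};

// The seamless tiling recipe (TR): a 2D table of indices into the tile set,
// precomputed so that horizontally/vertically adjacent entries match in edge color.
struct TilingRecipe {
    uint32_t width = 0, height = 0;
    std::vector<uint8_t> tileIndex;   // row-major, values in [0, 15] for a 16-tile set
    uint8_t at(uint32_t x, uint32_t y) const { return tileIndex[y * width + x]; }
};

int main() { return 0; }  // data layout only; the population steps are described later in the text
```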
For biological structures that are far away from the camera and cannot be observed closely, the detail is represented by an image texture instead of a geometric texture. Therefore, once a GW-tile is obtained, its geometry is used to generate the packed texture map, in which the Wang tiles' textures are packed into a single texture map. At run time, for each texture request, we first determine which tile the fragment is located in, based on its uv texture coordinates and the tiling recipe. We then compute the relative offset of uv within that tile and fetch the corresponding texel from the packed texture map. The resulting texture rendering is composed of diffuse, normal, depth, and ambient-occlusion textures, which can be used in the deferred shading step when mapped onto a mesh.
During rendering, when the camera moves to a new location, the cell cache manager updates the active cells if necessary, as well as the pointers to the cell caches, and then submits the cells that need to be generated to the atomistic model construction. This procedure populates the proteins in the cells using GW-tiles and the seamless tiling recipe. The constructed scene is rendered in two passes. In the first pass, each active cell computes a full-screen image of its portion of the atomistic models using the hardware-accelerated ray tracing available on GPUs. The resulting images are then composited in the second rendering pass to form the final rendered image. During the compositing pass, if a ray hits an atom in one of the active cells' image buffers, the closest hit is used to assign the final color and the ambient occlusion value is computed. Otherwise, the packed texture map is used to texture the meshes in the scene.

Pre-processing Phase
In the pre-processing step, before rendering is started, we partition the scene and fill it with mesh instances. We also prepare the geometry for all meshes and prepare and upload the buffer data to the GPU.

Virtual Scene Partitioning
The entire scene is partitioned into several cells which are filled with structures on demand during real-time rendering. These cells are organized in a grid that covers the entire scene. To create the grid, the first task is to define the axis-aligned bounding box (AABB) that tightly encloses the object distribution in the scene. These objects represent biological structures, such as biological cells, viral particles, bacteria, or organelles, which come in different sizes and shapes. We use 3D meshes to define the boundaries that separate the internal parts of these structures from the outside environment. As the scene may consist of several copies of the same structure, the scene skeleton file contains the information needed to instantiate the given meshes in the scene. This file contains a list of mesh instances, and for each instance the mesh id, position, and rotation are provided. The meshes and the scene skeleton are together used to estimate the scene grid's AABB.
Since we are targeting the rendering of large scenes, partitioning the space into cells will yield a large number of cells and correspondingly large memory requirements. We avoid that by using a uniform space-subdivision scheme; for more detail, see the supplementary material. Two terms are used repeatedly in the following: an entry of the tiling recipe at a given 2D index is a pointer to a tile, and the replication area is a 2D/3D array of tiles that is used to cover an entire triangle/cell. These tiles are selected based on TR in the case of triangle tiling, while they are selected randomly in the case of cell tiling. We use the superscripts R and B to differentiate between the replication areas of rectangular tiles and box tiles, respectively.

Tiles Preparation
Our strategy for filling a certain 3D volume with particular molecular instances is to fill it with a limited, collision-free set of 3D tiles that are prepared and stored during the pre-processing step, intended for use during real-time rendering. With such a construction approach, we avoid the computational load needed for populating huge numbers of elements on the fly. Moreover, this construction minimizes collisions among the populated elements.

3D Patches Generation
First, a 3D patch populated with molecules is generated.
For this purpose, we use the MesoCraft tool, which was designed for generating biological assemblies based on simple geometric rules that define relations between elements, which in our case are molecular structures [13]. A description of the rules is out of the scope of this paper; however, it is important to mention that MesoCraft integrates a collision-handling algorithm. Therefore, the resulting patch is collision-free. We work with two different patches: rectangle-based and box-based. A rectangle-based patch is formed by molecular geometry populated on a rectangle. The molecular distribution is defined with reference to a plane and is later used for populating a membrane. The second type, a box-based patch, is formed by populating molecular geometry inside a box (a distribution per unit of volume) and is typically used for populating soluble proteins inside structures. We create box patches with dimensions of 100 × 100 × 100 nm, filled with approximately 1,000 protein structures. A comparison of both patch types can be seen in Figure 3.

Generating Rectangular Wang Tiles
Placing these patches directly next to each other to populate the mesh would potentially create overlaps at the edges of adjacent patches. LipidWrapper [44] deletes overlapping molecular instances and fills the resulting holes with new instances. This is an expensive process which makes it unsuitable for interactive scene population. Klein et al. [2] avoid that by applying the Wang tiling concept. Later, in the population phase, if two Wang tiles are laid next to each other so that the triangular sub-patches on both sides of the shared edge are from the same base patch (i.e., are associated with the same color), they do not create a seam and there is no need to resolve collisions. We work with the basic 16 color configurations for Wang tiles (shown in detail in Figure 5). However, more configurations can be taken into account. In general, if there are |K| colors, then there are 2 × |K|² color combinations of tiles [45]. However, a smaller number of tiles can still be used to create an aperiodic tile set [46]. The more configurations are used, the greater the diversity of the generated result.
Rectangular Wang Tile Mapping: After the geometric rectangular Wang tile set R is created, we need to define a way of mapping a molecular instance m ∈ r, where r ∈ R, from the tile onto the mesh. To achieve that, we first need to identify the location of that instance within the tile, m_µγ. The µγ-coordinate lies within the [0, 1] space and we call it the tile coordinates. It can be computed by applying a min-max normalization that rescales the instance position from world coordinates (m_xyz) into [0, 1].
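Spelled out, the min-max normalization implied above could take the following form; this is a sketch in which the tile's bounding extents r.min and r.max are assumed notation, and the choice of the in-plane x/y components follows from the tiles being generated facing the z-axis (as stated later in the membrane-population description):

```latex
\[
m_{\mu\gamma} \;=\;
\left(
  \frac{m_{x} - r.\mathrm{min}_{x}}{r.\mathrm{max}_{x} - r.\mathrm{min}_{x}},\;
  \frac{m_{y} - r.\mathrm{min}_{y}}{r.\mathrm{max}_{y} - r.\mathrm{min}_{y}}
\right) \;\in\; [0,1]^{2}
\]
```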
Klein et al. [2] propose a per-face tile mapping approach. They use quadrilateral coordinates to map the instance m from the Wang tile to a mesh quad. Therefore, their method requires the mesh to be defined as an equally-sized quad-based surface, which is a significant constraint as it can be difficult to create such a mesh for an arbitrary shape. In this work, we propose to instead use triangular mesh texture coordinates for the mapping, which enables us to map the Wang tiles directly onto triangle meshes with varying triangle sizes.
Our method expects a triangular mesh as input, where every triangle is associated with texture coordinates. We expect that the mesh already contains a texture parameterization and that the mesh uv-texture coordinates are within the [0, 1] range. The algorithm uses the texture coordinates for transforming molecular elements m from tile coordinates to mesh texture coordinates and vice versa. If the mesh does not have a texture parameterization, a simple cube-map or spherical texture parameterization can be applied. However, depending on the shape of the mesh, it might be non-trivial to create a fully seamless texturing; a seam in the texture parameterization would result in a visible seam. The scene can contain multiple instances of different meshes. For simplicity, in the rest of the paper, we refer to a single mesh. However, the method works in the same way for every mesh that has a texture parameterization.
As we aim to run the population phase in parallel, we use the largest triangle t_big of the mesh T to define the mapping from the mesh texture coordinate system into the tile coordinate system. This mapping is essential for thread-count estimation, as explained later in section 6. All the tiles in R are of the same size. We compute the ratio between the size of t_big and one representative tile r.
This ratio represents the number of tiles needed to cover the entire area of triangle t_big with a sequence of tiles in its plane, and it defines the dimensions of the Replication Area (rep^R.dim). Moreover, as the mesh is texture parameterized, the size of a single tile in the texture coordinate space associated with the mesh, r.size_uv, is computed by dividing the size of the triangle t_big in texture coordinates by the ratio rep^R.dim.
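The two quantities can be written out as follows; this is a sketch based on the verbal definitions above, and the ceiling as well as the exact symbols are assumptions rather than the paper's original notation:

```latex
\[
\mathit{rep}^{R}\!.\mathrm{dim} \;=\;
  \left\lceil \frac{t_{\mathit{big}}.\mathrm{size}_{xyz}}{r.\mathrm{size}_{xyz}} \right\rceil,
\qquad
r.\mathrm{size}_{uv} \;=\; \frac{t_{\mathit{big}}.\mathrm{size}_{uv}}{\mathit{rep}^{R}\!.\mathrm{dim}}
\]
```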
For any triangle in the mesh, we can now cover the triangle with Wang tiles such that tiles with the same edge color are placed side by side to create a seamless tiling within that triangle. To populate all triangles in parallel, we need to know in advance which part of the mesh is covered by which Wang tile. Therefore, we generate a tiling recipe (TR) from the tile set R that is associated with the mesh.
A tiling recipe is a 2D array that contains indices of Wang tiles from R (in our case values from the range [0..15], as we work with 16 tiles) and serves as a lookup table when populating the respective part of the mesh during real-time rendering. The tiling recipe is designed to be large enough to cover the full texture coordinate space [0, 1].
Its size can be computed as the inverse of r.size_uv. Once the size of the tiling recipe TR, which is a 2D array covering the entire mesh, is defined, it is filled with indices of tiles from R by the Wang tiling generator. This structure is prepared for later sampling to determine the tile at an arbitrary texture coordinate that belongs to the mesh. By dividing a uv mesh texture coordinate by r.size_uv, we get a two-dimensional index into TR. The other way around, if we multiply a TR index by r.size_uv, we get the position in the mesh texture.
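As a concrete illustration of this two-way mapping, here is a small C++ sketch; the function and variable names are hypothetical, and the recipe is assumed to be stored row-major as in the structure sketched earlier:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct TilingRecipe {
    uint32_t width = 0, height = 0;
    std::vector<uint8_t> tileIndex;                       // row-major tile indices in [0, 15]
    uint8_t at(uint32_t x, uint32_t y) const { return tileIndex[y * width + x]; }
};

struct Vec2 { float u, v; };

// Mesh uv -> 2D index into the tiling recipe (which tile covers this uv).
inline void uvToRecipeIndex(Vec2 uv, float tileSizeUV, uint32_t& x, uint32_t& y) {
    x = static_cast<uint32_t>(std::floor(uv.u / tileSizeUV));
    y = static_cast<uint32_t>(std::floor(uv.v / tileSizeUV));
}

// 2D recipe index -> uv position of that tile's lower-left corner in the mesh texture.
inline Vec2 recipeIndexToUV(uint32_t x, uint32_t y, float tileSizeUV) {
    return { x * tileSizeUV, y * tileSizeUV };
}

int main() {
    TilingRecipe tr{8, 8, std::vector<uint8_t>(64, 0)};   // toy 8x8 recipe
    float tileSizeUV = 1.0f / 8.0f;                       // inverse of the recipe size
    uint32_t x, y;
    uvToRecipeIndex({0.40f, 0.70f}, tileSizeUV, x, y);    // -> (3, 5)
    Vec2 corner = recipeIndexToUV(x, y, tileSizeUV);      // -> (0.375, 0.625)
    (void)tr; (void)corner;
    return 0;
}
```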

Generating Box Tiles
Biological models are typically packed with proteins, nucleic acids, and other molecules. Soluble components fill the inner part of the biological compartments. The population task is to distribute the soluble ingredients spatially while avoiding overlaps. cellPACK [3] positions the soluble ingredients sequentially, one by one, while avoiding collisions, which is an expensive process that makes it unsuitable for interactive scene population. Instant construction [2] uses parallel processing to increase the performance through consecutive steps, where the space is first filled with instances by parallel threads while ignoring the overlaps. Then the overlaps are detected in parallel and resolved. Due to the density of biological models, the collision-resolving process may not converge and has to be terminated after a certain number of iterations. Instead, we propose to populate the soluble components by filling the space in parallel with collision-free 3D tiles, which eliminates the need for real-time collision handling.
The Wang cube [18] is an obvious generalization of Wang tiles to three dimensions. A Wang cube is defined as a cube with colored faces. Although it would be possible to generate Wang cube tiles from the box-based patch and later place them in space based on Wang's tiling concept to create a seamless 3D geometric texture, we use the box patch directly as a box tile to populate the scene space. We do not use a set of Wang cubes because we do not see a necessity for that in our target scenes. After replicating several molecular instances next to each other, our experiments show that the seams resulting from periodic tiling are not noticeable for box tiling. The reason is that this tiling is only used for filling the internal part of a biological structure. After the camera penetrates the structure, due to the densely populated environment, the user is immersed among multiple structures, which leaves a limited possibility of identifying the seams. Unlike rectangle-based patches, which form membranes that the user can see from the outside, box-based patches are not visible from outside, only when penetrating the biological structure with the camera. Therefore, we do not additionally process the box-based patches and use them directly to tile the 3D space.
Box Tile Mapping: We need to define a way of mapping a molecular instance m ∈ b, where b ∈ B, from a box tile into the space. For that, we need to identify the location of that instance within the box tile, m_µγλ. Again, the tile coordinate can be computed by applying a min-max normalization that rescales the instance position from world coordinates (m_xyz) into [0, 1]. To fill a cell with box tiles, we need to compute the ratio between the size of the grid cells, c.size_xyz, and the box tile size. As all the box tiles in B are of the same size, only one representative tile b is used to compute this ratio, which represents the number of box tiles needed to cover the entire volume of the cell.

Packed Texture Map Generation
Scene structures can be viewed from a large distance, where the atomistic detail would result in seeing mere variations of colors on a mesh surface instead of any recognizable detailed geometry. Therefore, when the biological entity is far away, it is covered with an image texture instead of a GW-tile. However, we need to maintain the correspondence between GW-tiles and the texture map. When zooming in, the rendering algorithm combines the texture mapping with atomistic detail and blends between them.
To reduce the additional handling overhead during rendering, we create a single texture that contains all the tiles. Importantly, we need to correctly sample the values at the border of the tiles, which implies a special construction of the texture map.
We create a single texture map directly out of the GW-tiles. The texture map is not used as a continuous texture; it rather contains 16 texture tiles packed for sampling, with space between these tiles to ensure correct sampling at the border of each tile. We render a geometry patch from the top view along the y-axis such that the patch is aligned with the xz-plane. This patch consists of 9 × 9 GW-tiles overall, placed seamlessly next to each other. However, for sampling the texture map later in the real-time rendering stage, we only need the 16 tiles depicted in Figure 6. The texture map is generated as follows: all 16 GW-tiles are positioned facing the y-axis in four rows and four columns, with a gap of the size of one tile between the tiles in both the x and z directions. Then, those gaps are filled, in the first iteration in the x direction and in the second iteration in the z direction, to complete a seamless Wang tile pattern. In the last iteration, a set of tiles is placed to create a border enclosing the previously placed tiles. Such a structure allows us to easily identify the coordinates of the individual 16 base tiles, which we use for sampling, based on their indices. For example, if the resulting texture map is rendered at a resolution of 1800 × 1800 pixels (meaning one tile is 200 × 200 pixels, as this is a 9 × 9 grid of tiles), the square in the texture (see Figure 6) representing the tile with index 6 is at the position [600, 1000, 200, 200] px (top, left, width, height).
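Under the layout just described (16 base tiles in a 4 × 4 arrangement occupying every other slot of the 9 × 9 tile grid), the pixel rectangle of a base tile can be derived from its index alone. The following C++ sketch reproduces the example from the text; the function name and the row-major index order are assumptions:

```cpp
#include <cstdio>

struct TileRect { int top, left, width, height; };

// Pixel rectangle of base tile `index` (0..15) in the packed texture map.
// Base tiles sit at odd grid slots (1, 3, 5, 7) of the 9x9 tile grid,
// i.e. row/column 2*k + 1, with the remaining slots acting as filler/border.
TileRect baseTileRect(int index, int mapResolution) {
    int tileSize = mapResolution / 9;          // e.g. 1800 / 9 = 200 px
    int row = index / 4;                       // 4 x 4 arrangement, row-major
    int col = index % 4;
    return { (2 * row + 1) * tileSize,         // top
             (2 * col + 1) * tileSize,         // left
             tileSize, tileSize };
}

int main() {
    TileRect r = baseTileRect(6, 1800);        // tile 6 of a 1800x1800 map
    std::printf("[%d, %d, %d, %d] px\n", r.top, r.left, r.width, r.height);
    // prints [600, 1000, 200, 200], matching the example in the text
    return 0;
}
```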

Cell Cache Management
Only a portion of the scene's geometric content is available in memory at any given time during the real-time rendering stage. We achieve this using uniform space partitioning into cells. Next, we need to identify which of the scene's cells should be visualized and stored in memory. We denote these cells as the active cells.
In our viewpoint-guided approach, the camera position is used to identify which cells should be active. Therefore, we first define the central active cell, which is the cell enclosing the camera. Its (i, j, k) index within the scene grid G is calculated from the camera viewpoint V_xyz by transforming the camera position into grid coordinates and taking the integer part. After the central cell is identified, the neighboring cells are obtained. Thanks to the regularity of the grid, the adjacent cells of the central cell are easy to locate. Selecting which neighboring cells should be activated could be done based on the camera position alone, activating the closest neighbor cells of the central cell. Another approach is to select them based on both the camera position and direction, activating the closest neighbor cells of the central cell that intersect the view frustum. For simplicity, let us assume that we choose the first approach and activate only the first closest neighbors of the central cell along each axis i, j, and k, so that the size of our activation window is (3 × 3 × 3), which gives us 27 active cells C. However, based on the computational resources and the chosen cell size, the size of the activation window can be set to a larger number.
Increasing the size will increase the rendering overhead, because each cell is drawn in a separate draw call; thus, a larger number of images will need to be rendered for the final scene compositing.
The size of the activation window specifies the number of cells that will be populated and rendered. Subsequently, it specifies the number of cell cache buffers that need to be prepared. The cell cache is a GPU storage buffer that is readily available for an active cell to be filled with populated instances. This memory buffer is pre-allocated to fit a relatively large number of instances. In the pre-processing step, we allocate 27 storage buffers that represent the cell cache. As our scene is continuously regenerated, we choose to allocate the cell caches in advance and just fill and clear them in real time to avoid the overhead that comes from frequent memory allocation and deallocation.
The cache manager controls the process of reusing deactivated cells by updating the pointers between the cell cache buffers and the active cells, and it triggers an event that leads to the regeneration of the scene inside newly active cells. Figure 7 shows an example that illustrates this algorithm on a 2D grid. In this 2D example, the scene's grid contains 16 cells and the size of the activation window is 3 × 3.
In Figure 7 (a), the camera is located in cell 4, which becomes the central cell, and cells 0, 1, 4, 5, 8, and 9 are the neighbors inside the activation window. These are the active cells and should be populated. Every active cell should occupy a cell cache to fill it later with molecular instances in the construction stage. Once a cell becomes active, an unoccupied cell cache is reserved for this cell and a pointer is created to link them. The cell is then added to the list of cells that will be submitted to the construction stage. If the camera moves to cell 9, as shown in Figure 7 (b), cells 6, 10, 12, 13, and 14 enter the activation window and need to be constructed, while cells 0 and 1 leave the activation window; therefore, their pointers to the cell cache are deleted. This makes these cache buffers available for other cells. Cells 4, 5, 8, and 9 were populated previously, and they still point to the same cell cache buffers as before. These pointer operations are important to avoid copying between buffers. In other words, if a cache has been assigned to a cell, it remains reserved for that cell as long as the cell is inside the activation window.
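The following C++ sketch illustrates the cache-management logic described above on the CPU side; all names (CellKey, CacheManager, the grid-index calculation) are illustrative assumptions, not the actual Nanomatrix code.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <tuple>
#include <vector>

struct Vec3 { float x, y, z; };
using CellKey = std::tuple<int, int, int>;   // (i, j, k) index in the scene grid

struct Grid {
    Vec3 minCorner;      // AABB minimum of the scene grid
    float cellSize;      // edge length of a cubic cell

    // Cell enclosing a world-space point (e.g. the camera viewpoint).
    CellKey cellOf(Vec3 p) const {
        return { (int)std::floor((p.x - minCorner.x) / cellSize),
                 (int)std::floor((p.y - minCorner.y) / cellSize),
                 (int)std::floor((p.z - minCorner.z) / cellSize) };
    }
};

class CacheManager {
public:
    explicit CacheManager(int numCaches) {
        for (int c = 0; c < numCaches; ++c) freeCaches.push_back(c);   // e.g. 27 pre-allocated buffers
    }

    // Returns the cells that newly entered the activation window and must be constructed.
    std::vector<CellKey> update(const Grid& grid, Vec3 camera, int halfWindow = 1) {
        auto [ci, cj, ck] = grid.cellOf(camera);
        std::set<CellKey> window;
        for (int i = -halfWindow; i <= halfWindow; ++i)
            for (int j = -halfWindow; j <= halfWindow; ++j)
                for (int k = -halfWindow; k <= halfWindow; ++k)
                    window.insert({ ci + i, cj + j, ck + k });

        // Release caches of cells that left the window (pointer updates only, no buffer copies).
        for (auto it = cacheOfCell.begin(); it != cacheOfCell.end(); )
            if (!window.count(it->first)) { freeCaches.push_back(it->second); it = cacheOfCell.erase(it); }
            else ++it;

        // Assign free caches to newly activated cells and schedule them for construction.
        std::vector<CellKey> toConstruct;
        for (const CellKey& cell : window)
            if (!cacheOfCell.count(cell) && !freeCaches.empty()) {
                cacheOfCell[cell] = freeCaches.back();
                freeCaches.pop_back();
                toConstruct.push_back(cell);
            }
        return toConstruct;
    }

private:
    std::map<CellKey, int> cacheOfCell;   // active cell -> cell-cache buffer id
    std::vector<int> freeCaches;          // unoccupied cell-cache buffer ids
};

int main() {
    Grid grid{ {0.f, 0.f, 0.f}, 100.f };
    CacheManager mgr(27);
    auto newCells = mgr.update(grid, {150.f, 50.f, 50.f});   // first frame: whole window is new
    auto delta    = mgr.update(grid, {250.f, 50.f, 50.f});   // after moving: only newly entered cells
    (void)newCells; (void)delta;
    return 0;
}
```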
This cache management approach is applicable to any activation window layout and size.In our implementation, we have tested different possible layouts/sizes illustrated in Figure 8.

Occlusion Management
Biological models are densely packed with molecules. When the viewer enters such a dense world, an occlusion management technique has to be employed to visualize the model properly. In Nanomatrix, we employ a simple object-space clipping localized around the camera, which discards elements according to their distance to a clipping geometric shape. This geometric shape, a sphere in our case, specifies a region in object space that is influenced by the clipping. Besides managing active/inactive cells, the cache management invokes the instance visibility update when the camera moves, to determine which instances within the active cells should be shown or hidden.
It tests the intersection between the instances' bounding spheres and the clipping region to find out which ones lie inside the clipping region and sets them to be hidden. To accelerate the computation, the visibility test is performed in a compute shader and only for the contents of the active cells that intersect the clipping geometry.
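A minimal sketch of the per-instance test, here written as CPU-side C++ rather than the compute shader used in the paper; the names are assumptions:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

struct Sphere { Vec3 center; float radius; };

// Hide an instance if its bounding sphere intersects the clipping sphere around the camera.
bool isHiddenByClipping(const Sphere& instanceBound, const Sphere& clipRegion) {
    float dx = instanceBound.center.x - clipRegion.center.x;
    float dy = instanceBound.center.y - clipRegion.center.y;
    float dz = instanceBound.center.z - clipRegion.center.z;
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return dist < instanceBound.radius + clipRegion.radius;   // spheres overlap -> clip (hide)
}

int main() {
    Sphere protein{ {10.f, 0.f, 0.f}, 5.f };
    Sphere clip{ {0.f, 0.f, 0.f}, 8.f };
    return isHiddenByClipping(protein, clip) ? 0 : 1;          // overlapping here -> hidden
}
```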

Atomistic Models Construction
In the pre-processing phase, both rectangle-based and box-based geometric tiles are prepared. For their population, we implemented two distinct population methods. The first method is designed for the membrane population using rectangular Wang tiles; the second method is for the population of the inner matrix of the biological compartment using the box tiles. Both methods are described in this section.
A membrane of a biological entity is typically a thin envelope that consists of the lipid bilayer and membrane-bound proteins. For representing the overall shape of the biological entity, we use a geometry mesh that can be created with 3D modeling tools or derived directly from biological measurements. In the following, an algorithm that populates molecular instances along the mesh is described. The population is done for every active cell independently. Therefore, in this section, we assume that the cell c ∈ C has been selected to be active, and the task is to populate this cell with geometries using rectangular tiles and box tiles.
During the generation of the cell, many molecular instances have to be processed. We designed the construction algorithm to process and populate these instances independently, which makes it suitable for parallel execution on the GPU. We estimate the number of parallel construction threads as the maximum number of molecules needed to cover the largest triangle in the mesh or a cell in the grid, depending on whether it is a membrane or a solubles population.

Prepare Cell for Population
To ensure scalability, Nanomatrix is designed to prefer recomputing data over storing them. Therefore, no information about any cell in the scene grid is stored until it becomes active. Once a cell c is selected to be active, the cache assigned to that cell, c.cache, is cleared and some necessary information needs to be computed to prepare the cell to be filled with molecular instances.
We need to populate the rectangular/box tiles on/inside the mesh. As the mesh can be of arbitrary size and shape, the active cell c can be in three configurations: outside a mesh, inside a mesh, or intersecting a mesh. To populate it with molecular instances, we first need to know which mesh instances and triangles intersect that cell. We run an AABB-triangle collision algorithm in parallel to test for intersections between the active cell c and all scene triangles and then store the intersected triangles c.T and mesh instances c.H in the cell's cache c.cache. If no intersection is found, the cell c could be entirely inside or entirely outside the mesh; therefore, we need an in-out test. To achieve this, we find the closest triangle to the cell, c.closestTriangle, as the triangle whose center has the smallest distance to the center point of the active cell c. We find the closest triangle using an atomic operation in the same parallel compute shader.

Membrane Population
Our approach to populating the membrane overlays each triangle of the mesh with a sub-set of TR, which represents the Replication Area (rep). Based on the texture coordinates, we obtain the replication area that encloses the triangle t and then re-project that area, with its respective tiles, onto the triangle in 3D space. By dividing the triangle's minimum texture coordinate t.min_uv by the size of a tile in texture coordinates r.size_uv, we get the two-dimensional index into the tiling recipe (TR_xy) that represents the starting corner of the replication area rep[0, 0] (see Figure 9; in this example, TR[1,1] represents the starting corner of the replication area). Within this replication area, we populate all molecular instances in parallel. In our case, the replication area always has the same dimensions (rep^R.dim), calculated from the biggest triangle of the mesh, and we use it for smaller triangles as well. The replication area completely covers the triangle area, but tiles can also lie outside the triangle t or outside the cell c. Overall, for every intersected triangle t ∈ c.T, we run (N^R × |rep^R|) threads, where N^R represents the maximal number of molecular instances that can be found in any rectangular tile r ∈ R. Threads are associated with a particular molecular instance m that belongs to a certain tile and is stored in a linear buffer. The remaining threads are discontinued.
In every population thread, the thread id specifies the molecular instance m and the two-dimensional indices within the replication area rep^R. We need to know the two-dimensional index of the tile within TR that corresponds to rep^R_ij and from which the populated molecular instance m should be mapped. To achieve that, we can use the minimal uv texture coordinate of the triangle to get the two-dimensional index of the starting corner of the replication area within the tiling recipe, TR_xy (see Figure 9, bottom). Then, by adding (rep^R_i, rep^R_j) to the two-dimensional index of the starting corner of the replication area TR_xy, we get the index of the tile r ∈ R.
To correctly define the 3D position of the molecular instance m_xyz within the tile r, we need to find its texture coordinate m_uv to obtain the barycentric coordinates, which will allow us to get m_xyz through interpolation.
The instance texture coordinates m_uv can be computed as follows: first, we add m_µγ, which represents its position within the tile, to the tile's two-dimensional indices (rep^R_i, rep^R_j). We get the instance position within the replication area in µγ coordinates. Then, we can map that value to uv texture coordinates by multiplying it by r.size_uv. Finally, we need to shift that value to the uv location that represents the origin of the starting corner of the replication area (Figure 9, bottom).
Afterwards, we crop all the instances that lie outside the triangle (using the barycentric coordinates of the instance) and outside the cell c, by testing the intersection between the cell's AABB and m_xyz. If the position m_xyz passes the criteria, the atomicCounter is increased and a new molecular instance is recorded as c.cache[atomicCounter] = m. The newly populated instance m inherits all the features such as its molecular type, color, and position. The rotation is stored in an analogous way. The only difference with the rotation is that it is adjusted by a rotation that maps the z-axis onto the normal vector of the triangle; the z-axis is used because all the tiles were generated with the default orientation facing the z-axis.
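Putting the steps of this subsection together, the following C++ sketch mimics the work of one population thread on the CPU; the barycentric interpolation, the uv-to-tile-space conversions, and all helper names are illustrative assumptions that collapse several shader details into plain functions.

```cpp
#include <array>
#include <cmath>

struct Vec2 { float u, v; };
struct Vec3 { float x, y, z; };

struct Triangle {                      // one mesh triangle with its texture parameterization
    std::array<Vec3, 3> pos;           // world-space corners
    std::array<Vec2, 3> uv;            // texture coordinates of the corners
};

// Barycentric coordinates of point p with respect to the triangle's uv corners.
static Vec3 barycentricUV(const Triangle& t, Vec2 p) {
    float x1 = t.uv[1].u - t.uv[0].u, y1 = t.uv[1].v - t.uv[0].v;
    float x2 = t.uv[2].u - t.uv[0].u, y2 = t.uv[2].v - t.uv[0].v;
    float px = p.u - t.uv[0].u,       py = p.v - t.uv[0].v;
    float d  = x1 * y2 - x2 * y1;
    float b1 = (px * y2 - py * x2) / d;
    float b2 = (x1 * py - y1 * px) / d;
    return { 1.f - b1 - b2, b1, b2 };
}

// One "thread": map an instance located at tileCoord (µγ in [0,1]) of the tile at
// replication-area slot (repI, repJ) onto the triangle; returns false if it is cropped.
bool populateInstance(const Triangle& t, Vec2 tileCoord, int repI, int repJ,
                      float tileSizeUV, Vec2 repOriginUV, Vec3& outWorldPos) {
    // Instance uv = replication-area origin + (tile slot + position inside tile) * tile size.
    Vec2 uv = { repOriginUV.u + (repI + tileCoord.u) * tileSizeUV,
                repOriginUV.v + (repJ + tileCoord.v) * tileSizeUV };
    Vec3 b = barycentricUV(t, uv);
    if (b.x < 0.f || b.y < 0.f || b.z < 0.f) return false;     // outside the triangle: crop
    outWorldPos = { b.x * t.pos[0].x + b.y * t.pos[1].x + b.z * t.pos[2].x,
                    b.x * t.pos[0].y + b.y * t.pos[1].y + b.z * t.pos[2].y,
                    b.x * t.pos[0].z + b.y * t.pos[1].z + b.z * t.pos[2].z };
    return true;                       // a real thread would also test the cell AABB and
}                                      // append to c.cache via an atomic counter

int main() {
    Triangle t{ { Vec3{0, 0, 0}, Vec3{1, 0, 0}, Vec3{0, 1, 0} },
                { Vec2{0, 0},    Vec2{0.5f, 0}, Vec2{0, 0.5f} } };
    Vec3 p;
    bool kept = populateInstance(t, {0.2f, 0.2f}, 0, 0, 0.25f, {0.f, 0.f}, p);
    (void)kept; (void)p;
    return 0;
}
```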

Solubles Population
In the previous section, the membrane population was described. The next step is to populate the internal parts of the biological structures with molecular instances, not the external space on the other side of the boundary. Similar to the membrane population, this approach is based on tiles. However, instead of the rectangle-based tile set R, the box-based tile set B (see Figure 3) is used. In this case, the population does not rely on the texture coordinates of the mesh. Moreover, no Wang-tile approach is used for the B tile set. As these tiles are not visible from outside the structure, and when immersed inside it is very unlikely to notice any seams in such a crowded environment, the seamless constraint is not applied in the box-tiling case. However, the same Wang tiles concept as previously explained could be extended to 3D and used for the B tiles. In our implementation, we generate each box tile b ∈ B using the same size b.size_xyz, limited to the cell size c.size_xyz.
Firstly, as previously mentioned, once the cell c is selected to be populated, the set of intersected triangles c.T is computed. Moreover, a list of intersected meshes c.H, specifying to which meshes the triangles c.T belong, is created. Afterward, c.closestTriangle is determined, as described in the subsection on preparing the cell for population. The cell is then filled with box tiles in parallel. Every thread is associated with a molecular instance m, where all instances are stored in a linear buffer of the box tile b. For every instance m in a box tile, its relative 3D position inside the box, m_xyz, is computed. To calculate the absolute position, we need to calculate the world-space position of the starting corner of the box tile b. This position is calculated from the starting corner of the cell c, the three indices x, y, z that refer to the respective box tile b, and the world-space size of the box tile b.size_xyz. Once the 3D position of the molecular instance m_xyz is calculated, it is tested whether it lies inside the cell c. Moreover, if the cell is intersected by triangles (i.e., c.T ≠ ∅), the algorithm decides in which half-space with respect to the triangle c.t the position m_xyz lies. This orientation is determined based on the normal vector of the triangle mesh, i.e., whether the normal points toward m_xyz with respect to the triangle center or away from it. If m_xyz is outside the biological structure, it is rejected and the computation stops. Otherwise, the atomicCounter is increased and a new molecular instance is recorded as c.cache[atomicCounter] = m. The newly populated instance inherits all the features of instance m, such as its molecular type, position, and orientation. For the case when there is no intersection with any mesh (i.e., c.T = ∅), the closest-triangle information c.closestTriangle is used as an indicator, and the normal of the closest triangle is again used to determine whether the entire cell c is inside or outside of the biological structure defined by the mesh. In case it is marked as outside, no population is performed.
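As with the membrane case, one soluble-population thread can be sketched in C++ as follows; the inside/outside heuristic via the nearest triangle normal follows the description above, and the helper names (boxTileWorldOrigin, isInsideCompartment) are assumptions.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// World-space origin of box tile (bx, by, bz) inside cell c.
Vec3 boxTileWorldOrigin(Vec3 cellOrigin, int bx, int by, int bz, Vec3 boxSize) {
    return { cellOrigin.x + bx * boxSize.x,
             cellOrigin.y + by * boxSize.y,
             cellOrigin.z + bz * boxSize.z };
}

// Inside/outside decision: the mesh normal of the reference triangle points outward,
// so a point is inside the compartment if the vector from the triangle center to the
// point opposes the normal.
bool isInsideCompartment(Vec3 point, Vec3 triCenter, Vec3 triNormal) {
    return dot(sub(point, triCenter), triNormal) < 0.f;
}

int main() {
    Vec3 cellOrigin{ 0.f, 0.f, 0.f };
    Vec3 boxSize{ 100.f, 100.f, 100.f };
    Vec3 relPos{ 30.f, 40.f, 10.f };                       // m_xyz relative to its box tile

    Vec3 origin = boxTileWorldOrigin(cellOrigin, 1, 0, 2, boxSize);
    Vec3 world{ origin.x + relPos.x, origin.y + relPos.y, origin.z + relPos.z };

    // Reference triangle of the compartment boundary (center and outward normal).
    Vec3 triCenter{ 500.f, 0.f, 0.f };
    Vec3 triNormal{ 1.f, 0.f, 0.f };
    bool keep = isInsideCompartment(world, triCenter, triNormal);
    // A real thread would additionally clip against the cell AABB and append the
    // instance to c.cache via an atomic counter when keep == true.
    return keep ? 0 : 1;
}
```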

Parallel RTX-based Molecular Rendering
The computational complexity of interactive ray tracing grows logarithmically with the complexity of the scene, which challenges the predominant role of rasterization in complex environments [47][48][49]. Recent graphics card architectures support hardware-accelerated ray tracing. We utilize this new technology to accelerate the rendering of our on-the-fly populated scene.

Acceleration Structures
The acceleration structure (AS) is a core component of every efficient ray tracing algorithm. To accelerate ray tracing, modern GPUs implement this component in hardware. The NVIDIA GPU hardware implementation exposes only two levels of acceleration structure to the user. The bottom-level AS (BLAS) defines the geometric description of a model, while the top-level AS (TLAS) consists of instances that are associated with a transformation matrix as well as a reference to one of the BLASes [50].
In our rendering algorithm, we use two acceleration-structure representations: the cellular AS, which defines the skeleton of the scene and contains all the mesh instances that define the shape, size, and position of the biological structures, and the atomistic AS, which contains the atomistic description of the active cells. In the cellular AS we create a BLAS for every mesh, while in the atomistic AS we create a BLAS for every molecular model, where we define the positions and radii of its atoms, and then we instantiate them within the scene (see Figure 11).
Representing the atomistic AS as a single TLAS would require rebuilding it from scratch whenever the set of active cells changes. The RTX acceleration structure allows updating a TLAS, which is cheaper than rebuilding it; however, an update can only modify instance information, e.g., the transformation matrix. If a new instance needs to be added to the scene, the TLAS has to be rebuilt. To avoid this, our atomistic AS contains multiple TLASes, one TLAS per active cell. The active cell's TLAS is generated based on the contents of its cache. Once a cell becomes active, its TLAS is built and does not change until that cell becomes inactive.
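To make this organization concrete, the following is a minimal C++ sketch of the data layout (conceptual only; in the implementation these are Vulkan acceleration-structure objects, and all names here are our own):

```cpp
#include <cstdint>
#include <vector>

// One BLAS per molecular model: atom centers and radii, built once and shared.
struct MoleculeBLAS {
    std::vector<float> atomCentersXYZ;  // 3 floats per atom
    std::vector<float> atomRadii;       // 1 float per atom
};

// One instance = transform + reference to a BLAS, mirroring an RTX instance record.
struct MoleculeInstance {
    float    transform[12];   // 3x4 row-major object-to-world matrix
    uint32_t blasIndex;       // which molecular model this instance uses
};

// One TLAS per active cell, built when the cell becomes active.
struct ActiveCellTLAS {
    std::vector<MoleculeInstance> instances;  // filled from the cell's cache
};

// The atomistic AS: 27 per-cell TLASes sharing the same set of BLASes.
struct AtomisticAS {
    std::vector<MoleculeBLAS>   blases;   // geometry stored only once
    std::vector<ActiveCellTLAS> tlases;   // one per active cell
};

int main() {
    AtomisticAS as;
    as.blases.push_back({});              // e.g. one molecular model
    as.tlases.resize(27);                 // one TLAS per active cell
    as.tlases[0].instances.push_back({{1,0,0,0, 0,1,0,0, 0,0,1,0}, 0});
    return 0;
}
```

This sketch also shows why geometry is stored once while instances can appear in several per-cell TLASes, which is the property the design relies on.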
For each TLAS, hardware acceleration requires specifying the type of ray-intersection test used by the traversal program. The selection should be based on the lowest-level geometric representation in the BLAS. In the cellular AS, the meshes are defined as triangles, therefore we use the hardware triangle-ray intersection test built into the GPU. In the atomistic AS, the molecular models are defined through their atoms/spheres, therefore we use a custom-implemented sphere-ray intersection test.

Figure 10: Overview of the parallel rendering pipeline. On the left side, the "sort-last" scheme of parallel rendering is presented, which consists of two passes. The first pass is highlighted with blue boxes that trace the atomistic AS TLASes in parallel, while the yellow box represents the second pass that composites the 27 image buffers. On the right side, the corresponding high-level description of each pass is given.
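The custom sphere-ray intersection mentioned above is, at its core, the standard quadratic test; the following C++ sketch illustrates it (function names are ours, and in the renderer this logic lives in the ray-query/intersection code rather than on the CPU):

```cpp
#include <cmath>
#include <cstdio>
#include <optional>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Nearest non-negative intersection distance of a ray with an atom sphere, if any.
std::optional<float> raySphere(Vec3 origin, Vec3 dir /*normalized*/, Vec3 center, float radius) {
    Vec3  oc   = sub(origin, center);
    float b    = dot(oc, dir);
    float c    = dot(oc, oc) - radius * radius;
    float disc = b * b - c;
    if (disc < 0.0f) return std::nullopt;        // ray misses the sphere
    float t = -b - std::sqrt(disc);              // nearer root
    if (t < 0.0f) t = -b + std::sqrt(disc);      // origin inside the sphere
    return (t >= 0.0f) ? std::optional<float>(t) : std::nullopt;
}

int main() {
    auto t = raySphere({0, 0, -5}, {0, 0, 1}, {0, 0, 0}, 1.5f);
    if (t) std::printf("hit at t = %.2f\n", *t);  // 3.50
}
```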

Parallel Rendering
The scene is rendered as a combination of atomistic and cellular rendering. Details closer to the camera are rendered at atomistic resolution, while objects further away from the camera are rendered as textures. To prevent sudden popping of atomistic structures, we implement a smooth transition from texture detail to atomistic detail and vice versa using alpha blending. The molecular details are rendered in two passes. In the first pass, the active cells are rendered separately into their respective framebuffer objects (FBOs). This pass takes advantage of having multiple TLASes in the atomistic AS to parallelize their rendering. We use the sort-last parallel rendering scheme [39], which renders the atomistic AS TLASes in parallel and results in a very high data rate, as the rendering operates independently. Instead of parallelizing the rendering across multiple GPUs, we use a single GPU with a compute shader and the GLSL EXT ray query extension to parallelize the rendering tasks between threads. The ray query extension allows us to invoke ray-tracing queries from the compute shader. This extension is an alternative to the ray-tracing pipeline, but no separate dynamic shader or shader binding table is needed [51]. To our knowledge, no technical literature has reported using the GLSL EXT ray query extension for parallel rendering so far. In the first rendering pass, the atomistic AS TLASes are traced in parallel, with one thread per pixel per active cell. In each thread, once the closest hit is found, its information (e.g., depth, instance id, atom id) is stored in the full-screen image buffer of the thread's active cell. Otherwise, the value (−1) is stored, which means the ray did not hit any nanoscale structure (see Figure 10).
In the second pass, the resulting FBOs are composited based on the depth values to form the final rendered image. If at least one hit is found among the 27 images, the instance id and atom id of the closest hit among them are used to fetch the molecular color. In addition, the shading is computed using the Phong illumination model as well as ray-traced ambient occlusion. If there is no hit (no nanoscale structure information is found), the cellular structure information is provided using an image-based approach.
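A hedged sketch of the compositing step follows; in the implementation it runs as one thread per pixel over GPU image buffers, while here it is written as plain C++ over arrays, with −1 marking "no hit" as described above (the HitRecord fields are our simplification):

```cpp
#include <cstdio>
#include <limits>
#include <vector>

struct HitRecord {
    float depth;      // distance along the primary ray
    int   instanceId; // -1 means the ray hit nothing in this cell's TLAS
    int   atomId;
};

// Composite the per-cell hit buffers of one pixel: keep the closest hit.
HitRecord compositePixel(const std::vector<HitRecord>& perCellHits) {
    HitRecord best{std::numeric_limits<float>::max(), -1, -1};
    for (const HitRecord& h : perCellHits) {
        if (h.instanceId != -1 && h.depth < best.depth)
            best = h;                      // closest atomistic hit wins
    }
    // best.instanceId == -1 -> fall back to the image-based cellular representation
    return best;
}

int main() {
    std::vector<HitRecord> hits = {{9.0f, 4, 120}, {-1.0f, -1, -1}, {3.5f, 7, 42}};
    HitRecord r = compositePixel(hits);
    std::printf("closest: instance %d, atom %d, depth %.2f\n",
                r.instanceId, r.atomId, r.depth);   // instance 7, atom 42, depth 3.50
}
```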

Image-based Tiling
Image-based impostors are commonly used to avoid rendering objects that are far away from the viewpoint by replacing the geometry of these objects with a painted texture [52][53][54]. In the tile-preparation phase (see subsection 4.2), GW-tiles have been created and the corresponding texture map has been synthesized. The key idea is to use both of these levels of detail while rendering the scene. When the camera is close to a biological structure, the GW-tile is used. Once the camera zooms out so that the atomistic detail is no longer perceivable, the corresponding part of the texture map is rendered in the very same place. Subsection 4.2 also described the tile recipe TR, which forms a virtual map of tiles covering the whole uv texture space associated with the mesh.
The texture map is sampled while rendering a cellular mesh using the following approach. Using the texture coordinate of a fragment, f uv, we first determine which tile it lands in, based on the position of f uv within the tiling recipe. Furthermore, the relative fragment position f µγ inside this tile is computed. Based on the tile id, the tile's respective starting position tileTex.min uv in the texture map is obtained. The resulting color is fetched from the packed texture map at the position (tileTex.min uv + f µγ). Besides the diffuse color, the texture map contains normal and ambient-occlusion buffers, which are used to add geometric detail to the shaded surface.
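The lookup from a fragment's texture coordinate into the packed texture map can be sketched as follows (a simplified C++ version of the per-fragment computation; the recipe layout, the 4 × 4 atlas packing, and the variable names are assumptions on our side, and uv is assumed to lie in [0, 1)):

```cpp
#include <cstdio>

struct Vec2 { float u, v; };

// Tiling recipe: a virtual grid of rows x cols references into the 16 base tiles.
struct TileRecipe {
    int rows, cols;
    const int* tileIds;   // row-major, values in [0, 16)
};

// Given a fragment's uv, return its sampling position in a packed 4x4 tile atlas.
Vec2 packedTexCoord(const TileRecipe& tr, Vec2 fragUV) {
    // Which recipe cell does the fragment land in, and where inside that cell?
    float fx = fragUV.u * tr.cols, fy = fragUV.v * tr.rows;
    int   cx = (int)fx, cy = (int)fy;
    Vec2  frac{fx - cx, fy - cy};                 // relative position inside the tile

    int tileId = tr.tileIds[cy * tr.cols + cx];   // which of the 16 Wang tiles
    // Starting corner of that tile in the packed 4x4 atlas (each tile is 0.25 wide).
    Vec2 tileMin{(tileId % 4) * 0.25f, (tileId / 4) * 0.25f};
    return {tileMin.u + frac.u * 0.25f, tileMin.v + frac.v * 0.25f};
}

int main() {
    const int ids[] = {0, 5, 5, 0, 9, 2, 7, 12, 3, 3, 14, 1};   // 3x4 toy recipe
    TileRecipe tr{3, 4, ids};
    Vec2 p = packedTexCoord(tr, {0.6f, 0.4f});
    std::printf("atlas uv: (%.3f, %.3f)\n", p.u, p.v);           // (0.850, 0.300)
}
```

The same atlas coordinate is used to fetch the diffuse, normal, and ambient-occlusion buffers, as described above.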
To avoid a sharp transition from the geometric to the image-based representation, we apply alpha blending to the instances located at the far border of the neighboring cells, which combines the atomistic geometry color with the image-based cellular color. Due to the high-frequency components of the image-based texture, texturing far meshes results in aliasing artifacts. To avoid this, we use a two-level hierarchy of colors for the meshes, where the first level is the packed Wang-tiles texture and the second is a solid color, and we blend them based on the world-space distance to the camera.
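The two-level color hierarchy can be expressed as a simple distance-driven blend; the following C++ sketch illustrates the idea, with the blend range chosen purely for illustration rather than taken from the system:

```cpp
#include <algorithm>
#include <cstdio>

struct Color { float r, g, b; };

// Blend the high-frequency Wang-tile texel toward a per-mesh solid color as the
// surface moves away from the camera, suppressing aliasing on distant meshes.
Color shadeMeshTexel(Color tileTexel, Color solidColor, float distToCamera,
                     float nearDist, float farDist) {
    float t = std::clamp((distToCamera - nearDist) / (farDist - nearDist), 0.0f, 1.0f);
    return {tileTexel.r + t * (solidColor.r - tileTexel.r),
            tileTexel.g + t * (solidColor.g - tileTexel.g),
            tileTexel.b + t * (solidColor.b - tileTexel.b)};
}

int main() {
    Color c = shadeMeshTexel({0.8f, 0.2f, 0.2f}, {0.5f, 0.5f, 0.5f},
                             /*dist=*/12000.0f, /*near=*/3000.0f, /*far=*/20000.0f);
    std::printf("blended: %.2f %.2f %.2f\n", c.r, c.g, c.b);
}
```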

Results
Our novel construction approach is capable of instantly generating and visualizing biological worlds at the cellular mesoscale. We demonstrate the scalability of Nanomatrix on a scene containing a red blood cell (RBC) and a SARS-CoV-2 virion, as shown in Figure 1.
Instantly generating and rendering such a large scene in atomistic detail was not possible with previous methods. The bounding box of the RBC mesh measures 64,360 × 15,110 × 75,410 = 73,334,686,636,000 Å³. The generated box-tile size is 1,000 × 1,000 × 1,000 = 1,000,000,000 Å³. Each box tile contains approx. 5,000 molecular instances (hemoglobins of approx. 4,380 atoms each), i.e., approx. 21,000,000 atoms. Additionally, the surface area of the RBC mesh is approx. 9.54 × 10⁹ Å². The area of its membrane GW-tile is 500 × 500 = 250,000 Å². Each membrane GW-tile contains approx. 5,000 molecular instances with approx. 335,000 atoms. If the RBC were fully generated, the entire scene would contain approx. 518 million molecules with 1.2 trillion atoms. One SARS-CoV-2 virion consists of approx. 135 thousand molecular instances with approx. 24 million atoms (see Table 2 for more details). In our model scene, there are four RBCs and twenty SARS-CoV-2 particles, which amounts to approx. 5 trillion atoms in total. The algorithm, however, is scalable enough to work with any number of non-overlapping models that fit into the bounding box of the scene.

The implementation of the approach was realized using the Vulkan API [55] and NVIDIA's nvpro-samples framework [56]. The performance was measured on an NVIDIA GeForce RTX 4090 graphics card with 24 GB of memory. In our experiment, we create a grid of dimension 200 × 200 × 200 with a cell size of 2,000 Å. Each cell cache buffer has room for 500,000 molecular instances. These parameters are programmatically adjustable and can be set to meet the requirements of any dedicated GPU. The construction algorithm is able to populate a cell with membrane molecules in approx. 6 ms and with soluble molecules in approx. 3 ms.

We implemented the approach in two environments. The first environment, where the rendering is built on top of the Marion library, is developed using C++ and the OpenGL 4.6 graphics API [57]. The second environment is built using the Vulkan graphics API with RTX functionality. Figure 12 shows our model scene rendered with the RTX-based renderer. Whereas the construction algorithm has approximately the same performance in both environments (in both cases it relies on the compute-shader pipeline and not on the graphics pipeline), the rendering performance differs significantly. We achieve on average a factor 2.5 speedup with the RTX-based framework. On the NVIDIA GeForce RTX 4090 in full-HD resolution with similar settings of the scalable construction algorithm, the Marion renderer runs at approx. 55 FPS (with drops down to 35 FPS). The RTX-based framework runs at approx. 150 FPS (with drops to 110 FPS) with a single ray per pixel, and at 110 FPS (with drops to 80 FPS) with 10 rays per pixel. This speedup is sufficient to enable the future integration of a VR or AR interface, where a high framerate is the key factor for a satisfactory user experience. The performance of Nanomatrix can be further increased by submitting the rendering and construction workloads to two different queue families on the GPU, which allows the RT-core and compute workloads to be processed concurrently. This feature is supported by recent NVIDIA graphics-card architectures [58].

Discussion
We implement object-space ambient occlusion (OSAO) to convey the shape and depth of molecules by tracing random AO rays against the TLAS of the active cell to which the hit primitive belongs. The AO algorithm is implemented as described in the NVIDIA Vulkan API tutorials [59]. Due to our gridding approach, the AO is inaccurate on the primitives located at the borders of the active cells.
To estimate the shading value correctly, the TLAS of the adjacent cell would need to be traced as well. This is one of the drawbacks of the sort-last scheme of parallel rendering [60]. To overcome this issue, a test has been added to the AO computation: if the hit atom is located on the cell border, the AO ray traverses the AS of all active cells that intersect the hit atom.
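The border test can be sketched as follows (a simplified CPU-side illustration; the threshold based on the maximum AO-ray length is our assumption, and in the shader the AO ray actually traverses the neighbors' TLASes via ray queries):

```cpp
#include <algorithm>
#include <cstdio>

struct Vec3 { float x, y, z; };

struct Cell {
    Vec3  minCorner;   // world-space corner of the cell
    float size;        // cell edge length, e.g. 2000 Angstrom
};

// A hit position is a "border" case if it is closer to any cell face than the
// maximum AO-ray length, so an AO ray starting there could leave the cell.
bool hitNearCellBorder(const Cell& c, Vec3 hit, float aoRayLength) {
    float dx = std::min(hit.x - c.minCorner.x, c.minCorner.x + c.size - hit.x);
    float dy = std::min(hit.y - c.minCorner.y, c.minCorner.y + c.size - hit.y);
    float dz = std::min(hit.z - c.minCorner.z, c.minCorner.z + c.size - hit.z);
    return std::min({dx, dy, dz}) < aoRayLength;
}

int main() {
    Cell c{{0.0f, 0.0f, 0.0f}, 2000.0f};
    std::printf("border: %d\n", hitNearCellBorder(c, {1970.0f, 900.0f, 900.0f}, 50.0f)); // 1
    std::printf("border: %d\n", hitNearCellBorder(c, {1000.0f, 900.0f, 900.0f}, 50.0f)); // 0
    // If the test is positive, the AO ray is additionally traced against the
    // TLASes of the neighboring active cells that intersect the hit atom.
}
```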
Selecting which neighboring cells should be activated can be done based solely on the camera position, or based on both the camera position and direction by activating the closest neighbors that intersect the view frustum. Figure 13 shows a side-by-side comparison of the two methods. While the frustum-based selection enriches the scene with geometric information, the position-based method can be more suitable in immersive environments, such as VR, where the camera position and direction change continuously and frustum-based selection would cause more cells to be populated and destroyed.
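A minimal sketch of the position-based neighbor selection follows (the grid layout and names are our own simplification; the frustum-based variant would additionally test each candidate cell's bounds against the view frustum, as noted in the comment):

```cpp
#include <cstdio>
#include <vector>

struct CellIndex { int x, y, z; };

// Position-only activation: the cell containing the camera plus its 26 neighbors.
std::vector<CellIndex> activeCells(float camX, float camY, float camZ,
                                   float cellSize, int gridDim) {
    int cx = (int)(camX / cellSize), cy = (int)(camY / cellSize), cz = (int)(camZ / cellSize);
    std::vector<CellIndex> out;
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int x = cx + dx, y = cy + dy, z = cz + dz;
                if (x < 0 || y < 0 || z < 0 || x >= gridDim || y >= gridDim || z >= gridDim)
                    continue;                     // clamp at the scene boundary
                out.push_back({x, y, z});
                // Frustum-based variant: additionally require the cell's AABB
                // to intersect the view frustum before activating it.
            }
    return out;
}

int main() {
    auto cells = activeCells(150000.0f, 150000.0f, 150000.0f, 2000.0f, 200);
    std::printf("%zu active cells\n", cells.size());   // 27 in the grid interior
}
```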
The instance count in an RTX acceleration structure must not exceed a specific limit, which varies between graphics cards. Our RTX 4090 graphics card supports a maximum of 16M instances. The scalability of Nanomatrix, however, allows us to overcome that limit by rendering multiple acceleration structures in parallel. The selected design for the atomistic acceleration structure (see Figure 11, top) has several additional advantages. In this acceleration structure, each BLAS defines the geometric description of a molecular model, while the active cells' TLASes consist of instances associated with a transformation matrix as well as a reference to one of the BLASes. This two-level hierarchy allows us to populate multiple instances of a molecule in several TLASes while storing its geometry only once in GPU memory. In addition, it enables us to construct and destroy cells independently, which supports parallel cell population. Another advantage of this acceleration-structure design is that it fits nicely with the sort-last parallel rendering scheme, which accelerates the performance since the rendering operates on the TLASes independently. We have experimented with various ways in which the RTX acceleration structure can be defined for Nanomatrix. We finally converged on the proposed construction algorithm, which leads to high rendering performance while keeping the memory requirements low and the structure updatable.
Our method is scalable; its parameters can be adjusted based on the available computational resources. As the size increases, more geometric information is presented, which enriches the scene with detail. However, it also increases the computational complexity and the memory footprint. Our method allocates a part of the dedicated GPU memory for the cells' cache buffers.
Clearly, increasing the cache-buffer size increases the allocated portion of memory. Increasing the size of the activation window, on the other hand, increases the number of cache buffers. In addition, the rendering overhead grows with the size of the activation window, because it requires more rendering threads in the first rendering pass and more images to be composited in the second pass. Based on our experiments, we set the cell size to 2,000 Å and activate only the 27 cells surrounding the camera at any time. Consequently, we use the geometric representation for close objects and switch to the image-based representation when the distance to the object exceeds 3,000 Å (assuming the camera is in the middle of the central cell). These values were handcrafted for testing on our system, an NVIDIA GeForce RTX 4090, which runs the framework at approx. 150 FPS (with drops to 110 FPS) in full-HD resolution. On another system, a different setup may be appropriate. It would be natural to design an adaptive approach that controls the selection of the visual representation such that a desired rendering frame rate is maintained. However, such automation is out of the scope of this paper and would be an interesting direction for future work.
Our construction algorithm is meant for explanatory visualization of extremely large cellular mesoscale scenes that can be explored down to atomistic detail. The tiling strategy that we employ may be criticized for its repetitiveness and the associated plausibility of the resulting model. We want to emphasize that the explanatory visualization scenario, such as an interactive show in a science center, allows for a certain flexibility in the structural accuracy of the biological system that might or might not be acceptable within scientific discovery workflows. The presented biological scenes are accurate with respect to the concentration and density of the structural composition. While there are no repetitions in the actual biological systems being depicted, our construction creates a seamless repetition of structures. As stated, this is tolerable for the broad-audience use case, as is the case in many graphics applications where Wang tiling is frequently used.

Comparison to Existing Work
Nanomatrix is a view-guided construction system in which the construction algorithm is paired with visualization, which makes it possible to visualize multiple RBC instances in one scene (four in our example case). No in-core algorithm can render a single RBC in atomistic detail, because it does not fit into the memory of current GPUs. A single RBC contains approx. 1.2 trillion atoms. Saving just the position (3× float) and type (1× byte) of an atom requires 13 bytes, so a single RBC requires 15.6 trillion bytes (approx. 15.6 TB).

For testing purposes, we exported a single cell of the RBC of size 2,000 × 2,000 × 2,000 Å, which contains approx. 49.5 million atoms (shown in the supplementary material). Firstly, we used the off-the-shelf visualization tools Avizo [61] and ParaView [62]. As the strength of these tools lies in user-defined custom visualization of generic data, they are not optimized for our dataset and they crashed.
Then, we tested this model in the open-source molecular data renderers available in MegaMol: the RTXPkD renderer [38], which uses a kD-tree adapted to particles with hardware ray tracing, and the multi-level culling variant by Grottel et al. [28], which is optimized for rendering large data. The results are shown in the supplementary material. The left image shows the model in the RTXPkD renderer, which ran at 83 FPS, while the middle image shows the result of the Grottel et al. [28] renderer, which ran at 177 FPS. However, testing 27 active cells in MegaMol was not possible because of GPU memory limitations. Visualizing 27 copies of this model requires at least 17.5 GB of GPU memory, which does not leave much room for the framebuffer and other data structures required by the renderer.
Several molecular renderers, such as Marion [57], exploit instancing for visualizing large molecular data, as biological models often consist of a large number of recurring molecules. The instance-based scheme reduces the amount of required storage by defining each molecule only once and instantiating it many times within the scene. However, even these renderers cannot visualize a single RBC. Saving only the position, rotation, and type of an instance requires 29 bytes. A single RBC contains approx. 518 million instances of molecular structures. Therefore, in the most optimal case, it requires approx. 15 GB of GPU memory. In practice, the implementation requires more memory, and it is therefore not possible to fit the model fully into GPU memory.
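For reference, the 15 GB estimate follows directly from the per-instance record size; a minimal worked check (the exact in-memory layout of a concrete implementation may of course differ):

```cpp
#include <cstdio>

int main() {
    const double bytesPerInstance = 29.0;   // position + rotation + type
    const double instances = 518e6;         // molecular instances in one RBC
    double gb = bytesPerInstance * instances / 1e9;
    std::printf("instance data for one RBC: %.1f GB\n", gb);   // ~15.0 GB
}
```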
For ray tracing, additional BVH data further increases the memory overhead. A BVH requires almost 3.5 times the raw data size [38], as a bounding box needs to be stored for each node in the hierarchy. Even with the instance-based scheme, none of the current techniques is currently able to render one RBC on a GPU with 24 GB of memory.
The main idea of our approach for overcoming these limitations (memory, number of instances) is to "compute instead of store". The enormous scene as a whole is never completely stored in memory; only the fraction of it that is close to the viewer is. The only approach that follows the same gist is the Instant Construction approach [2]; however, that algorithm is not scalable and cannot construct or visualize worlds larger than what can be packed into GPU memory. It has no scheme that would partition the scene and show it on demand. Our approach can be considered a scalable version built on top of this non-scalable prior work.
All in-core systems are limited by GPU memory. Popular tools such as ParaView, VisIt, and VMD support distributed rendering when the data is too large to fit on a single GPU. Investigating this goes far beyond the scope of this work. These tools also integrate Intel OSPRay [36] into their systems. OSPRay [36] is a scalable, CPU-based ray-tracing library for interactive applications. It supports instancing and is designed for visualizing large data as long as it fits into the available CPU memory. Recently, Intel introduced OSPRay Studio [63], an open-source interactive visualization application that leverages Intel OSPRay [36] as its core rendering engine. We implemented a plugin for OSPRay Studio to import our molecular models and investigate the possibility of rendering them with OSPRay. Performance was measured on an Intel Xeon Gold 6242 at 2.80 GHz in HD resolution.
We exported the content of eight active cells of size 2,000 Å of the RBC model, which consists of 1,322,381 instances (1,069.97 million atoms). OSPRay took approx. 12.5 hours to build its internal data structures and to prepare the scene, and then ran at 3.21 FPS. Based on the Windows Task Manager, the OSPRay Studio application was utilizing 18.398 GB of RAM and 93% of the CPU. Our approach is able to construct the same model in approx. 67 milliseconds and render it at 130 FPS. We also exported the content of 27 active cells of size 2,000 Å of the RBC model. This model consists of 3,296,281 instances (1,748.9 million atoms). OSPRay took approx. 3 days to finally render it at 1.14 FPS. The application was utilizing 45.537 GB of RAM and 93% of the CPU. Nanomatrix is able to construct the same model in approx. 97 milliseconds and render it at 85-100 FPS.
By integrating the construction into the rendering, we overcome the limitations of streaming pre-generated data into memory. It is essential to emphasize that the focus of this paper lies not only on rendering but especially on scalable construction coupled with rendering. Our main contribution centers around the scalability of our approach, which renders four RBCs and could render many more. This is notable considering that an RBC is only 8 µm in diameter and that there are many larger biological cells that need to be visualized.

Limitation
Unlike existing approaches, our tiling approach is not restricted to mapping a tile onto a single face of the same size. It can be applied to arbitrary meshes with arbitrarily sized triangles. The quality of the mesh texturing plays the key role. In areas of the mesh with continuous texturing, our method populates elements with no visible seams. Seams may still occur, mainly around texture wrapping (see Figure 14).

Conclusions
This work presents a scalable approach for exploring the cellular mesoscale down to atomistic resolution. We introduce a view-guided construction algorithm based on the Wang-tile concept. Our implementation uses the Wang-tile concept for the construction of membrane structures of biological entities. The construction of internal parts of biological entities does not integrate the Wang-tile concept, but the extension is straightforward. The overall performance remains interactive even for hundreds of billions of atoms. Currently, the box tiling is not aperiodic, as the seams are not critical.
However, if we include the representation of linear strands of genetic macromolecules or other fibers, continuity will become a clear requirement. As the approach used for rectangular tiling can be extended to 3D, we can include the fibrous structures as parts of the GW-tiles. If such an approach were to significantly increase the GPU memory requirements, an on-the-fly fiber population within active cells could be integrated, building on top of the parallel fiber-generation approach of Klein et al. [14]. Newly activated cells would then need to continue the fiber from its last position in the previously active cells.
Occlusion culling is a popular strategy for determining which parts of the scene to load, compute, or render. CellView [33] utilizes the hierarchical z-buffer for accelerating the rendering process; however, its largest object is a macromolecule. It would be interesting to investigate an occlusion-culling technique designed for a higher range of spatial scales, up to an entire cell and potentially beyond. Such an approach would be an interesting direction for future work.
Nanomatrix generates geometric representations of structures that are close to the camera and uses image-texture representations for structures that are so far away from the camera that their atomistic detail is no longer perceivable. A seam between the two representations exists because they are conceptually different. We apply alpha blending to reduce this seam; however, there are several ways in which the problem could be further reduced or eliminated. One way is to interpolate the appearance from one representation to the other and smooth the transition: for example, we can extend the validity of the geometry, draw both texture and geometry at the same position for every pixel, and create a linear transition between them. Another way is to dither the border, so that instead of linear interpolation we decide probabilistically whether the texture representation or the geometry is used for a given pixel, so that the boundary is completely smeared out. A more expensive approach is to increase the perceived geometric complexity of the image tiles by using displacement mapping [64]. In this case, for each mesh face, an AABB representing the displaced surface needs to be stored as a BLAS. Then, during rendering, once a ray hits the BLAS, the algorithm traverses the corresponding displaced surface through a custom intersection shader to compute the intersection. All of these are valid approaches, but finding the optimal solution in terms of both visual quality and performance needs a more in-depth investigation, which could be the subject of future work.

Figure 2 :
Figure 2: Nanomatrix, the scalable construction algorithm. Based on the input structure and building rules, rectangular geometry-based or image-based aperiodic patches are generated in the preprocessing step. For representing a volume of molecules, box tiles are generated. During real-time rendering, these tiles populate the scene with the appropriate level of detail depending on the distance from the camera. The scene is then rendered using RTX ray tracing so that the population and rendering fully utilize separate computational units on the graphics hardware.

Figure 3 :
Figure 3: Illustration of a rectangle-based patch (left) and a box-based patch (right) for SARS-CoV-2 (top row) and the RBC (bottom row).

Table 1 :
Symbols used in this paper.
H: a set of mesh instances in the scene, H = {h}
T: a set of triangles in the mesh, T = {t}
C: a set of active cells, C = {c}
c.T, c.H: the sets of triangles/mesh instances that intersect the active cell c (where c.T ⊂ T and c.H ⊂ H)
R: a set of geometric rectangular tiles, R = {r}
B: a set of geometric box tiles, B = {b}
r, b: representative tiles, where r ∈ R and b ∈ B
ĉ: a representative cell
m: a molecular instance
m uv, m xyz, m µγ, m µγλ: the position of a molecular instance in texture, world, or tile coordinates; the subscripts uv, xyz, and µγλ differentiate between the texture, world, and tile coordinates, respectively
TR: tiling recipe, a 2D array of pointers where each element refers to a rectangular tile r ∈ R; conceptually, a pre-designed plan that shows how the Wang tiles can be placed next to each other such that the colors of shared edges match
TR[i, j]: the element of the tiling recipe at row i and column j

Figure 4 :
Figure 4: Illustration of the construction process of a single geometric Wang tile. (a) Four randomly chosen initial base patches inside the rule-based geometry patch, each associated with a color. (b) The four colored base patches are further subdivided into four triangular sub-patches to be used for horizontal and vertical edges. (c) A Wang tile is created by combining four triangular sub-patches; collisions need to be resolved between the stitched triangular sub-patches. (d) The Wang tile after further processing to resolve the collisions.

Figure 5 :
Figure 5: Illustration of the tiling algorithm, showing the application and visualization of the geometric Wang tiles for the SARS-CoV-2 model. From left: four randomly chosen initial base patches inside the rule-based geometry patch; 16 geometric Wang tiles; the population of the tiles using a tile recipe, rendered with and without Wang-tile encoding.

Figure 6 :
Figure 6: Illustration of the textures obtained from the geometry tiles. From the left: texture generated by Wang tiling with the 16 base tiles highlighted; diffuse texture; normal texture; ambient-occlusion texture.

Figure 7 :
Figure 7: Illustration of the indexing-based algorithm for dynamic allocation of cache buffers. In this 2D example, the scene grid contains 16 cells and the number of active cells is 9. (a) and (b) show two different central cells; in each case the central cell is highlighted in yellow, while the active cells are highlighted in blue.

Figure 8 :
Figure 8: Illustration of different activation-window layouts and sizes. In this 2D example, the central cell is highlighted in yellow and the active cells in blue. The size of the activation window is 9 cells in the top images and 25 in the bottom images. In the left two images, the closest neighboring cells to the central cell are activated, while on the right only the closest neighboring cells that intersect the frustum are activated.

Figure 9 :
Figure 9: A simple example of the membrane population (top) and a more general example (bottom). (a) We have a triangular mesh and want to populate the triangle t (highlighted in orange), knowing its xyz world coordinates and uv texture coordinates. (b) The triangle t is projected onto the tiling recipe. Based on the triangle's uv texture coordinates, the replication area (rep R) that encloses the triangle t is identified and then re-projected with its respective tiles onto the triangle in 3D space. (c) The µγ tile coordinates of a representative Wang tile are illustrated. To populate a molecular instance m, we need to know its location in world space, m xyz. To achieve that, we first map the instance from its tile coordinate m µγ to the uv texture coordinate m uv and then use barycentric coordinates to define its 3D position m xyz. (d) A more general example. Here, the tiling recipe (TR) has 4 rows and 5 columns. The position of t.min uv determines the reference tile r from the tiling recipe, TR[1,1]. Based on the size of the triangle, the number of tiles needed to cover the triangle is estimated. In this example, the replication area (rep R) has 3 rows and 4 columns. The offset in the uv texture coordinates refers to the vector from t.min uv to the origin of the reference tile, r.min uv.

Figure 11 :
Figure 11: The two types of acceleration structures used by the renderer. At the top is the atomistic AS, which contains the atomistic-scale description. The atomistic AS contains several TLASes, one TLAS per active cell. Every TLAS has several instances that point to one of the BLASes. At the bottom is the cellular AS, which contains only one TLAS, representing the skeleton of the scene.

Figure 12 :
Figure 12: Example of the resulting scene rendered using our RTX pipeline. Top left: partially populated RBC membrane with an overview of the scene. Top right: several populated SARS-CoV-2 particles partially overlaid by continuous tiling. Bottom left: view from the populated RBC membrane towards partially populated and unpopulated SARS-CoV-2 particles. Bottom right: view inside a SARS-CoV-2 particle. Higher-resolution images can be found in the Appendix.

Table 2:

Figure 13:
Figure 13: Result of two different activation-window layouts. On the left, the closest neighboring cells are activated; on the right, only the closest neighboring cells that intersect the frustum are activated.

Figure 14 :
Figure 14: Illustration of the relation between texturing quality (left column) and the resulting populated model (right column). Top row: every triangle of the mesh is textured independently. Middle row: parts of the mesh with undistorted texture mapping. Bottom row: distorted texturing of an arbitrary mesh. The middle column illustrates the application of the TR onto the mesh.