Categorical Representation Learning and RG flow operators for algorithmic classifiers

Following the earlier formalism of categorical representation learning (arXiv:2103.14770) by the first two authors, we discuss the construction of the "RG-flow based categorifier". Borrowing ideas from the theory of renormalization group (RG) flows in quantum field theory, holographic duality, and hyperbolic geometry, and combining them with neural ODEs, we construct a new algorithmic natural language processing (NLP) architecture, called the RG-flow categorifier (or, for short, the RG categorifier), which is capable of data classification and generation in all layers. We apply our algorithmic platform to biomedical data sets and show its performance in the field of sequence-to-function mapping. In particular, we apply the RG categorifier to genomic sequences of flu viruses and show how our technology is capable of extracting the information from given genomic sequences, finding their hidden symmetries and dominant features, classifying them, and using the trained data to make stochastic predictions of new plausible generated sequences associated with new sets of viruses which could evade the human immune system. The content of the current article is part of a recent US patent application submitted by the first two authors (U.S. Patent Application No. 63/313,504).

The renormalization group (RG) [23] is a powerful and useful set of methods developed in statistical physics and quantum field theory to deal with many-body problems. It helps physicists establish the connection between the microscopic laws of physics and the macroscopic collective behaviors of a system. One starts with a many-body system at the microscopic scale and then performs coarse graining iteratively to group the fundamental building blocks together into larger and larger clusters. Meanwhile, one constructs effective descriptions of the clusters at each scale and extracts the effective interactions among them. In the end, the many-body system is reduced to a few-body system at the largest scales, which enables the understanding of complex systems and their collective behaviors at large scale.
This idea can be particularly useful for representation learning and classification tasks in machine learning. There are many examples of many-body systems in machine learning tasks: for instance, an image can be viewed as a system of many pixels, and a sequence can be viewed as a system of many tokens. It is therefore natural to ask whether the idea of the renormalization group can also be applied to extract the overall representations of images and sequences from their microscopic representations.
In terms of mathematics, the existence of profound connections between quantum field theory and geometry/topology has been a source of many exciting research activities. One example is the interesting connection between the theory of RG flows, for a particular set of quantum field theories in physics, and the geometric theory of Ricci flows in mathematics. The theory of Ricci flows was developed by Richard Hamilton in the 80's [9, 12-15]. Given a smooth manifold M, a Riemannian metric g on M defines a bilinear positive-definite product on the tangent space T_p M for each point p ∈ M. This bilinear form is a 2-tensor which, locally in an open neighborhood U ⊂ M of p, has a matrix representation. One can then investigate whether infinitesimal deformations of the metric on M provide interesting information about its geometry or topology. For instance, given a 1-parameter family g_t, t ∈ (a, b), of metrics on M, one can study the variation of g with respect to the parameter t. The derivative ∂g_t/∂t then provides, for every fixed choice of t and every fixed point p, a bilinear inner product form (i.e. a 2-tensor) on T_p M. It turns out that the variation of the metric in a 1-parameter family leads to a differential equation, the Ricci flow ∂g_t/∂t = −2 Ric_{g_t}, where the term on the right-hand side is the Ricci curvature tensor Ric_{g_t}, named after Gregorio Ricci-Curbastro, which measures how, for each fixed choice of g_t, the geometry of the space curves as one moves along geodesics on the manifold M.
The connection between RG flows for nonlinear sigma models in physics and the Ricci flow for Riemannian manifolds in mathematics has been known for a while, since the early work of Daniel Friedan [8], Zamolodchikov [29], and Tseytlin [26], the groundbreaking work of Grigori Perelman [22] in the proof of the Poincaré conjecture, and more recently Carfora [1]. In the next section we briefly provide an expository account of RG flows in the context of Ricci geometry following the work of Carfora [1]. It must be noted that our focus in the current article is on implementing RG flows for developing algorithmic architectures in mathematical artificial intelligence; therefore we later quickly diverge from the connection to Ricci geometry and focus solely on RG networks. We encourage interested readers to study the resources provided above to gain a deeper understanding of the connections between the two frameworks in physics and mathematics.
When it comes to implementing RG flow theory in machine learning, the key challenge lies in the difficulty of constructing the coarse graining transformation at each RG step. In physics, the RG rules are usually specified by humans, such as the majority vote in real-space RG or the momentum-shell integration in field-theoretic RG. These intuitions may not be immediately applicable to realistic datasets of images and sequences, as the underlying coarse graining rules may be much more complicated than in physics systems. This calls for machine learning methods that enable the algorithm to design and optimize the RG transformation in adaptation to the given dataset. One important idea is borrowed from holographic duality in physics, which states that the RG transformation can be viewed as a holographic mapping of a field configuration from a flat (boundary) space to a hyperbolic (bulk) space of one higher dimension, such that the long-range correlations in the original field configuration can be equivalently represented as short-range correlations in the bulk space. So the optimal RG transformation can be defined as a bijective holographic mapping that disentangles the features at different hierarchies as much as possible. This allows us to embed the bijective holographic map in a flow-based generative model, and use unsupervised machine learning techniques to train the optimal RG transformation. This idea was first proposed in Ref. [20] and further developed in later works [17,18]. The current article further develops the machine-learning RG method by combining the RG-flow model with neural ODE techniques [2], and explores its application to representation learning of sequential data. A large part of the next section is based on the work of Carfora [1] relating RG flows, for a specific set of QFTs, to the Ricci flow construction for Riemannian manifolds. Moreover, the content of that section owes its existence to another highly recommended source, especially for the working mathematician: the outstanding work of Kevin Costello [3] on the mathematical formulation of perturbative quantum field theory.

2. GEOMETRIC FORMULATION OF RG FLOWS
For the time being, we use the geometric construction of the RG flow outlined below, as it is suitably intuitive and pleasantly elegant, and mainly because it will later provide us with the shortest pathways to generalize our constructions in several ways: for instance, by deviating from the classical setup via altering (and generalizing) our action integrals, built from Laplacians and curvature forms, to more general actions, or by altering the base geometrical spaces from smooth manifolds to non-smooth algebraic varieties or discrete lattices.
Let C, X denote respectively a compact oriented Riemann surface and a compact oriented smooth manifold of dimension at least 2, both equipped with a Riemannian metric, and defined over a base number field K. Let Map(C, X) := {f : C → X} be the associated space of all continuous maps from the domain C to X. The construction of the RG flow is based on considering a family of Lagrangians L(f, φ_i), i = 1, ..., n, associated to this space, defined as a morphism L : Map(C, X) × H^*(X, K)^{⊗n} → C^∞(X), taking a tuple of fields (f, φ_1, ..., φ_n) to the space of smooth integrable functions on X. Note that here the notation H^*(X, K)^{⊗n} means that the fields φ_i, i = 1, ..., n, are realized as sections of a sheaf of differentially graded algebras over X, sitting in appropriate cohomological degrees on X. Moreover, we require the Lagrangians to be invariant under the action of the diffeomorphism groups Diff(C), Diff(X) on C and X respectively. Integrating the Lagrangian over the domain Riemann surface induces the Lagrangian action integral

S(f, φ_1, ..., φ_n) := ∫_C L(f, φ_1, ..., φ_n) dν_C.

Let the metric tensors on C and X be respectively denoted by µ_mn, m, n = 1, 2, and g_ij, i, j = 1, ..., n. Suppose that the local coordinates on C are given by x (that is, x := (x^1, x^2)). Then a typical form of such a Lagrangian action integral is

(1) S(f, φ) = (1/(4πλ)) ∫_C ( µ^mn(x) ∂_m f^i(x) ∂_n f^j(x) g_ij(f(x)) + λ ρ(f(x)) K(x) ) dν_C(x),

where λ is a coupling parameter, ν_C is a measure on C, ρ : X → K, ρ ∈ C^∞(X), is a smooth function on X, and K(x) is the Gaussian curvature on C with respect to the metric µ. Here the fields associated to the Lagrangian action integral are given as φ = λ^{-1}(g, λρ).
Remark 2.1. By this notation we mean that the coupling constant λ has dependence on parameters g, ρ.
Remark 2.2. By writing the action in terms of the Laplacian + curvature form in Eq. (1), we have chosen to study the RG flow of this particular conformal field theory (CFT). However, RG flow can be defined more generally for any field theory with any action to start with, not necessarily near a conformal fixed point. See Sec. 4.4 for more discussion of RG flow around general fixed points.

Deformation family of Lagrangian action integrals. Let us denote

S(f, φ_0) := (1/(4πλ)) ∫_C µ^mn(x) ∂_m f^i(x) ∂_n f^j(x) (g_0)_ij(f(x)) dν_C(x),

where φ_0 := λ^{-1}(g_0, 0) is the field associated to a fixed choice of metric g_0. One interesting case of study is to identify the moduli space (the geometric space representing the family) of smooth maps f : C → X which minimize the action integral S(f, φ_0) for a fixed choice of metric g_0 on X. These are often identified with vacuum states of the underlying physical theory governing our system of particles. A rather more interesting question is whether the vacuum states of the underlying theory are stable with respect to infinitesimal deformations of the geometry of C and X respectively, especially in quantum physics, where fields and the geometry of space undergo algebraic or analytic fluctuations. This question can be studied rigorously by inducing deformations of the fields involved in our physical theory, that is

g → g_0 + h,   ρ → ρ,   i.e.   φ = λ^{-1}(g_0 + h, λρ),

where the function h ∈ C^∞(X, T^∨X^{⊗2}) is a symmetric bilinear smooth differential form on X and ρ ∈ C^∞(X, K) is a smooth function on X. Introducing these deformation parameters, one can study the set of extremizing maps f : C → X of the action integral S(f, φ), that is, smooth harmonic maps minimizing S(f, φ), where S(f, φ) is obtained as a local deformation around S(f, φ_0) induced by deforming the geometry of C and X. Let us consider a generalized deformed Lagrangian action

S(f, φ) = (1/(4πλ)) ∫_C ( µ^mn ∂_m f^i ∂_n f^j (g_0 + h)_ij(f) + ε^mn ∂_m f^i ∂_n f^j ω_ij(f) + λ ρ(f) K + λ U(f) ) dν_C,

where, as before, h ∈ C^∞(X, T^∨X^{⊗2}), U ∈ C^∞(X, K), and ω ∈ C^∞(X, ∧^2 T^∨X), an antisymmetric bilinear form, are all regarded as infinitesimal induced deformation parameters. Note that here the deformation parameters φ_1 := λ^{-1}h, φ_2 := λ^{-1}(λρ), φ_3 := λ^{-1}U and φ_4 := λ^{-1}ω may, roughly speaking, be regarded as local coordinates in the space of deformations of S(f, φ_0). Hence we can rewrite one such deformation in terms of the other as an extension

(3) S(f, φ) = S(f, φ_0) + (1/(4πλ)) ∫_C ( µ^mn ∂_m f^i ∂_n f^j h_ij(f) + ε^mn ∂_m f^i ∂_n f^j ω_ij(f) + λ ρ(f) K + λ U(f) ) dν_C.

Moreover, it must be noted that, depending on the underlying physical theory, one may consider situations where S(f, φ_0) is required to be invariant under conformal transformations (C, µ_mn) → (C, e^{-ψ}µ_mn). In that case, should one be interested in preserving the conformal invariance of the deformed Lagrangian action S(f, φ), one requires that the deformation fields ρ and U vanish, as they break the conformal symmetry; the deformations h, ω, however, can be non-vanishing, as their associated integrals are preserved under the conformal group action on C.

2.3. Moduli functors associated to deforming fields and maps simultaneously. We mimic the approach of algebraic geometers for constructing our moduli spaces. Consider the following situation. Let T → Spec K be a finite type parametrizing scheme. The notation means that T is a space (known as a parametrizing scheme in algebraic geometry terms) constructed over the number field K that is topologically compact. Let Map(C, X) : Sch/K → Ab/K be defined as a two-category (i.e. a category which contains objects, their morphisms, and morphisms of morphisms, also known as 2-morphisms), such that the category is fibered over a base category of finite type (parametrizing) schemes over K. The objective of such a functor is to produce families of maps from C to X parametrized by schemes such as T. To state the latter functionality of Map(C, X) in more formal mathematical terms, we say that the groupoid sections of Map(C, X) over any T are given by the sheaf of Abelian groups of T-families of smooth maps from C to X; that is, the groupoid sections of our moduli functor are given by families of maps

(4) f_T : C_T := C × T → X,

such that for any t ∈ T the t-fiber of the family, f_T|_t ≅ {f_t : C → X}, is given by a smooth continuous map from the domain Riemann surface C to X. Roughly speaking, the functor Map(C, X) provides us with a platform to parametrize the smooth maps from C to X in a systematic way over any chosen parametrizing scheme. For instance, given T := Spec(K), a geometric reduced point, the groupoid sections of Map(C, X)(T) are given by single maps f : C → X. Similarly, the fibers of Map(C, X) over a line L (which as a geometric scheme belongs to our category Sch/K of schemes over K) provide a one-dimensional family of maps f_L : C_L → X, and the fibers of Map(C, X) over a surface provide a two-dimensional family of maps, etc.
Now, as the geometric structures of C, X, and hence f, undergo deformations in our theory, similar to the Feynman path integral formalism, we compute the vacuum states of the theory by taking a stochastic average over all admissible weighted morphisms f : C → X which satisfy the smoothness property. In doing so, we further allow certain induced correlation fields, defined in our theory, induced by evaluating the map f at a finite number of smooth distinct marked points p_1, ..., p_l ∈ C. Moreover, we use the Lagrangian action integral constructed in the previous section as a weight function associated to each single map f : C → X. Doing so, we obtain an integral over the space parametrizing tuples (f : C → X, p_1, ..., p_l), where p_i, i = 1, ..., l, are distinct smooth marked points on C:

(5) Z[C, X, p_1, ..., p_l, φ] := ∫_{Map(C, p_1, ..., p_l, X)} f(p_1) ⋯ f(p_l) e^{−S(f, φ)} D_φ(f).

Here D_φ(f) is a measure over Map(C, p_1, ..., p_l, X). Note that by construction S(f, φ) is regarded as a deformation of S(f, φ_0); hence, following the construction in (3), one may rewrite the correlation function (5) in terms of S(f, φ_0) as follows:

(6) Z[C, X, p_1, ..., p_l, φ] = ∫_{Map(C, p_1, ..., p_l, X)} f(p_1) ⋯ f(p_l) e^{−S(f, φ_0)} e^{−(S(f, φ) − S(f, φ_0))} D_φ(f).

3. RENORMALIZATION SEMI-GROUP FLOW
The construction of the renormalization semi-group flow is based on the fact that, in order to make the above integrals well-defined, one may merely consider certain controllable deformation regimes for the fields φ_i; that is, one would like to consider a family of fields φ_i(T), where the scheme T is the parametrizing scheme, used in (4), governing the geometric deformations of maps f_T : C_T → X induced by perturbations of the geometric structures of C and X. The idea is to consider an infinitesimal deformation flow, called the renormalization semi-group flow (as it turns out that our construction in this example only provides a semi-group rather than a group), over the moduli space of maps and field deformations; that is, to consider a morphism of parametrizing families which has a lift to a morphism on the moduli space of action integrals, and which satisfies the semi-group property.
Remark 3.1. We remark again that we are considering, generally speaking, our fields φ_i, i = 1, ..., n, as living in our field algebra, that is, the vector space H^*(X, K)^{⊗n} generated by differentially graded forms on X. Moreover, the action integrals are regarded as morphisms from Map(C, X)(T) × H^*(X, K) to the underlying ground field K, and hence realized as elements of the dual space of Map(C, X)(T) × H^*(X, K).

We now elaborate further on the renormalization flow. In order to define it, we need to formulate a deformation process applied to the geometry of C, X, and then compute the induced deformations of the associated fields φ_i and f with support on the deformed X. Note that the functorial construction of the moduli space of maps allows us to perform this task in a rigorous algebraic manner. Take a scheme T (naively speaking, schemes have geometric spaces as their skeleton, but come further equipped with extra topological or algebraic structure). As we noted above, the fibers of the moduli functor Map(C, X) over T (i.e. Map(C, X)(T)) provide us with a T-family of maps from C to X as in (4). Now choose an algebraic deformation (a perturbation) of T and denote it by T′. Then the fibers Map(C, X)(T′) provide a T′-family realized as a deformation of the former T-family of maps from C to X.
One way of constructing such an algebraic deformation is to construct T′ as a nilpotent thickening of T. We elaborate on this notion using the language of ideals over the ring of polynomial functions.
Take the polynomial ring C[x_1, ..., x_n]. In classical algebraic geometry, the set of prime ideals generated by different expressions involving the variables x_1, ..., x_n makes a space isomorphic to the "affine" space C^n. Now, in order to obtain more interesting spaces, one may consider an ideal, say as an example I = (x_1 x_2 − x_3^2), and consider the quotient ring C[x_1, ..., x_n]/I. This expression means that all polynomials generated by the expression x_1 x_2 − x_3^2 vanish on this quotient ring. Now the set of prime ideals p ⊂ C[x_1, ..., x_n]/I provides us with the set of geometric points of the algebraic space (algebraic variety) given as the solution set to the polynomial equation x_1 x_2 − x_3^2 = 0. Let us denote this algebraic variety by T. In order to obtain a nilpotent thickening of T one can simply construct the quotient ring C[x_1, ..., x_n]/I^l for some l. The set of prime ideals in the latter provides one with the set of geometric points of the variety obtained as the solution set to (x_1 x_2 − x_3^2)^l = 0; call the latter space T′. Due to the natural inclusion of ideals I^l ⊂ I, one immediately obtains a natural inclusion T → T′. This deformation is called a nilpotent extension of T of order l. Given such a nilpotent extension, ι_{TT′} : T → T′, as we elaborated earlier, the renormalization flow must satisfy the semi-group property under composition of such extensions. Since the action of the RG flow is realized as a pullback in our construction, one is able to define its induced action on the correlation function defined in (6) by pulling back the deformed fields. Let us work out a concrete example.
Example 3.2. For simplicity, let us assume that K is a field of characteristic zero, such as C, the field of complex numbers. Consider the case where T := Spec(K[x_1, x_2, ..., x_n]/(x_2, ..., x_n)) ≅ A^1 is given by taking the Zariski spectrum of the affine line in the direction x_1, given by the ideal I = (x_2, ..., x_n) over K. Locally, after choosing a coordinate chart (x_1, ..., x_n), the set of geometric points in T is the set of points on the x_1 axis in C^n. Now we introduce an infinitesimal deformation T → T′, induced by a nilpotent extension of order 2, by taking T′ := Spec(K[x_1, ..., x_n]/I^2). There exists a canonical short exact sequence

0 → I/I^2 → K[x_1, ..., x_n]/I^2 → K[x_1, ..., x_n]/I → 0,

whose kernel is governed by the conormal sheaf (which here is identified with the sheaf of differential one-forms on T, that is, Ω_T). Roughly speaking, this realizes the second order nilpotent thickening of T via the cotangent bundle Ω_T of T. We would like to deform the correlation function (6) in the direction of the fibers of Ω_T. This amounts to setting RG_T as the differential operator which deforms the fields in the direction of the fibers of the cotangent bundle of T; that is, the RG flow acts on the fields as a map φ → φ + dφ, and hence its induced action on the action integral is given by S(f, φ) → S(f, φ + dφ). Therefore, viewing the RG flow as a differential operator acting on the action integral Z, and rewriting the variation of Z, induced by the nilpotent deformation of T, in terms of Z itself, we obtain a differential equation governing the change in Z. We will come back to the discrete version of this differential equation when discussing the construction of the RG flow in AI.
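To spell out the last step, a schematic first-order computation (assuming, purely for illustration, that the measure D_φ(f) is held fixed under the deformation φ → φ + dφ) reads

Z[C, X, p_1, ..., p_l, φ + dφ] = ∫ f(p_1) ⋯ f(p_l) e^{−S(f, φ) − δS(f, φ)} D_φ(f)
                              ≈ Z[C, X, p_1, ..., p_l, φ] − ∫ f(p_1) ⋯ f(p_l) δS(f, φ) e^{−S(f, φ)} D_φ(f),
with   δS(f, φ) = Σ_i (∂S/∂φ_i)(f, φ) dφ_i,

so that the variation of Z is itself expressed as a correlation function of the undeformed theory; the resulting relation between the variation of Z and Z is the differential equation referred to above.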

4. FROM CONVENTIONAL RG TO MACHINE-LEARNING RG
The above section formulates the mathematical foundation of conventional RG. However, several aspects should be upgraded before the idea of RG can find useful applications in machine learning. The main differences between the conventional RG and the machine-learning RG are summarized in Tab. 1 and discussed as follows.

4.1. Continuous v.s. Discrete. The conventional RG in quantum field theory typically assumes that the field is defined on a smooth base manifold. However, this assumption typically does not hold for machine learning applications. For example, images are defined on discrete pixels, and texts are defined on discrete words. The discrete nature of most datasets in machine learning requires us to generalize the base manifold from a continuous space to a discrete lattice. The discretization of the base manifold also forces the RG flow to be discrete, because it is no longer possible to perform an infinitesimal dilation on a discrete lattice. Therefore, instead of writing down a differential equation to describe the continuous RG flow, the discrete RG flow should be described by a recurrence equation. However, in the continuum limit (when the lattice spacing approaches zero), the recurrence equation should converge to the differential equation, as will be shown in Sec. 5.11.

4.2. Semigroup v.s. Group. The conventional RG keeps decimating information in each step of coarse-graining. As a result, the conventional RG is not invertible and only forms a semigroup instead of a group, despite its somewhat inaccurate name of renormalization "group". Recent developments in physics [24] reveal that the RG flow can actually be viewed as a holographic mapping, which is invertible. This not only makes a profound connection from RG to quantum gravity, but also promotes RG to a group.
The conventional RG studies how perturbations of the action (or deformations of the field configuration) get renormalized at larger and larger scales. The invertible RG has a completely different mindset: it aims to answer how the correlated field configurations on the holographic boundary can be disentangled into uncorrelated noise in the holographic bulk, or how the strongly-coupled quantum field theory on the holographic boundary can be reformulated as the weakly-coupled dual gravitational theory in the holographic bulk. By establishing the holographic mapping, any deformation of the field configuration on the holographic boundary can be translated into an excitation in the holographic bulk and analyzed more conveniently. Therefore the invertible RG is a more powerful paradigm of RG. Nevertheless, it can always fall back to the conventional RG by a forgetful map that forgets about the holographic bulk degrees of freedom.

4.3. Human v.s. Machine. The conventional RG scheme is designed by humans. Due to the limitation of human intelligence, the conventional RG always assumes that the action takes a fixed form with specific types of terms, and the RG flow only changes the coefficients of these terms, such that the action can only flow within a predefined moduli space. Although the moduli space allows us to parameterize the action conveniently, it also restricts our imagination. A more general RG flow can go beyond the moduli space, as new terms can be generated under RG and even the field content can change under RG (microscopic and macroscopic descriptions of a system can be fundamentally different, as advocated by the emergence principle). However, such a general RG scheme is not analytically tractable by humans. It is not even clear how to design the RG scheme if the form of the action and the field content are all unknown. Thus it becomes desirable to introduce artificial intelligence to learn the optimal RG scheme automatically from the big data of field configurations generated by a field theory. By learning to generate similar field configurations from independent random noise in the holographic bulk, the machine will create the optimal holographic mapping, which also specifies the optimal (invertible) RG scheme.

4.4. Conformal v.s. General Fixed Point. Conventional RG typically assumes a conformal fixed point to start with. Given the conformal symmetry at the fixed point, the RG transformation is always taken to be the dilation operator in the conformal group, which corresponds to rescaling spacetime and fields together. Given the RG transformation, one can study how a perturbation (or deformation) of the field evolves under dilation. If the perturbation grows stronger/weaker at larger scales, then the perturbation is said to be relevant/irrelevant (with respect to the conformal fixed point). More quantitatively, the conformal dimensions can be defined as the eigenvalues of the dilation generator, such that relevant/irrelevant fields are simply distinguished by their positive/negative conformal dimensions. Intuitively, relevant fields are low-energy/slow-varying modes to be kept under coarse-graining, and irrelevant fields are high-energy/fast-varying modes to be decimated (or integrated out).
However, the more general machine-learning RG does not assume a conformal fixed point, because real-world data (like images or texts) may not be scale-invariant and hence may not respect the conformal symmetry. Therefore, the dilation operator is not well-defined, and one cannot prescribe an explicit RG scheme from the beginning. The RG scheme has to be learned from data using a data-driven approach. In fact, real-world data is more likely to be closer to Gaussian fixed points. So even if one learns the RG scheme, it is not immediately clear whether the RG transformation can be used to infer conformal dimensions, as the data could be far from any conformal fixed point.

4.5. Relevant v.s. Irrelevant. Therefore, the traditional idea of calculating the scaling dimension as eigenvalues of the dilation generator no longer works in more general RG approaches. We need a different way to define what is relevant and what is irrelevant. Ref. [17] proposes an elegant and universal definition of irrelevant degrees of freedom using holographic duality and information theory. The key idea is that irrelevant fields are those degrees of freedom that should be decimated under coarse-graining, so they should appear to us as random noise (i.e. independent/uncorrelated random variables). Since the irrelevant fields are actually the holographic bulk fields under the holographic duality, the above idea can also be rephrased as the statement that the holographic bulk fields are almost uncorrelated. The goal of machine-learning RG is to learn the RG transformation that automatically identifies and separates such irrelevant degrees of freedom in a field theory. We will explain this approach in more detail in Sec. 5.5, after introducing the concrete construction of the machine-learning RG algorithm.
For now, we would like to comment that the information theoretic definition of the irrelevant field is consistent with the conformal dimension definition in the conformal limit: a negative conformal dimension in the conformal field theory (CFT) indicates that the field correlation decays exponentially in the dual anti-de Sitter (AdS) holographic bulk, which is equivalent to the statement that the holographic bulk fields are short-range correlated; they look like independent random noise beyond a finite correlation length and are therefore irrelevant in the information theoretic sense.
5. MACHINE-LEARNING RG VIA FLOW-BASED GENERATIVE MODELS

5.1. Sequential Data and Quantum Field on One-Dimensional Lattice. The idea of the renormalization group can be used to construct novel generative models for unsupervised learning. The discussion will mainly focus on sequential data, although generalizations to images and graphs are possible. A sequence is an ordered set of objects a = (a_1, a_2, ...), where each object a_i ∈ A is taken from an object set A (also known as the vocabulary). In machine learning, each object a_i is usually embedded as a vector φ_i in a finite-dimensional vector space R^n (assuming the dimension to be n). Denoting the embedding map as E : A → R^n, the sequence can be represented as an ordered set

φ = (φ_1, φ_2, ...),   φ_i = E(a_i).

One can also view φ_i as a quantum field on a one-dimensional discrete lattice, as described by the mapping φ : I → R^n, where I ⊂ N denotes the index set (equipped with an ordering). Each index i ∈ I labels an object (or its vector embedding) in the sequence, and the set I describes the one-dimensional lattice. The size (cardinality) |I| of the index set corresponds to the length of the sequence. Let Map(I, R^n) := {φ : I → R^n} be the associated space of all maps from the index set I to the vector space R^n. The objective of unsupervised machine learning is to model the probability measure p(φ)Dφ given the dataset of sequences.
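As a concrete illustration of this representation, the following minimal sketch (Python/PyTorch) turns a sequence of objects into the boundary field φ ∈ Map(I, R^n); the vocabulary, the randomly initialized embedding, and the function names are illustrative assumptions, not the embedding used in our experiments.

import torch
import torch.nn as nn

vocab = list("ACDEFGHIKLMNPQRSTVWY")          # object set A (here: 20 amino acid symbols)
tok2id = {a: i for i, a in enumerate(vocab)}
n = 20                                        # embedding dimension of R^n

E = nn.Embedding(len(vocab), n)               # embedding map E : A -> R^n

def embed_sequence(a):
    """Map a sequence a = (a_1, a_2, ...) to the boundary field phi = (phi_1, phi_2, ...)."""
    ids = torch.tensor([tok2id[x] for x in a])
    return E(ids)                             # shape (|I|, n): one vector phi_i per lattice site i

phi = embed_sequence("MKTIIALSYI")            # a toy length-10 sequence
print(phi.shape)                              # torch.Size([10, 20])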

5.2. Conventional Renormalization forms a Semigroup. The conventional notion of a renormalization group transformation R : Map(I, R^n) → Map(I′, R^n) corresponds to a coarse-graining map that extracts the relevant (coarse-grained) field φ′ = R(φ) from the original (fine-grained) field φ and discards the remaining (irrelevant) field degrees of freedom. The renormalization transformation always reduces the degrees of freedom, therefore the index set becomes smaller, |I′| ≤ |I|, under the renormalization transformation. Because of the information loss, it is no longer possible to recover the original field configuration φ from the coarse-grained configuration φ′. Therefore the renormalization transformation R is not invertible, and only forms a semigroup.

5.3. Invertible Renormalization forms a Group. The key idea to make the renormalization transformation invertible is to keep the irrelevant field ζ′ together with the relevant field φ′ as the joint output of the renormalization transformation, (φ′, ζ′) = R̂(φ). Intuitively, the relevant/irrelevant fields are the low-/high-energy modes in the field configuration. What the renormalization transformation does is to separate the irrelevant field ζ′ and the relevant field φ′ given the original field φ as input. The criterion to separate irrelevant fields will be elaborated in Sec. 5.5.

The bijectivity requires |I′| + |J′| = |I|, i.e. the numbers of relevant and irrelevant features must add up to the total number of features in the original field. The inverse renormalization transformation R̂^{-1} will also be called the generation transformation Ĝ, denoted as

φ = Ĝ(φ′, ζ′) := R̂^{-1}(φ′, ζ′).

As the transformation is invertible, the renormalization group (RG) is promoted from a semigroup to a group.

5.4. Renormalization Group Flow.
The invertible renormalization transformation enables us to define an invertible renormalization group (RG) flow on both the field configuration level and the probability measure (or action) level.

5.4.1. RG Flow on the Field Level. Repeating the invertible renormalization transformation, an RG flow can be defined (on the field configuration level) via the iteration

(14) (φ^(k), ζ^(k)) = R̂^(k)(φ^(k−1)),

where φ^(k) ∈ Map(I^(k), R^n) and ζ^(k) ∈ Map(J^(k), R^n) are the relevant and irrelevant fields, and R̂^(k) : Map(I^(k−1), R^n) → Map(I^(k), R^n) ⊗ Map(J^(k), R^n) is the (bijective) renormalization transformation at the k-th step. The condition |I^(k)| + |J^(k)| = |I^(k−1)| is always satisfied as a necessary condition for bijectivity. The iteration defines a flow of quantum fields, called the renormalization flow (R-flow):

(15) R̂ : φ = φ^(0) ↦ ζ := (ζ^(1), ζ^(2), ..., ζ^(K)).
Reversing the iteration defines the generation flow (G-flow), which reconstructs the finer field at each step from the coarser relevant field and the corresponding irrelevant field,

(16) φ^(k−1) = Ĝ^(k)(φ^(k), ζ^(k)),   Ĝ^(k) := (R̂^(k))^{-1},

iterated from k = K down to k = 1. The entire G-flow corresponds to a map that decodes the irrelevant fields ζ to the original field φ, denoted as

(17) φ = Ĝ(ζ).
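The iteration of Eqs. (14)-(17) amounts to a short loop. The sketch below (Python) treats each RG step as a hypothetical object with `forward` and `inverse` methods (for example, the neural ODE blocks introduced later in Sec. 5.7); it is an outline of the bookkeeping, not the exact implementation.

def r_flow(phi, steps):
    """R-flow (Eq. 15): phi^(0) -> (zeta^(1), ..., zeta^(K)), splitting at each step (Eq. 14)."""
    zetas = []
    for R in steps:                               # steps = [R^(1), ..., R^(K)]
        phi, zeta = R.forward(phi)                # (phi^(k), zeta^(k)) = R^(k)(phi^(k-1))
        zetas.append(zeta)
    return phi, zetas                             # deepest relevant field and all bulk fields

def g_flow(phi_K, zetas, steps):
    """G-flow (Eqs. 16-17): reconstruct phi^(0) from the bulk fields by inverting each step."""
    phi = phi_K
    for R, zeta in zip(reversed(steps), reversed(zetas)):
        phi = R.inverse(phi, zeta)                # phi^(k-1) = G^(k)(phi^(k), zeta^(k))
    return phi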

5.4.2. RG Flow on the Probability Measure (Action) Level. The RG flow of the field φ → ζ induces a flow of the associated probability distribution over Map(I, R^n). Under the bijective map between the original field φ and the irrelevant field ζ, the probability measure must remain invariant,

(18) p_Φ(φ) Dφ = p_Z(ζ) Dζ.

Given ζ = R̂(φ) and φ = Ĝ(ζ), Eq. (18) implies that the probability distributions are related by

(19) p_Φ(φ) = p_Z(R̂(φ)) |det ∂_φ R̂(φ)|,   p_Z(ζ) = p_Φ(Ĝ(ζ)) |det ∂_ζ Ĝ(ζ)|,

where |det ∂_φ R̂(φ)| denotes the absolute value of the Jacobian determinant of the transformation R̂, and similarly for |det ∂_ζ Ĝ(ζ)|. More specifically, in each step of the transformation, the probability measure is deformed by (along the G-flow)

(20) p^(k−1)_Φ(φ^(k−1)) = p^(k)_Φ(φ^(k)) p^(k)_Z(ζ^(k)) |det ∂ Ĝ^(k)(φ^(k), ζ^(k))|^{-1}.

In quantum field theory, the field action is defined as the negative log-likelihood of the field configuration, i.e. S^(k)_Φ(φ^(k)) := −log p^(k)_Φ(φ^(k)) and S^(k)_Z(ζ^(k)) := −log p^(k)_Z(ζ^(k)). In terms of the field action, the transformation relates

(21) S^(k−1)_Φ(φ^(k−1)) = S^(k)_Φ(φ^(k)) + S^(k)_Z(ζ^(k)) + S^(k)_ΦZ(φ^(k), ζ^(k)),

where the coupling action S^(k)_ΦZ(φ^(k), ζ^(k)) is defined to be the log Jacobian determinant of the Ĝ^(k) transformation,

(22) S^(k)_ΦZ(φ^(k), ζ^(k)) := log |det ∂ Ĝ^(k)(φ^(k), ζ^(k))|.

Therefore the renormalization transformation R̂ of the relevant field φ^(k) induces a deformation Ḡ of the relevant field action S^(k)_Φ(φ^(k)) along the generative direction,

(23) S^(k−1)_Φ = Ḡ^(k)(S^(k)_Φ).
In this way, the renormalization flow of the action, R̄ := Ḡ^{-1}, is defined as the pullback of the renormalization flow R̂ of the field. The RG transformation is invertible on both the field and the action level, making the renormalization group literally a group.
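In code, one step of the action bookkeeping in Eqs. (21)-(22) looks as follows (a Python sketch under the assumption that each step exposes its log-Jacobian; the variable names and interface are illustrative, not prescribed by the method).

import torch

def action_step(S_phi_k, zeta_k, logdet_R_k):
    """Eq. (21): S_Phi^(k-1) = S_Phi^(k) + S_Z^(k) + S_PhiZ^(k).
    Here S_Z^(k) = 0.5*||zeta^(k)||^2 (Gaussian bulk prior, Eq. 26) and
    S_PhiZ^(k) = log|det dG^(k)| = -log|det dR^(k)| (Eq. 22)."""
    S_Z = 0.5 * (zeta_k ** 2).sum()
    S_PhiZ = -logdet_R_k
    return S_phi_k + S_Z + S_PhiZ

# toy usage: one coarse-graining step with random numbers
S_prev = action_step(torch.tensor(0.0), torch.randn(8), torch.tensor(0.3))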

5.5. Criterion to Separate Irrelevant Fields. What has not been explained so far is the criterion to separate relevant fields from irrelevant fields. Ref. [17] argues that the irrelevant fields should look like independent random variables (or random maps), because the irrelevant fields are supposed to be discarded under the conventional RG flow, meaning that (in the ideal limit) they do not contain information and should appear like random noise. Guided by this intuition, Ref. [17] further proposes the minimal bulk mutual information (minBMI) principle as the design principle of the renormalization flow: the optimal renormalization transformations {R̂^(k)}_{k=1:K} should be defined as the maps that minimize the mutual information among all irrelevant fields,

(24) min_{ {R̂^(k)}_{k=1:K} } Σ_{(k,j) ≠ (k′,j′)} I(ζ^(k)_j : ζ^(k′)_{j′}).

The minimum is achieved when the irrelevant fields are statistically independent, i.e.

(25) p_Z(ζ) = ∏_{k,j} p_Z(ζ^(k)_j),

such that all mutual information vanishes. The optimal solution of R̂ that converges to this limit can be found using machine learning approaches, by constructing a trainable bijective map Ĝ := R̂^{-1} (as the composition of smaller bijective maps Ĝ^(k) at each RG step) to reproduce the data distribution p_Φ(φ) starting from the independent prior distribution p_Z(ζ) in Eq. (25). The related methods were developed in Refs. [17,18,20] under the name of neural-RG. A conventional choice is to take each p_Z(ζ^(k)_j) to be the standard normal distribution (Gaussian with zero mean and unit variance), such that

(26) S_Z(ζ) = −log p_Z(ζ) = (1/2) Σ_{k,j} ||ζ^(k)_j||² + const.

This action describes irrelevant field fluctuations that are massive in the holographic bulk, which is compatible with the idea of holographic duality.

5.6. Hierarchical Structure and Hyperbolic Space. As the renormalization transformation reduces the relevant degrees of freedom, the size of the relevant index set gradually shrinks, |I^(k)| ≤ |I^(k−1)|. To be more concrete, we restrict our discussion to the case where the degrees of freedom are reduced by half under each renormalization transformation, i.e. |I^(k)| = |I^(k−1)|/2, such that

(27) |I^(k)| = 2^{−k} |I^(0)|.

Then the condition |I^(k)| + |J^(k)| = |I^(k−1)| implies |J^(k)| = 2^{−k} |I^(0)|. The RG flow will stop when the relevant index set is exhausted, which sets the total number K of RG steps to be

(28) K = log_2 |I^(0)|.

As illustrated in Fig. 1, the hierarchical structure of the RG flow generates an ordered collection of index sets {J^(k)}_{k=1:K}, which can be combined into a hyperbolic lattice (a discrete hyperbolic space), described by

(29) J := ∪_{k=1}^{K} J^(k).

Instead of thinking of the irrelevant fields as separate mappings ζ^(k) ∈ Map(J^(k), R^n), we can treat them jointly as a field ζ ∈ Map(J, R^n) defined on the hyperbolic lattice J. Therefore, the R-flow ζ = R̂(φ) and the G-flow φ = Ĝ(ζ) respectively define encoding and decoding maps that connect the field φ in one-dimensional flat space to the field ζ in two-dimensional hyperbolic space, which explicitly realizes the holographic duality in quantum gravity.
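The index-set bookkeeping of the halving scheme in Eqs. (27)-(29) can be checked with a few lines of arithmetic (Python; purely illustrative numbers).

N = 16                                          # |I^(0)|: boundary sequence length (a power of 2)
sizes_I, sizes_J = [N], []
while sizes_I[-1] > 1:
    sizes_I.append(sizes_I[-1] // 2)            # |I^(k)| = |I^(k-1)| / 2        (Eq. 27)
    sizes_J.append(sizes_I[-2] - sizes_I[-1])   # |J^(k)| = 2^(-k) |I^(0)|
K = len(sizes_J)                                # K = log2 |I^(0)| = 4           (Eq. 28)
J_total = sum(sizes_J)                          # |J| = sum_k |J^(k)| = N - 1    (Eq. 29)
print(K, sizes_J, J_total + sizes_I[-1])        # 4 [8, 4, 2, 1] 16: bulk sites plus the last relevant site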

5.7. Realization of Bijective Transformation. To optimize the renormalization transformation R̂, one relies on the construction of a trainable bijective map to model R̂. The machine learning community has provided several realizations of trainable bijective maps, including real NVP [4] and neural ODE [2]. In the following, we focus on the neural ODE realization, as it can capture multi-modular features better than real NVP, which makes it more suitable for processing sequences of discrete objects.

5.7.1. Neural ODE. Each single-step renormalization transformation (φ′, ζ′) = R̂(φ) can be realized by an ordinary differential equation (ODE). Starting from φ(0) = φ, first evolve φ(t) from t = 0 to t = 1 following

(30) ∂_t φ(t) = f_θ(φ(t), t),

where f_θ is a trainable function (realized as a neural network) parameterized by the neural network parameters θ. Then split the result as φ(1) = (φ′, ζ′) to obtain φ′ and ζ′. Here t is an auxiliary time. The inverse transformation is simply given by the time-reversed evolution, therefore the mapping is indeed bijective as desired.
Apart from the transformation itself, the log Jacobian determinant of R̂ can also be evaluated. Based on the ODE in Eq. (30), one has

(31) ∂_t log |det ∂_{φ(0)} φ(t)| = Tr ∂_{φ(t)} f_θ(φ(t), t),

which can be integrated to

(32) log |det ∂_φ R̂(φ)| = ∫_0^1 Tr ∂_{φ(t)} f_θ(φ(t), t) dt.

Given that Ĝ := R̂^{-1}, its log Jacobian determinant is simply given by a negation, which will be useful for the evaluation of the coupling action in Eq. (22).
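A minimal neural-ODE bijector in the spirit of Eqs. (30)-(32) can be sketched as follows (Python/PyTorch). This is a single-sample sketch with a hand-rolled fixed-step Euler integrator and an exact Jacobian trace; the network shape and step count are illustrative assumptions, and practical implementations use adaptive ODE solvers (e.g. those of Ref. [2]) and stochastic trace estimators instead.

import torch
import torch.nn as nn

class ODEBijector(nn.Module):
    """One renormalization step: integrate d phi/dt = f_theta(phi, t) from t = 0 to 1 (Eq. 30)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def f(self, phi, t):
        return self.net(torch.cat([phi, phi.new_full((1,), t)]))   # f_theta(phi, t), single sample

    def forward(self, phi, n_steps=20, inverse=False):
        """Euler integration; also accumulates log|det| via the trace formula (Eqs. 31-32)."""
        dt = (-1.0 if inverse else 1.0) / n_steps
        t = 1.0 if inverse else 0.0
        logdet = torch.zeros(())
        for _ in range(n_steps):
            jac = torch.autograd.functional.jacobian(lambda x: self.f(x, t), phi)
            logdet = logdet + dt * torch.trace(jac)     # d/dt log|det| = Tr(d f_theta / d phi)
            phi = phi + dt * self.f(phi, t)
            t = t + dt
        return phi, logdet

bij = ODEBijector(dim=4)
phi0 = torch.randn(4)
phi1, logdet = bij(phi0)                 # forward evolution phi(0) -> phi(1), with log-Jacobian
phi_back, _ = bij(phi1, inverse=True)    # time-reversed evolution approximately recovers phi0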

5.7.2. Locality and Translational Symmetry.
It is possible to design the ODE function f_θ to respect locality and translational symmetry. The idea is to realize f_θ using layers of convolutional neural networks (CNN) with finite kernels followed by element-wise activations.
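One way to build such a local, translation-equivariant vector field f_θ is from 1D convolutions with small kernels (Python/PyTorch sketch; the layer widths and kernel size are illustrative assumptions).

import torch
import torch.nn as nn

class LocalODEFunc(nn.Module):
    """f_theta(phi, t) built from 1D convolutions: finite kernels give locality,
    weight sharing across sites gives translational symmetry."""
    def __init__(self, n_channels, hidden=32, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.conv1 = nn.Conv1d(n_channels + 1, hidden, kernel, padding=pad)
        self.conv2 = nn.Conv1d(hidden, n_channels, kernel, padding=pad)

    def forward(self, phi, t):
        # phi: (batch, n_channels, length); append t as an extra constant channel
        t_chan = phi.new_full((phi.shape[0], 1, phi.shape[2]), t)
        h = torch.tanh(self.conv1(torch.cat([phi, t_chan], dim=1)))
        return self.conv2(h)

f = LocalODEFunc(n_channels=20)
phi = torch.randn(8, 20, 16)              # batch of 8 sequences, n = 20, length 16
print(f(phi, 0.5).shape)                  # torch.Size([8, 20, 16])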

5.8. Objective Function. The objective is to train the generative model such that the model distribution p_Φ(φ) matches the data distribution p_dat(φ) as closely as possible. This can be achieved by minimizing the Kullback-Leibler (KL) divergence

L = KL(p_dat ∥ p_Φ) = E_{φ∼p_dat}[S_Φ(φ)] − H(p_dat),

where S_Φ(φ) = −log p_Φ(φ) is the model action (as the negative log-likelihood), and H(p_dat) = −∫ Dφ p_dat(φ) log p_dat(φ) is the data entropy. As the data entropy H(p_dat) is independent of the model parameters, it can be dropped from the loss function L. Therefore the loss function is essentially the ensemble average of the model action on the dataset. By minimizing the average action, the ODE function f_θ in each RG transformation gets trained. Upon convergence, the algorithm finds the optimal invertible RG flow that maps the (presumably) strongly coupled original field φ on the holographic boundary to the weakly coupled irrelevant field ζ in the holographic bulk.

5.9. Summary of the Algorithm. Given a set of sequences from the data, the learning algorithm goes as follows.
(1) For each given sequence a = (a_1, a_2, ...), represent each object a_i in the sequence as a vector φ_i = E(a_i) ∈ R^n. Denote the sequence of vectors as a vector field φ = (φ_1, φ_2, ...) ∈ Map(I, R^n).
(2) Apply the invertible renormalization transformations (φ^(k), ζ^(k)) = R̂^(k)(φ^(k−1)) for k = 1, ..., K, starting from φ^(0) = φ.
(a) Each step of the transformation is implemented by solving an ODE: starting from the initial condition φ^(k−1)(0) = φ^(k−1), integrate from t = 0 to t = 1, and then split the final result into φ^(k−1)(1) = (φ^(k), ζ^(k)).
(b) While solving the ODE, simultaneously integrate the trace of the Jacobian along the time evolution to obtain the coupling action S^(k)_ΦZ(φ^(k), ζ^(k)), as in Eqs. (22) and (32).
(3) Starting from the initial condition S^(K)_Φ = 0, collect the action in reverse order (along the generation flow) via S^(k−1)_Φ(φ^(k−1)) = S^(k)_Φ(φ^(k)) + S^(k)_Z(ζ^(k)) + S^(k)_ΦZ(φ^(k), ζ^(k)). The resulting total action will be denoted as S_Φ(φ) := S^(0)_Φ(φ^(0)).
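Putting the steps together, training reduces to minimizing the dataset average of the total action. The sketch below (Python/PyTorch) assumes hypothetical bijective step modules exposing a `forward_with_logdet` interface (the neural ODE blocks of Sec. 5.7) and a generic data `loader`; it is an outline of the algorithm rather than the exact implementation.

import torch

def total_action(phi, steps):
    """Steps (2)-(3): run the R-flow, accumulating S_Z^(k) + S_PhiZ^(k) with S_Phi^(K) = 0."""
    S = phi.new_zeros(())
    for R in steps:                                        # steps = [R^(1), ..., R^(K)]
        (phi, zeta), logdet = R.forward_with_logdet(phi)   # (2a) split; (2b) integrated trace, Eq. (32)
        S = S + 0.5 * (zeta ** 2).sum() - logdet           # (3) add S_Z^(k) and S_PhiZ^(k) = -logdet
    return S                                               # S_Phi(phi)

def train(steps, parameters, loader, epochs=10, lr=1e-3):
    """Minimize L = E_{phi ~ p_dat}[S_Phi(phi)], i.e. the KL divergence up to the constant data entropy."""
    opt = torch.optim.Adam(parameters, lr=lr)
    for _ in range(epochs):
        for phi in loader:                                 # minibatches of embedded sequences (step (1))
            loss = total_action(phi, steps) / phi.shape[0] # average action per sample
            opt.zero_grad()
            loss.backward()
            opt.step()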

5.10. Potential Applications and Advantages.
After training, the model could potentially be used for the following tasks.
• Inference of hierarchical latent representation. Using ζ = R̂(φ), one can infer the hierarchical latent representation ζ of any sequence encoding φ. The high-level representations (ζ^(k) with a large k) can be viewed as encodings of the entire sequence, which can be used in downstream tasks like classification and translation.
• Likelihood estimation. Using S_Φ(φ), one can estimate the probability density p_Φ(φ) ∝ exp(−S_Φ(φ)) for any field configuration φ. This is useful for anomaly detection.
• Sample generation. As a generative model, new samples can be generated by first sampling ζ in the hyperbolic space and then transforming it to φ = Ĝ(ζ) using the generation flow, which may find applications in completing missing objects in a sequence.
The proposed algorithm is advantageous in the following aspects.
• Disentangled features across scales. The optimal RG flow distills features at different scales, allowing the model to capture the long-range and multi-scale correlations in the sequential data. The features are automatically arranged in a hyperbolic space, making them easy to access and control.
• Efficient inference/generation. The hierarchical and iterative approach enables the model to infer latent fields or generate original fields in Θ(N) complexity (given the sequence length N), which is superior to the Θ(N²) complexity of transformer-based approaches, especially when the sequence is long.
• Ability to process hierarchical structure. The renormalization transformation can progressively extract coarse-grained features from fine-grained features, making it capable of capturing global features (such as the parity of bit strings).
In comparison, as shown in Ref. [11], self-attention-based models cannot efficiently model hierarchical structures unless the number of layers/heads increases with the sequence length.

5.11. Recovering Conventional RG by Integrating out Irrelevant Fields. Finally, we would like to comment that the invertible renormalization can fall back to the conventional renormalization by integrating out irrelevant fields. Recall from Eq. (14) that in each step of the invertible renormalization transformation, the original (fine-grained) field φ^(k−1) is separated into the relevant field φ^(k) and the irrelevant field ζ^(k) by (φ^(k), ζ^(k)) = R̂^(k)(φ^(k−1)). The invertible renormalization R̂^(k) can be downgraded to a non-invertible renormalization R^(k) by a forgetful map which forgets about the irrelevant field ζ^(k), such that φ^(k) = R^(k)(φ^(k−1)) only transforms the fine-grained field φ^(k−1) to the coarse-grained field φ^(k). According to Eq. (21), the actions are related by

S^(k−1)_Φ(φ^(k−1)) = S^(k)_Φ(φ^(k)) + S^(k)_Z(ζ^(k)) + S^(k)_ΦZ(φ^(k), ζ^(k)),

where the irrelevant field ζ^(k) is massive and is described by the Gaussian action S^(k)_Z(ζ^(k)) = ½ ||ζ^(k)||², as in Eq. (26). Because ζ^(k) represents the high-energy modes that should be integrated out under renormalization, one can argue that the fluctuation of ζ^(k) can be treated perturbatively due to its large mass, which justifies expanding the action around ζ^(k) → 0. As the approximate action is quadratic in ζ^(k), one can perform a Gaussian integration over ζ^(k), which leaves an effective action for the coarse-grained field φ^(k) alone. Therefore one can define the renormalization transformation R̄^(k) on the action via S^(k)_Φ = R̄^(k)(S^(k−1)_Φ), in correspondence to the field renormalization φ^(k) = R^(k)(φ^(k−1)). The explicit form of the renormalization operator R̄ obtained from this Gaussian integration reproduces the pullback construction of the action renormalization. If one further defines the infinitesimal generator of R̄ as r̄ = log R̄, the renormalization flow can, in the continuum limit, be expressed as a differential equation [21,23], recovering the conventional RG flow discussed in Sec. 3.

6. EXPERIMENTS ON GENOMIC SEQUENCES

6.1. Problem Overview. Extracting the hidden information of genomic sequences has been a critical subject in biological research, with relevance to epidemiology, immunology, protein design and many other subfields. Given its great similarity to natural language processing problems, there are numerous studies applying machine learning techniques to extract information from genomic sequences, using architectures such as word2vec [6,28], bidirectional long short-term memory [16], and transformers [19]. However, while the existing algorithms provide single-gene level embeddings, they do not provide a canonical sequence embedding, and the hierarchical information is not apparent from the natural language models. We therefore apply the renormalization group idea from the previous sections to the genomic sequence representation problem, where the hierarchical structure provides biological information at different energy levels: the deeper layers capture longer correlations in the sequence, and the deepest layer thus provides a canonical embedding of the sequence. We take the Influenza HA amino acid sequences as an example (the data can be downloaded from the "Protein Sequence Search" section of https://www.fludb.org), where the sequences are regarded as one-dimensional lattices as described in Sec. 5.1. Fig. 2 shows samples of the sequence data; there are clear global features, as one can see from the similarities between different sequences. (FIGURE 2. Sample Influenza HA amino acid sequences.)

6.2. Single Amino Acid Distribution Learning. Before proceeding to full sequences in the RG scheme, we need to verify that local features are efficiently learned.
Thus we first look at learning the single amino acid distribution. As shown in Fig. 2, at a fixed location i among the sequences there is a discrete distribution over amino acids; we pick that amino acid from each sequence. Each sample is then labeled by a = (a_i), where a_i ∈ A represents the single amino acid. We apply the pre-trained single-amino-acid level embedding E : A → R^n from Ref. [16], where n = 20. After embedding, the boundary field is φ = (φ_i), where φ_i = E(a_i) ∈ R^n. To remove the difficulty of transforming the discrete boundary distribution into the uncorrelated continuous Gaussian distribution in the bulk, we add a small randomness to the boundary field, i.e. φ_i → φ_i + ε, where p(ε) is a normal distribution with zero mean and small variance. For simplicity, in the following we still use φ_i to denote the fields with this small randomness added.
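A minimal illustration of this dequantization step (Python/PyTorch; the randomly initialized embedding below is only a stand-in for the pretrained embedding of Ref. [16], and the noise scale is an illustrative choice):

import torch

aa_vocab = list("ACDEFGHIKLMNPQRSTVWY")          # 20 standard amino acids
emb = torch.nn.Embedding(len(aa_vocab), 20)      # stand-in for the pretrained embedding E

def dequantized_field(a, sigma=0.05):
    """phi_i = E(a_i) + eps, with eps ~ N(0, sigma^2) applied per site."""
    ids = torch.tensor([aa_vocab.index(x) for x in a])
    phi = emb(ids)
    return phi + sigma * torch.randn_like(phi)

phi = dequantized_field("MKAILVVLLY")            # a toy length-10 fragment
print(phi.shape)                                 # torch.Size([10, 20])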
With this setup, we train a neural ODE model to realize the bijective transformation between the data distribution and a Gaussian distribution, as described in Sec. 5.7.1. To further speed up the training process, we add the Jacobian and kinetic regularizations of Ref. [7] to find an optimal bijective map. The input data is the 4th single amino acid from 1000 Influenza HA sequences. The ODE transformation f_θ(x, t) is constructed as a feed-forward network with 4 sequential hidden layers, as shown in Fig. 3. The hidden layers are the concatsquash (CS) layers defined in Ref. [10]. As shown in Fig. 4, a 2-dimensional feature space can be obtained by applying the t-distributed stochastic neighbor embedding (t-SNE) algorithm [27] to the original data and the flow-generated data embedding vectors. The flow-generated data are obtained by taking the inverse transformation R̂^{-1} of vectors drawn from the Gaussian distribution. The original data distribution, with its multi-modular features, can be perfectly captured after training.

6.3. Amino Acid Sequence Distribution Learning. To train on sequences, we adopt the hierarchical RG scheme described in Sec. 5.6. As discussed in the previous section, the input sequence is represented as labels a = (a_1, a_2, ..., a_I), with the cardinality I denoting the length of the sequence. With the pretrained embedding φ_i = E(a_i) ∈ R^n, the boundary fields are represented as φ = (φ_1, φ_2, ..., φ_I). Thus, with the initial boundary fields φ^(0) = φ, we can run the renormalization flow using Eq. (15); conversely, the generation flow Eq. (17) reconstructs the original field. Following the notation of MERA networks, each renormalization transformation layer consists of a disentangler layer followed by a decimator layer, where the disentangler layer disentangles the local correlations and the decimator layer separates the decimated fields out as bulk fields. In Fig. 5 we show an illustration of the model structure, with green blocks as disentanglers and yellow blocks as decimators; each block is a bijective transformation with the neural ODE structure of Fig. 3. To write down the transformation equations explicitly, the covering length of a disentangler or a decimator is defined as the kernel length l. Then there are I/(2^k l) blocks in the k-th layer. (FIGURE 5. An illustration of the MERA structure with kernel l = 2. Green blocks are disentangler blocks, yellow blocks are decimator blocks. After each decimator layer, half of the fields are redefined as bulk fields ζ.)
For the m-th block in the k-th layer, where m ∈ {0, ..., I/(2^k l) − 1}, the transformation is a bijective block map acting on the l fields covered by that block, and half of the resulting fields are redefined as the bulk fields ζ^(k). Here we have chosen a scheme in which, after each layer, every other remaining field is redefined as a new bulk field. Since there are position-dependent features among the sequences, to respect the local features of the sequence we take independent block transformations, labeled by both the layer index k and the block index m. With this setup, we train on the objective L = E_{φ∼p_dat}[S_Φ(φ)] as described in Sec. 5.9.
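The disentangler/decimator layout just described can be sketched as follows (Python; the blocks are assumed to be bijective callables, e.g. neural ODE blocks as in Fig. 3, mapping a window of l field vectors to l field vectors, and the even/odd decimation rule is an illustrative reading of the scheme above rather than the exact implementation).

def rg_layer(fields, disentanglers, decimators, l=2):
    """One renormalization layer (Fig. 5): a row of disentangler blocks followed by a row of
    decimator blocks; every other output of the decimators is split off as a bulk field zeta."""
    assert len(fields) % l == 0
    # disentangler row: independent bijective blocks on non-overlapping windows of length l
    out = []
    for m in range(len(fields) // l):
        out.extend(disentanglers[m](fields[m * l:(m + 1) * l]))
    # decimator row: another row of independent bijective blocks on the disentangled fields
    mixed = []
    for m in range(len(out) // l):
        mixed.extend(decimators[m](out[m * l:(m + 1) * l]))
    relevant = mixed[0::2]        # phi^(k): kept and passed to the next layer
    bulk = mixed[1::2]            # zeta^(k): redefined as bulk fields
    return relevant, bulk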
In Fig. 6 and Fig. 7, we show the results for I = 4, l = 2 and for I = 16, l = 4, with the same set of data as in the previous section. To compare the joint distributions, we concatenate the vector embeddings of the 4 (respectively 16) amino acids of each sequence and train the t-SNE algorithm on these concatenated vectors. We also compute the normalized logarithmic probability, defined as log_n p = log p/(nI), with n the embedding dimension and I the sequence length. The numbers in the parentheses are the normalized logarithmic probabilities before training. Both results show that the original joint data distribution can be captured using the RG scheme with local neural ODE blocks.

By training on the full sequence, one obtains hierarchical information from each layer, which suggests a natural way to search for escape viruses. The shallow layers mainly capture the local information, while the deeper layers hold the global information. An escape virus should have good local fitness while carrying content that differs from the existing dataset. One can therefore use this separation of information levels to design rules for escape-virus search or to train a downstream classification task.

6.4. Learning Viral Escape Mutation. We conclude our investigation of learning protein sequence distributions by studying the predictive performance of our model on viral escapes. Viral escapes are mutations in viral protein sequences that make them unrecognizable to the human immune system. In other words, although the mutated viruses are still effective on the human body and cause infection, the immune system does not flag the mutated sequence as a threat. Such mutations can be single or multiple, that is, only one or a few amino acids may be mutated at once; hence, identifying underlying patterns in viral escape mutations is essential for viral vaccine development. As described in Ref. [16], in terms of language models a viral sequence can be regarded as textual data, and a viral escape mutation is seen as a word change in a sentence that changes the semantics of the sentence while the sentence remains grammatically meaningful. With this analogy, a viral escape is one capable of making the immune system falsely flag the mutant as a harmless sequence (a change in semantics), while the mutant preserves the virus's evolutionary structure (grammatical correctness). Therefore, among all possible mutations in a viral sequence, we search for viral escapes as those which result in both a high semantic change and a high grammaticality in our model. Figure 8 depicts an example of all possible mutations in the test sequence. Following this idea (constrained semantic change search (CSCS) [16]), we first train our model on a corpus of viral sequences in an unsupervised fashion, then take a given viral protein sequence with its known viral escape mutations and rank the mutations based on their grammaticality and semantic change. In our construction, the semantic change caused by a mutation is regarded as the change in the internal representation of the deepest layer of our model before and after the mutation happens. In other words, given the test sequence a = (a_1, a_2, ..., a_i, ..., a_I) and its mutant counterpart ā = (a_1, a_2, ..., ā_i, ..., a_I), the semantic change is defined as ∆ζ = |ζ^(K)(a) − ζ^(K)(ā)|, where K indicates the deepest layer in the hierarchical structure of our model.
According to the CSCS objective, grammaticality can be defined as how probable a mutation is under the model, i.e. the probability value that the model assigns to a mutation. With this definition, one natural choice of grammaticality is the conditional probability p(ā_i | a) of the mutation ā_i in the test protein [16]. In our model, the joint probability p_Φ(a) is the optimization objective, and it is used to evaluate the grammaticality of the input. Therefore the final score for each mutation is defined as

(43) Score := ∆ζ + p(ā_i | a).

Note that throughout the evaluation of our model on viral escape mutations we only consider single mutations in the test data. We also keep the size of our samples at 32 amino acids; with 25 different amino acid symbols as the building blocks of the sequences, there are 768 (24 × 32) possible mutations. Among those, a small subset are viral escape mutations that are given to us in advance. For this experiment we used the escape mutation dataset of [5], which indicates that 65 out of those 768 mutations are viral escapes. After calculating both the semantic change and the grammaticality of the mutations, we ranked each mutation based on its score in Eq. (43). Mutations with the highest ranking values are considered predicted viral escapes; consequently, lower rankings indicate that a mutation is less likely to be a viral escape. Fig. 9a and Fig. 9b illustrate the grammaticality and semantic change of all mutations, including the viral escapes (red points), which clearly shows that viral escapes tend to have a high grammaticality.
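As a sketch of this scoring loop (Python; `semantic_repr` and `grammaticality` are hypothetical callables standing in for the deepest-layer representation ζ^(K) and the model probability of a mutation, respectively — an illustration of Eq. (43), not the exact implementation):

import numpy as np

def escape_scores(seq, vocab, semantic_repr, grammaticality):
    """Rank all single-site substitutions by the CSCS-style score of Eq. (43):
    semantic change (deepest-layer representation shift) plus grammaticality."""
    z_ref = semantic_repr(seq)                              # zeta^(K)(a) of the wild-type sequence
    scores = {}
    for i, a_i in enumerate(seq):
        for b in vocab:
            if b == a_i:
                continue                                    # skip the identity "mutation"
            mutant = seq[:i] + b + seq[i + 1:]              # single substitution at position i
            dz = np.abs(semantic_repr(mutant) - z_ref).sum()        # semantic change Delta zeta
            scores[(i, b)] = dz + grammaticality(mutant, i, b)      # Eq. (43)
    return sorted(scores.items(), key=lambda kv: -kv[1])    # highest score = predicted escape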
We also calculated the area under the curve (AUC) of the ranking scores of the mutations in Fig. 10. Our results clearly indicate that the grammaticality and the semantic change have a similar impact on the AUC value. (Figure caption: in the list of columns, "pos" is the position in the sequence to be mutated, "sub" is the substituted amino acid, "mut" is the mutant amino acid, and "is-escape" indicates which substitutions are viral escapes. (B) Grammaticality vs. semantic change of all mutations; red points indicate the viral escape mutations. Note that the graph is not to scale.)