Monolith: Circuit-Friendly Hash Functions with New Nonlinear Layers for Fast and Constant-Time Implementations

Abstract. Hash functions are a crucial component in incrementally verifiable computation (IVC) protocols and applications. Among those, recursive SNARKs and folding schemes require hash functions to be both fast in native CPU computations and compact in algebraic descriptions (constraints). However, neither SHA-2/3 nor newer algebraic constructions, such as Poseidon, achieve both requirements. In this work we overcome this problem in several steps. First, for certain prime field domains we propose a new design strategy called Kintsugi, which shows how to construct nonlinear layers of high algebraic degree that allow fast native implementations and, at the same time, an efficient circuit description for zero-knowledge applications. Then we suggest another layer, based on the Feistel Type-3 scheme, and prove wide trail bounds for its combination with an MDS matrix. We propose a new permutation design named Monolith to be used as a sponge or compression function. It is the first arithmetization-oriented function with a native performance comparable to SHA3-256. At the same time, it outperforms Poseidon in a circuit using the Merkle tree prover in the Plonky2 framework. Contrary to previously proposed designs, Monolith also allows for efficient constant-time native implementations which mitigate the risk of side-channel attacks.


Introduction

Hash Functions in Zero-Knowledge Frameworks
Zero-knowledge use cases, and particularly the area of computational integrity combined with zero knowledge, have seen a rise in popularity in the last couple of years. Many new protocols [GWC19, ZGK+22, KST22, BC23] and low-level primitives [AGR+16, AAE+20, GKR+21] have been designed and published recently in an attempt to increase the performance in this setting. The emergence of folding techniques and recursive SNARKs (incrementally verifiable computation, or IVC [Val08]) makes it possible to efficiently prove the integrity of complex computations. Proofs covering up to 2^27 operations are known, whereas SNARK-based verifiable delay functions (VDFs) might require proving up to 2^40 operations [KMT22]. A single IVC operation is typically a compact arithmetic computation (polynomial) in a certain prime field or an assertion of some low-degree polynomial predicate. With verifiable computation (VC) programs (also called circuits) being that large and containing cryptographic protocols, more and more programs contain hash functions as subroutines. Hash functions and their underlying permutations are used not only for data integrity checks, but also to instantiate commitment schemes, authenticated encryption [PSS19, CFG+22], non-interactive proofs based on the Fiat-Shamir transform, and many other techniques.
Hash Functions in IVC Applications. For typical applications of hash functions (e.g., integrity checks), standard choices like SHA-2 or SHA-3 are usually not the bottleneck when considering the algorithmic description of the protocol. However, this is different in the IVC applications mentioned above. For hashing and membership proofs in ZK, e.g., in folding schemes [KST22, KS23, BC23] or private mixers [PSS19], the size of a hash function as an arithmetic circuit over a prime field is more important than the "native" software performance (e.g., on an x86 architecture). Several new hash functions have tried to bridge this gap [AGR+16, AAE+20, GHR+23, GKR+21, SAD20, BBC+23].
Hash functions may also be used as a commitment tool in IVC frameworks where the underlying commitment scheme is not homomorphic (STARKs being a notable example [BBHR19]). With a prover and a verifier engaging in commit-open protocols over prime fields, this setting requires efficiently constructing a Merkle tree in a prime field domain over large amounts of data. So far, these computations have been performed natively on x86 hardware and not (yet) inside a circuit. Here, classical hash functions have been used until recently.
Both cases appear in recursive schemes, in particular in recursive STARKs [COS20], which are an attractive IVC concept due to relatively little overhead and the possibility of parallelism for large or long computations. These schemes are used in an increasing number of applications, with zero-knowledge virtual machines [Fou22, Pol22b, Zha22] and decentralized signature aggregation [But22] protocols as notable examples. In recursive STARKs, the computation and its proof are broken into chunks C_1, C_2, ..., C_k and subproofs π_1, ..., π_k such that the proof π_i certifies that the chunks from C_1 to C_i are computed correctly, by utilizing the previous proof π_{i-1} and a proof for C_i. In order to create π_i, the prover computes a Merkle tree over the witness data and then proves some tree openings in a circuit. Thus, the same hash function is used in the circuit and in the native computation. In this scenario, up to 90% of a prover's computation may be spent on the hash function calls and proofs [COS20], and the construction of a function that excels in both areas is a crucial open problem.
Lookups and Small Domains. Two recent developments in IVC are relevant to our design. The first one is the lookup technique. Starting with Plookup, IVC operations include not only arithmetic expressions but also lookup statements of the form a ∈ T, where T is a table available to the verifier [GW20, PH23, STW24]. For some polynomial commitment schemes (but not for FRI), the table may be preprocessed [ZBK+22, ZGK+22, EFG22] so that its size does not contribute to the online prover cost. The lookup technique not only reduces the cost of traditional hash functions in circuits but also allows for cheap transformations of high algebraic degree [GKL+22, SLS+23]. Another improvement is purely technical but nevertheless vital for performance: the use of small prime fields of ≤ 64 bits of special form like 2^k − 1 or 2^m − 2^k + 1 [Pol22a, Pol23, RIS23], which allow for more efficient arithmetic operations. STARK proofs [BBHR19] can use them since they do not require a group where the discrete logarithm problem is assumed to be hard. The performance growth is significant: switching to an efficient 64-bit field improves the performance of the Poseidon hash function by a factor of up to 10 [GKS23]. Moreover, the modular reduction for these fields can often be implemented with mere additions and bit shifts, which are vectorizable on modern CPU architectures and faster than in larger and more generic prime fields. Small fields for IVC applications are also prominent in other recent works [HLN23, Hab23].

Our Contributions
We approach the problem of creating a hash function that is simultaneously fast and circuit-friendly in several steps.First we summarize the technical ideas of the new design, and then we introduce the new hash function Monolith.

Efficient Nonlinearity and Compact Circuits over Prime Fields
Our first main contribution is a generic design of components over certain prime fields, which can be implemented with just a few (and possibly vector) constant-time instructions on the x86 architecture, and can be written as a small circuit. This strategy, called Kintsugi, significantly improves upon the ideas behind Reinforced Concrete [GKL+22] and Tip5 [SLS+23], yielding faster and constant-time-friendly S-boxes. These new S-boxes are defined by first splitting a field element into smaller bit arrays. Then, constant-time-friendly S-boxes using Daemen's χ function and similar ones [Dae95] are applied to these arrays; they can be parallelized with fast vector instructions natively and implemented as lookup tables in circuits. Finally, the outputs are assembled back into a field element with no overflow or collision, which is asserted in circuits with minimal overhead.

Low-Degree Components with Provable Differential Bounds
Our second contribution is a concept of using a Feistel Type-3 [ZMI89] function together with an MDS layer. It can be seen as a replacement for the power function x^d from Poseidon [GKR+21] and similar constructions. The advantage is that we can use faster squaring operations (i.e., x^2) instead of more expensive power functions over F_p (as d must be coprime with p − 1), and simultaneously obtain low-degree predicates in circuits.

Notably, x ↦ x^2 is not invertible over F_p, and hence we cannot use this component to build an invertible SPN. However, we can exploit a Feistel scheme to make the entire construction invertible. A discussion regarding the risks of using non-bijective components for designing symmetric primitives in which the internal state is not obfuscated by a secret can be found in [Gra23].

Although the Feistel layer alone is known to have weak diffusion, we show that together with an MDS matrix it comes close to a regular SPN. To the best of our knowledge, we are the first to prove results on the differential properties of this component using a strategy analogous to the wide trail design [DR02]. In particular, we prove lower bounds on the number of active nonlinear functions in trails. Similar to the extended generalized Feistel networks introduced in [BMT13], we believe that this result and its possible extension to Feistel structures of other types may be useful in the design of any symmetric primitive, including those for more classical settings (as already happened for the Lilliput cipher [BFMT16]).

Monolith: Fast, Constant-Time, Circuit-Friendly
All of these techniques lead us to the design of Monolith, a family of permutations which are efficient in native software, in hardware, and inside of circuits. This permutation can then be turned into a hash function and other permutation-based schemes.

[Fig. 1: Performance overview. The numbers for Monolith-64, Poseidon, and Poseidon2 are taken for the 64-bit prime field and a state size of t = 12. Proof (IVC) timings are benchmarks for a proof of preimage knowledge (Table 7). Numbers for SHA3-256 and SHA-256 are extrapolated from a circom implementation using R1CS [Bal23].]

Construction of Monolith. Our scheme has a few rounds using three different components. We adopt the naming convention of Reinforced Concrete.
The first component is Bricks (Section 4.4), which is instantiated with a Feistel Type-3 construction with square mappings. The second component is Concrete (Section 4.5), which is the multiplication with a circulant MDS matrix. Together with Bricks it provides the diffusion necessary to protect against statistical attacks. The third and last component is Bars (Section 4.3), which is based on the Kintsugi strategy outlined above. We prove that each such Bar operation has a high degree and provides high security against algebraic attacks. The Bar function is applied only to a few field elements in each round.

The combination of these three components provides security against statistical and algebraic attacks while allowing for an efficient implementation. Our initial analysis has found a 3-round attack on a weakened version, and also suggests that all potential attacks should stop at 4 rounds. Since improvements are expected, we set the number of rounds uniformly to 6.

Performance. We give an extensive comparison between our new proposal and its competitors in Section 7. Our benchmarks confirm that the native performance of Monolith is comparable to SHA-3, which makes it the first circuit-friendly compression function achieving this goal. At the same time, Monolith is efficient within IVC systems. In contrast to Reinforced Concrete, Monolith also allows for a constant-time implementation without significant performance loss.

A performance overview is given in Fig. 1. We test the IVC performance on Plonky2 [Pol22a], a popular choice for FRI-based proofs. Compared to Tip5, Monolith is around twice as fast and gives the user more freedom regarding the choice of the prime number (including the recent 31-bit prime used in [RIS23] due to advantageous implementation characteristics). Moreover, compared to the widely used Poseidon permutation, Monolith shows a native performance improvement by a factor of around 15. Finally, Monolith allows for an efficient circuit implementation, since it can be represented by a low number of degree-2 constraints, leading to faster performance compared to Poseidon when implemented in Plonky2 (see Table 7).

Fast and Circuit-Friendly Functions over F p
When working over F_p, informally, we cannot just split a field element into smaller chunks, process them independently, and then reassemble. This is due to the fact that the field size is a prime and thus cannot be represented as a product of smaller domains.

To solve this problem, we present a generic strategy for specific prime numbers. Elements of it can be found in earlier works on Reinforced Concrete [GKL+22] and Tip5 [SLS+23]. The main principles are as follows.
1. Split the integer form of a field element into chunks along carefully chosen boundaries, aligned with the powers of two that sum to p, such that the resulting chunks fit a lookup table in a ZK circuit.
2. Identify the combinations of chunk values that never appear due to the fact that p is not a power of two.

3. Design intra-chunk transformations S_i such that
• impossible chunk combinations never appear (e.g., by making some chunk values fixed points), and
• they can be implemented in constant time, for example with an AndRX (AND-rotation-XOR) transformation [AJN14].
4. Combine the chunks back into a large element, after a possible shuffle (only operations guaranteeing that the output element is in the field are possible).
We call this strategy Kintsugi. An illustration is shown in Fig. 2.
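The steps above can be sketched in a few lines of Python. The bucket sizes below are example choices for the Goldilocks prime (p′ = p − 1 consists of 32 ones followed by 32 zeroes, so eight aligned 8-bit buckets exist), not the normative Monolith parameters; the point is that the split is plain bit extraction and the recombination is exact:

```python
# Sketch of the Kintsugi split/recombine steps (illustrative bucket
# sizes; the S-box step in between is omitted here).
P = 2**64 - 2**32 + 1           # Goldilocks prime, p = 1 mod 4
P_PRIME = P - 1                 # p' = 1^32 || 0^32 in binary
TAUS = [8] * 8                  # eight aligned 8-bit buckets

def decompose(x):
    # Step 1: plain bit extraction, most significant bucket first.
    out = []
    for tau in reversed(TAUS):
        out.append(x & ((1 << tau) - 1))
        x >>= tau
    return out[::-1]

def compose(chunks):
    # Step 4: reassemble; the result stays below p as long as the
    # impossible chunk patterns identified in step 2 never appear.
    x = 0
    for tau, c in zip(TAUS, chunks):
        x = (x << tau) | c
    return x

x = 0x0123456789abcdef
assert compose(decompose(x)) == x
```

Contrast this with a chain of modular reductions: bit extraction costs only shifts and masks natively, and one range assertion per chunk in a circuit.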

Chunks and Buckets
In order to formally define the Kintsugi strategy, we need to introduce some notation. For a prime p ≥ 5, we define p′ as

p′ := p if p ≡ 3 mod 4, and p′ := p − 1 if p ≡ 1 mod 4,

so that the binary representation of p′ consists of alternating sequences of ones and zeroes,

p′ = 1^{ν_1} || 0^{µ_1} || 1^{ν_2} || 0^{µ_2} || ···,

where ·||· denotes concatenation. The first sequence is always a 1-sequence, while the last one can be either a 0- or a 1-sequence.
Given the lengths of the 1-chunks ν_1, ν_2, ..., ν_ξ and the lengths of the 0-chunks µ_1, µ_2, ..., µ_ξ (both from left to right), we write

p′ = 1^{ν_1} || 0^{µ_1} || 1^{ν_2} || 0^{µ_2} || ··· || 1^{ν_ξ} || 0^{µ_ξ}, (1)

where µ_ξ may be zero (if p′ ends with a 1-sequence).

For efficiency, we may split each chunk into sub-chunks, called buckets. Each S-box will then work independently on each bucket. To obtain simple conditions for invertibility, we require the buckets to be aligned with chunk boundaries, i.e., we require that buckets do not cross boundaries between chunks. We formalize this in the following.

Definition 2. Let p be a prime with 1- and 0-chunks defined by Eq. (1), and let T = {τ_1, ..., τ_s} be a bucket decomposition, i.e., some positive integers τ_1, ..., τ_s such that Σ_{i=1}^{s} τ_i = Σ_{i=1}^{ξ} (ν_i + µ_i). We say that the bucket decomposition T is aligned with p′ if for every i ∈ {1, 2, ..., ξ} there exist k_i, l_i such that

ν_i = τ_{k_i+1} + τ_{k_i+2} + ··· + τ_{l_i} and µ_i = τ_{l_i+1} + ··· + τ_{k_{i+1}}.

This means that for every i the i-th 1-chunk covers buckets from k_i to l_i (exclusive). Such buckets are called 1-buckets. Further, the i-th 0-chunk covers buckets from l_i to k_{i+1} (exclusive). These are called 0-buckets. This decomposition is illustrated (with small buckets) in Figure 3.
Finally, we impose that the buckets are not too small, in order to avoid potential security issues. Indeed, the number of fixed points and/or invariant subspaces for Kintsugi becomes too large when the buckets are too small.

Definition 3. The bucket decomposition is efficient if τ_i ≥ 3 for each i ≥ 1.

This condition puts a constraint on p. However, we believe it is satisfied by the majority of the primes used in cryptography, including the ones used in our work. We highlight that we work with p′ instead of p directly, since the efficiency condition just given is never satisfied by p ≥ 5 itself whenever p ≡ 1 mod 4 (in that case, the binary representation of p ends with 01, i.e., its last 1-chunk has length 1).

The Kintsugi Bar
The nonlinear component Bar, based on Kintsugi, is defined as follows. Let τ_1, τ_2, ..., τ_s be an efficient and aligned bucket decomposition for p as in Eq. (1). Then, for C, S, and D described in the following, the component operates as

Bar(x) = (C ∘ S ∘ D)(x). (2)

Decomposition D. The decomposition D splits a field element into its buckets over the integers, i.e.,

D(x) = (x_1, x_2, ..., x_s) with x_i = ⌊x/2^{ρ_i}⌋ mod 2^{τ_i},

where ρ_s = 0 and ρ_i = Σ_{j>i} τ_j. As the bucket decomposition is aligned, we get that each bucket is either a 1- or a 0-bucket.
S-Boxes S. The operation S applies s invertible S-boxes in parallel, i.e.,

S(x_1, x_2, ..., x_s) = (S_1(x_1), S_2(x_2), ..., S_s(x_s)), (3)

where S_i : Z_{2^{τ_i}} → Z_{2^{τ_i}}, and we require that

S_i(1^{τ_i}) = 1^{τ_i} if S_i operates on a 1-bucket of p′, and S_i(0^{τ_i}) = 0^{τ_i} if S_i operates on a 0-bucket of p′. (4)

Hence, a z-chunk of p′ must be mapped via S_i into a z-chunk, where z ∈ {0, 1}.
Composition C. The final operation C is the inverse of the decomposition. Given (x_1, x_2, ..., x_s), it computes

C(x_1, x_2, ..., x_s) = Σ_{i=1}^{s} 2^{ρ_i} · x_i mod p,

where ρ_s = 0 and ρ_i = Σ_{j>i} τ_j.

Well-Definition and Bijectivity
Here we prove that our map C ∘ S ∘ D(·) defined in Eq. (2), and in particular its S components, is invertible and well-defined.

Proposition 1. Let p be a prime and {τ_i} a bucket decomposition aligned with p′. Then Kintsugi (Eq. (2)) with the S-boxes satisfying Eq. (4) is bijective over F_p.

Proof. We consider the natural extension of the transformation C ∘ S ∘ D(·) to the domain Z_{2^ρ} and denote it by T. Then we proceed in two steps. First we prove that T is bijective over Z_{2^ρ}. Then we prove that for any x < p we have T(x) < p. These two facts imply the result.
Transformation T. We define T : Z_{2^ρ} → Z_{2^ρ}, with ρ = Σ_{i=1}^{s} τ_i, as T = C′ ∘ S ∘ D′, where D′ is a generalization of D that takes inputs from Z_{2^ρ} instead of Z_p, i.e.,

D′(x) = (x_1, x_2, ..., x_s) with x_i = ⌊x/2^{ρ_i}⌋ mod 2^{τ_i}.

Further, S is defined as before and C′ is the inverse of D′ (basically, it corresponds to C without the modular reduction).

Bijectivity of T. This follows from the fact that D′, S, and C′ are bijective.
Field Invariance of T. Finally, we have to prove that ∀x ∈ {0, ..., p − 1} : T(x) ∈ {0, ..., p − 1}.

Let us start by analyzing the case x = p − 1. If p − 1 = p′ (i.e., p ≡ 1 mod 4), then all buckets of D(x) are fixed points of their S-boxes (due to Eq. (4)), and thus T(x) = x < p. Instead, if x = p − 1 ≠ p′ (i.e., p′ = p), then D(x) differs from D(p′) in the last bucket: the former ends with 10 and the latter with 11. As 2^{τ_s} − 1 is a fixed point of the bijective S-box S_s, we get that S_s(x_s) < 2^{τ_s} − 1, and so T(x) < p′ ≤ p.

Next, let us consider the case x < p − 1. Consider the binary form of x, and let b be the most significant bit in which it differs from p′. Clearly, b is in a 1-bucket of p′ with some index i (a difference in a 0-bucket would imply x > p′). Note that for each j < i all S-boxes S_j act as identity functions, since the corresponding buckets of x coincide with the all-one or all-zero buckets of p′, which are fixed points. Moreover, the i-th input bucket is different from 1^{τ_i}, and by the bijectivity of S_i together with the fixed-point condition of Eq. (4) it is mapped to a value different from 1^{τ_i}. Hence T(x) < p′ ≤ p.

The two previous facts together with T being bijective imply that T(x) ∈ {0, ..., p − 1} for every x ∈ {0, ..., p − 1}.
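Proposition 1 can be checked exhaustively on a toy instance. The sketch below (all parameter choices are illustrative) uses the 7-bit Mersenne prime p = 127, for which p′ = p = 1^7 is a single 1-chunk covered by one 7-bit bucket, and Daemen's χ function as the S-box:

```python
# Toy check of Proposition 1 for p = 2^7 - 1 = 127: one 7-bit
# 1-bucket, S-box = chi (invertible for odd width, fixes 0^7, 1^7).
TAU = 7
P = 2**TAU - 1

def rotl(x, r):
    return ((x << r) | (x >> (TAU - r))) & ((1 << TAU) - 1)

def chi(x):
    # x ^ ((~x <<< 1) & (x <<< 2)), the Keccak-style chi on 7 bits
    return x ^ (rotl(x ^ P, 1) & rotl(x, 2))

# With a single bucket, D and C are the identity, so Bar = chi here.
images = {chi(x) for x in range(P)}
assert len(images) == P                 # bijective on {0, ..., p-1}
assert all(y < P for y in images)       # the field is preserved
assert chi(P) == P and chi(0) == 0      # 1^7 and 0^7 are fixed points
```

The last assertion is exactly the fixed-point condition of Eq. (4): since 1^7 maps to itself and χ is bijective, no element below p can be mapped onto 1^7 = p, so T never leaves the field.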

Considerations about the Kintsugi Strategy
Due to the link between F_2^{τ_i} and F_{2^{τ_i}}, almost any invertible AndRX transformation works well for S and can be implemented in constant time, as its components are basic x86 operations. Here we give some examples for p = 2^n − 1.
• Bit Shuffle. Clearly, both 1^τ and 0^τ are fixed points under any bit shuffling operation for any τ. Moreover, it is essentially for free in hardware.
• Efficient Linear Operations. Linear operations over F_2^τ of the form

x ↦ x ⊕ (x ≪ i) ⊕ (x ≪ j)

with nonzero i ≠ j, where ≪ denotes the circular shift operation and ⊕ denotes the logical XOR operator, are (i) invertible for odd τ and suitable choices of i, j and (ii) result in 1^τ and 0^τ being fixed points (as the XOR of an odd number of copies of 1^τ is again 1^τ).
• Efficient Nonlinear Operations. Nonlinear operations over F_2^τ such as

x ↦ x ⊕ ((x̄ ≪ 1) ⊙ (x ≪ 2))

for odd τ, where x̄ := x ⊕ 1^τ and ⊙ denotes the logical AND operator, are also possible. This corresponds to the χ-function [Dae95, Table A.1] already used in Keccak/SHA-3, which is known to be invertible for gcd(τ, 2) = 1. Moreover, 1^τ and 0^τ are fixed points.
An additional bit rotation may be needed to reduce the number of fixed points.
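Both kinds of building blocks are easy to check exhaustively. A minimal sketch for τ = 7 with example rotation amounts (illustrative choices, not a specification):

```python
# Constant-time AndRX building blocks on tau-bit words (tau odd),
# checking invertibility and the fixed points 0^tau and 1^tau.
TAU = 7
MASK = (1 << TAU) - 1

def rotl(x, r):
    return ((x << r) | (x >> (TAU - r))) & MASK

def lin(x):
    # x ^ (x <<< 1) ^ (x <<< 2): linear over F_2^tau, odd term count
    return x ^ rotl(x, 1) ^ rotl(x, 2)

def chi(x):
    # x ^ ((~x <<< 1) & (x <<< 2)): Daemen's chi, nonlinear
    return x ^ (rotl(x ^ MASK, 1) & rotl(x, 2))

for f in (lin, chi):
    assert len({f(x) for x in range(2**TAU)}) == 2**TAU  # invertible
    assert f(0) == 0 and f(MASK) == MASK                 # fixed points
```

Every operation used here (shift, OR, XOR, AND, mask) runs in constant time on x86, which is precisely the property the Kintsugi S-boxes need.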

Bars in Kintsugi and Reinforced Concrete
There are various differences between the Kintsugi strategy just described and the Bars functions proposed in Reinforced Concrete (and later used in Tip5). In Reinforced Concrete, an element of F_p is represented as a vector over a product of small domains Z_{s_1} × ··· × Z_{s_n}, obtained via a chain of modular reductions. In contrast:
• We rely on the structure of the prime p. Thanks to its composition of a few powers of two, the decomposition now is simply a bit extraction rather than a chain of modular reductions, which is expensive both natively and inside the proof system. The bijectivity of Kintsugi is guaranteed under the minor and easily satisfied condition that some specific inputs are fixed points.
• The S-boxes of Reinforced Concrete and Tip5 do not have a simple algebraic representation and must be implemented as tables for both native and circuit computations. The Kintsugi strategy instead instantiates the S-boxes with AndRX transformations, which are fast and constant-time in native x86 implementations, but can easily be transformed into table lookups for circuits.

Side-Channel Leakage and Countermeasures
Lookup tables in symmetric primitives are a well-known source of side-channel leakage due to cache timing. When confidential information is processed (e.g., when committing to coin secrets with ZK hash functions in privacy-preserving payment systems), an adversary may recover a large portion of it from timing differences of lookups into memory or caches. These techniques have been well known for at least two decades in the context of encryption [Pag02, Ber05, OST06], and the high-level ideas have found first applications in zero-knowledge proof systems [TBP20]. The lookup-oriented designs Reinforced Concrete and Tip5 use specific tables for which a constant-time implementation with reasonable overhead is nontrivial. It is thus of utmost importance to have a design where lookups can be replaced with constant-time operations.

Statistical and Algebraic Properties
Here we prove a generic statement that links the algebraic and statistical properties of mappings over F_p, which we will use in the security analysis of Monolith.
Lemma 1. Let p ≥ 3 be a prime number, and let F_sq denote the squaring function x ↦ x^2 over F_p. Let F̄_sq be any interpolant of F_sq over F_2^{⌈log_2 p⌉}, i.e., for any a < p and its bit representation a⃗ we have that F̄_sq(a⃗) is the bit representation of F_sq(a). Then F̄_sq has (multivariate) degree at least d, where d is the maximum positive integer such that d < log_2 √p and ⌈2^{d−0.5}⌉ is odd.

Proof. We prove this result by contradiction. Suppose that the degree of F̄_sq is smaller than d. Then the XOR sum of its outputs over any hypercube of dimension d is equal to zero [Lai94], including the hypercube {0, 1, ..., 2^d − 1} spanned by the d least significant input bits. Since 2^{2d} < p, no modular reduction occurs when squaring the elements of this hypercube. The 2d-th least significant bit of x^2 is set if and only if x^2 ≥ 2^{2d−1}, i.e., if and only if x ≥ ⌈2^{d−0.5}⌉, which happens for exactly 2^d − ⌈2^{d−0.5}⌉ values of x in the hypercube (as 2^d is even, this number is odd precisely when ⌈2^{d−0.5}⌉ is odd). Whenever this number is odd, F̄_sq does not XOR to 0 at the 2d-th least significant bit, which contradicts the previous fact. As a result, the squaring has at least degree d if ⌈2^{d−0.5}⌉ is odd and d < log_2 √p.
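The counting step of this proof can be verified numerically. The following sketch checks, for small d, that the 2d-th least significant bit of x^2 is set for exactly 2^d − ⌈2^{d−0.5}⌉ values of x in the hypercube {0, ..., 2^d − 1}:

```python
# Numeric check of the counting step in the proof of Lemma 1.
import math

for d in range(2, 12):
    # number of x in {0, ..., 2^d - 1} whose square has bit 2d-1 set
    ones = sum((x * x >> (2 * d - 1)) & 1 for x in range(2**d))
    assert ones == 2**d - math.ceil(2**(d - 0.5))
    # the XOR sum over the hypercube is 1 iff ceil(2^(d-0.5)) is odd
    assert (ones & 1) == (math.ceil(2**(d - 0.5)) & 1)
```

Since 2^{d−0.5} is irrational, x ≥ 2^{d−0.5} is equivalent to x ≥ ⌈2^{d−0.5}⌉ for integers x, which is what the assertions exercise.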
Lemma 2 (Differential). Let F be a function that maps F_p to itself such that there exists a differential (a, b) with probability 0 < α < 1, that is, F(x + a) − F(x) = b for exactly α · p values x ∈ F_p. Then deg(F) > α · p.

Proof. Consider the polynomial G(x) := F(x + a) − F(x) − b. By assumption, G has at least α · p roots, and so it has a degree of at least α · p. As the degree of the polynomial G is smaller than the degree of F by 1, we obtain that deg(F) > α · p.
Lemma 3 (Linear Approximation). Let F be a function that maps F_p to itself such that there exists a linear approximation (a, b) with probability 0 < β < 1, that is, F(x) = a · x + b for exactly β · p values x ∈ F_p. Then deg(F) ≥ β · p.

Proof. By definition, the equation F(x) − a · x − b = 0 has β · p solutions, hence the nonzero polynomial F(x) − a · x − b has at least β · p roots and thus degree at least β · p. Similar to before, we conclude that F has degree at least equal to β · p.
Based on the previous results, we can immediately conclude the following.

Corollary 1. Let F be a function that maps F_p to itself with deg(F) = d. Then every differential of F has probability at most d/p, and every linear approximation of F has probability at most d/p.

Feistel Type-Layer and the Wide Trail Strategy
The Kintsugi Bar is nonlinear, but we will see in Section 6.1 that its high algebraic degree comes at the cost of weak differential properties. Thus, if used in an SPN construction, it would make the scheme vulnerable to statistical (e.g., differential) cryptanalysis. For this reason, we introduce another nonlinear component, defined as a Feistel Type-3 network [ZMI89], that complements the Kintsugi Bars. By instantiating it via low-degree functions, it allows us to provide strong arguments for resistance against differential and statistical attacks in general. We follow the naming convention of Reinforced Concrete (the first lookup-based ZK-friendly hash function), where the nonlinear layer providing protection against statistical attacks is called Bricks, and use the same name for uniformity.
Feistel Type-3. The Feistel Type-3 network is a member of a larger Feistel family [HR10], which has been largely neglected in favor of SPN schemes in block cipher and hash function design, primarily due to its complexity and worse diffusion properties. As already recalled in the introduction, a potential drawback of SPN schemes is that their invertibility depends on all internal components being invertible as well. As is well known, this is not the case for Feistel networks, which remain invertible independently of the details of their internal functions. For many prime-order groups used in SNARKs, the smallest invertible power mapping is x^5. As a result, we have found the Feistel Type-3 network instantiated with square maps x ↦ x^2 to be particularly attractive, as it is cheaper in circuits and, most importantly, its blend with an MDS layer yields statistical properties similar to those of regular SPNs. With nonlinear functions F_1, ..., F_{t−1}, Bricks_F for t elements x_1, ..., x_t is defined as

Bricks_F(x_1, x_2, ..., x_t) = (x_1, x_2 + F_1(x_1), x_3 + F_2(x_2), ..., x_t + F_{t−1}(x_{t−1})), (5)

where, in contrast to the original description, the swap of the wires is omitted. Further diffusion is instead handled by a matrix multiplication.
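A short sketch of Bricks_F with F_i(x) = x^2 over a prime field. The wire-by-wire inversion below illustrates why the construction is invertible even though squaring is not (the prime and state values are illustrative):

```python
# Feistel Type-3 layer of Eq. (5) with F_i(x) = x^2 over F_p,
# together with its inverse, recovered wire by wire.
P = 2**31 - 1   # Mersenne prime used as an example field

def bricks(state):
    # y_1 = x_1, y_i = x_i + x_{i-1}^2 for i >= 2
    return [state[0]] + [(x + prev * prev) % P
                         for prev, x in zip(state, state[1:])]

def bricks_inv(state):
    # x_1 = y_1, then x_i = y_i - x_{i-1}^2, one wire at a time
    out = [state[0]]
    for y in state[1:]:
        out.append((y - out[-1] * out[-1]) % P)
    return out

s = [3, 1, 4, 1, 5, 9, 2, 6]
assert bricks_inv(bricks(s)) == s
```

Note that the inverse never needs a square root: each wire is recovered from already-known plaintext wires, which is exactly the Feistel argument for using a non-bijective F.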
Diffusion Layer. While Bricks_F alone does not provide fast diffusion, a combination with a matrix layer increases the diffusion properties [BMT13, BFMT16]. This approach is well known in SPN design as the wide trail strategy [DR01], where a lower bound on the number of "active" nonlinear components in any differential trail is proven, leading to strong arguments against differential attacks.

Here we follow this line of research, and for the first time we derive bounds for the SPN structure where the nonlinear layer is a Feistel Type-3 function. For this, we work with matrices of Maximum Distance Separable (MDS) codes to maximize the number of active F_p-words over two consecutive rounds.
Our New Bound.Now we obtain our main result on the differential properties of the Feistel-Type-3-MDS combination.Our new bound improves the ones recently proposed in [Gra23] for an analogous (but different) scheme.
Proposition 2. Consider an R-round construction, where each round consists of the application of Bricks_F over F_q^t as in Eq. (5) followed by the multiplication with a t × t MDS matrix. The minimum number ĉ of active functions F_i in any differential trail satisfies

ĉ ≥ (t − 1)/9 · (3R − 2 − (−2)^{1−R}) ≥ (t − 1)(6R − 5)/18.

Proof. Denote the number of active words in the input and the output of the i-th Bricks_F layer by a_i and b_i, respectively. Then we exploit two properties.
• Each active input word x_i to Bricks_F activates F_i if i < t, hence a active words activate at least a − 1 functions F_i.
• Each active output word y_i of Bricks_F implies that x_i or F_{i−1} is active, and an active x_i with i < t in turn activates F_i. Hence b active words activate at least (b − 1)/2 functions.
With the MDS property, which states that b_k + a_{k+1} ≥ t + 1 for each k ≥ 1, we obtain the following for the number c_k of active functions F_i in round k:

c_k ≥ a_k − 1 and c_k ≥ (b_k − 1)/2 for 1 ≤ k ≤ R.

Summing each two consecutive inequalities for c_i, we obtain

2c_k + c_{k+1} ≥ (b_k − 1) + (a_{k+1} − 1) = b_k + a_{k+1} − 2 ≥ t − 1, (7)

with the last inequality being the MDS property. W.l.o.g., let us find a bound for ĉ := c_1 + ··· + c_R where all c_i are non-negative real values satisfying Eq. (7). First, the optimal {c_i} make all inequalities tight. Indeed, suppose that 2c_j + c_{j+1} > t − 1 but for all k > j we have 2c_k + c_{k+1} = t − 1. Then we can decrease c_j by some δ > 0 and compensate by increasing each earlier c_k by half as much as its successor, which keeps all inequalities satisfied while strictly decreasing ĉ. Then we observe that in the optimal {c_i} it should hold that c_R = 0. Indeed, otherwise we apply the same trick by setting c_R to 0 and increasing c_{R−1} by c_R/2, which again decreases ĉ. Thus, the minimum is achieved by c_R = 0 and

c_{R−k} = (t − 1)/3 · (1 − (−1/2)^k) for 0 ≤ k ≤ R − 1.

Substituting these values into the formula for ĉ, we obtain

ĉ = (t − 1)/9 · (3R − 2 − (−2)^{1−R}).

The weaker bound follows from (−2)^{1−R} ≤ 1/2 for R > 1.

Remark 1 (On the Design Rationale). Our choice of Feistel versus SPN is purely performance-driven: fewer non-constant field multiplications in the former when using x ↦ x^2. However, neither Feistel Type-3 nor Type-2 alone would provide good statistical properties [HR10]. Notably, the combination of Type-2 with an MDS layer would not allow us to derive optimal bounds on the number of active nonlinear functions in such a simple and elegant way either.
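The closed form of the bound can be cross-checked against the tight recurrence numerically; a small sketch using exact rational arithmetic (t = 12 is just an example value):

```python
# Sanity check for the closed-form bound: the backward recurrence
# c_R = 0, c_j = (t - 1 - c_{j+1}) / 2 (all inequalities tight)
# must sum to (t - 1)/9 * (3R - 2 - (-2)^(1 - R)).
from fractions import Fraction

t = 12
for R in range(2, 10):
    c = [Fraction(0)]                      # c_R = 0
    for _ in range(R - 1):
        c.append((t - 1 - c[-1]) / 2)      # enforce 2c_j + c_{j+1} = t - 1
    total = sum(c)
    closed = Fraction(t - 1, 9) * (3 * R - 2 - Fraction(-2)**(1 - R))
    assert total == closed
```

Exact fractions avoid any floating-point doubt about the alternating (−2)^{1−R} term.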

Specification of Monolith
Monolith is a family of permutations which can be used within hash functions and other constructions. They use prime fields F_p with two options for p, namely

p_Goldilocks = 2^64 − 2^32 + 1 and p_Mersenne = 2^31 − 1. (8)

The permutation Monolith-64 is defined over p_Goldilocks with a state of t = 8 or t = 12 elements. The permutation Monolith-31 is defined over p_Mersenne with a state of t = 16 or t = 24 elements.

Modes of Operation
Monolith supports sponge modes and a 2-to-1 compression function.
Sponge-Based Schemes. First, Monolith can instantiate a sponge [BDPV07, BDPV08] and thus various symmetric constructions such as variable-length hash functions, commitment schemes, authenticated encryption, and stream ciphers. The recently proposed SAFE framework [AKMQ22, KBM23] describes how to handle domain separation and padding in these constructions. In a sponge, the permutation state is split into an outer part with a rate of r elements and an inner part with a capacity of c elements. As we uniformly suggest a security level close to 128 bits, we choose c such that c · log_2(p) ≈ 256 and set r = 2c.
2-to-1 Compression. Second, Monolith can be used as a 2-to-1 compression function. Concretely, it takes t F_p elements as input and produces t/2 F_p elements as output. It is defined as

x ↦ Trunc_{t/2}(Monolith(x) + x),

where Trunc_{t/2} yields the first t/2 elements of the input. This compression function can be used in Merkle trees and has recently also been applied in similar constructions, including Anemoi [BBC+23], Griffin [GHR+23], and Poseidon2 [GKS23]. For a security level close to 128 bits, we choose t such that t · log_2(p) ≈ 512, i.e., t = 8 for the 64-bit field and t = 16 for the 31-bit field (factually yielding slightly less than 128 bits).
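A sketch of this compression mode with a stand-in permutation (the real construction uses the full R-round Monolith permutation; `perm` below is a placeholder, and the parameters are illustrative):

```python
# 2-to-1 compression: permute, feed the input forward, truncate.
P = 2**31 - 1   # example field
T = 16          # state size t

def perm(state):
    # Placeholder permutation, NOT Monolith; it only fixes the shape.
    return [(x * x + i) % P for i, x in enumerate(state)]

def compress(x, y):
    state = x + y                     # t field elements: x || y
    out = perm(state)
    # feedforward + truncation to the first t/2 elements
    return [(o + s) % P for o, s in zip(out, state)][: T // 2]

digest = compress([1] * (T // 2), [2] * (T // 2))
assert len(digest) == T // 2
```

The feedforward addition is what makes the map non-invertible (as a compression function should be), even though the inner permutation is invertible.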

Permutation Structure
The Monolith permutation is defined as

Monolith(x) = (R_R ∘ R_{R−1} ∘ ··· ∘ R_1 ∘ Concrete)(x),

where R is the number of rounds and the round functions R_i over F_p^t are defined as

R_i(x) = c^{(i)} + (Concrete ∘ Bricks ∘ Bars)(x),

where Concrete is a linear operation, Bars and Bricks are nonlinear operations over F_p^t, c^{(1)}, ..., c^{(R−1)} ∈ F_p^t are pseudo-random round constants, and c^{(R)} = 0⃗. Note that a single Concrete operation is applied before the first round. A graphical overview of one round of the construction is shown in Fig. 4.

Bars
The Bars layer is defined as

Bars(x_1, ..., x_t) = (Bar(x_1), ..., Bar(x_u), x_{u+1}, ..., x_t)

for a t-element state, where u ∈ {1, ..., t} denotes the number of Bar applications in a single round. We select u such that u · log_2(p) ≈ 256, i.e., the nonlinear part occupies around 256 bits of the state. Each Bar application is defined as

Bar(x) = (C ∘ S ∘ D)(x),

where C, S, and D are the operations defined in Section 2. In the following, we describe them individually for Monolith-64 and Monolith-31.
Monolith-64: Operations D and C. We use a decomposition into eight 8-bit values x_1, ..., x_8 such that x = Σ_{i=1}^{8} 2^{8(8−i)} · x_i. The composition C is the inverse operation of the decomposition D.

S-Boxes S. In Eq. (3) we set s = 8. Then all S_i over F_2^8 are defined as

S_i(y) = (y ⊕ ((ȳ ≪ 1) ⊙ (y ≪ 2) ⊙ (y ≪ 3))) ≪ 1,

where ≪ is a circular shift (here we interpret an integer as a big-endian 8-bit string) and ȳ is the bitwise negation of y (cf. the landscape *100 in [Dae95, Table A.1]).
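A sketch of this 8-bit S-box as we read the description (the rotation offsets and their order are our interpretation and should be checked against the reference specification), verifying bijectivity and the Kintsugi fixed-point conditions:

```python
# 8-bit Bar S-box for the 64-bit field, as reconstructed above.
def rotl8(x, r):
    return ((x << r) | (x >> (8 - r))) & 0xff

def sbox64(y):
    # y ^ ((~y <<< 1) & (y <<< 2) & (y <<< 3)), then rotate by 1
    return rotl8(y ^ (rotl8(~y & 0xff, 1) & rotl8(y, 2) & rotl8(y, 3)), 1)

assert len({sbox64(y) for y in range(256)}) == 256    # bijective
assert sbox64(0x00) == 0x00 and sbox64(0xff) == 0xff  # fixed points
```

The fixed points 0x00 and 0xff are exactly the all-zero and all-one buckets required by Eq. (4), and every operation is a constant-time byte operation.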
Monolith-31: Operations D and C. The decomposition D splits a field element into three 8-bit values and one 7-bit value, i.e., an {8, 8, 8, 7}-bit split of the 31-bit representation. The composition C is again the inverse operation of D.

S-Boxes S. In Eq. (3) we set s = 4 using {8, 7}-bit lookup tables. Then, for y ∈ F_2^8 and y′ ∈ F_2^7, the S-boxes are defined as

S_i(y) = (y ⊕ ((ȳ ≪ 1) ⊙ (y ≪ 2) ⊙ (y ≪ 3))) ≪ 1 for the 8-bit values, and
S_4(y′) = (y′ ⊕ ((ȳ′ ≪ 1) ⊙ (y′ ≪ 2))) ≪ 1 for the 7-bit value

(cf. the *01 landscape in [Dae95, Table A.1]).

Round Constants
The round constants c^{(i)} for the i-th round are generated using the well-known approach of seeding a pseudo-random number generator and reading its output stream. In particular, we use SHAKE-128 with rejection sampling, i.e., we discard elements which are not in F_p. SHAKE-128 is thereby seeded with the initial string "Monolith", followed by the state size t and the number of rounds R, each represented as one byte, the prime p represented by ⌈log_2(p)/8⌉ bytes in little-endian representation, and the decomposition sizes in the Bar layer, where each size is represented as one byte. As a concrete example, this encoding determines the seed for Monolith-31 with t = 16 and R = 6, where b'X indicates that X is to be interpreted as a bytes literal, i.e., as a sequence of bytes each prefixed by \x.
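An illustrative sketch of this generation procedure. The exact seed layout, byte width per candidate, and parsing below are assumptions for demonstration, not the normative encoding; only the overall SHAKE-128 plus rejection-sampling structure follows the text:

```python
# Illustrative round-constant generation via SHAKE-128 with
# rejection sampling (seed layout here is an assumption).
import hashlib

def round_constants(p, t, rounds, width_bytes):
    seed = b"Monolith" + bytes([t, rounds])
    seed += p.to_bytes((p.bit_length() + 7) // 8, "little")
    # one long squeeze; plenty of margin for rejected candidates
    stream = hashlib.shake_128(seed).digest(16 * t * rounds * width_bytes)
    consts, pos = [], 0
    while len(consts) < t * (rounds - 1) and pos + width_bytes <= len(stream):
        cand = int.from_bytes(stream[pos:pos + width_bytes], "little")
        pos += width_bytes
        if cand < p:                     # rejection sampling: keep only F_p
            consts.append(cand)
    return consts

rc = round_constants(2**31 - 1, 16, 6, 4)
assert len(rc) == 16 * 5 and all(c < 2**31 - 1 for c in rc)
```

Rejection sampling keeps the constants uniform over F_p, at the cost of discarding roughly half of the 4-byte candidates for a 31-bit prime.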

Number of Rounds and Security Claims
We design the Monolith permutation to be used in a sponge mode or in a compression mode. Attacks against either of them should require the attacker to perform work in the order of ≈ 2^128, where a slight deviation from this number is due to the chosen prime. To reach this goal with Monolith, and based on our analysis of statistical and algebraic attacks, we suggest using R = 6 rounds for both Monolith-64 and Monolith-31 (see Table 1), and we claim 2 · log_2(p_Goldilocks) ≈ 128 and 4 · log_2(p_Mersenne) ≈ 124 bits of security for Monolith-64 and Monolith-31, respectively. We further claim that a sponge hash function or compression function based on Monolith (both denoted H for brevity) with these values makes it hard to find collisions and (second) preimages with fewer than ≈ 2^128 operations, where the approximation is due to the chosen prime field.
Remark 2. We do not claim that the Monolith permutation is free of any nongeneric property (or "indifferentiable from random"). In particular, we do not consider certain permutation distinguishers, such as the integral one [DKR97] or the zero-sum partitions [KR07, BCC11], that have never resulted in collision or preimage attacks for similar designs.

Security Analysis
The numbers of rounds are conservatively chosen based on the security analysis proposed in Section 5 and Section 6. As some of the components or combinations are new, our analysis contains several nontrivial ideas and may be of separate interest to cryptanalysts and designers.
First, in the spirit of the wide trail strategy [DR02], we prove tight bounds for the number of active squarings in differential characteristics for the Type-3 Feistel-MDS combination in Section 5.1. We also study rebound attacks in Section 5.4, a research direction that is often missed in ZK hash function design. We demonstrate practical attacks on a reduced version of Monolith and argue the security of the full version.
Using differential and linear properties of Bar, we prove lower bounds on its algebraic degree in Section 6.1, which implies resistance against algebraic attacks after a few rounds. In this regard, we additionally study the complexity of Gröbner basis attacks on toy versions of Monolith with smaller primes but still realistic Bars layers in Section 6.3.
To summarize, we are unable to break even 5 rounds of the proposed scheme with any of the basic attacks proposed in the literature. As future work, we encourage the study of reduced-round and/or toy variants of our design.

Differential Attacks
Given pairs of inputs with some fixed input differences, differential cryptanalysis [BS90] considers the probability distribution of the corresponding output differences produced by the cryptographic primitive.
Let ∆_I, ∆_O ∈ F_p^t be respectively the input and the output differences through a permutation P over F_p^t. The differential probability (DP) of having a certain output difference ∆_O given a particular input difference ∆_I is equal to

DP(∆_I → ∆_O) = |{x ∈ F_p^t : P(x + ∆_I) − P(x) = ∆_O}| / p^t.

In the case of iterated schemes, a cryptanalyst searches for ordered sequences of differences over any number of rounds, called differential characteristics/trails. Assuming the independence of the rounds, the DP of a differential trail is the product of the DPs of its one-round differences.
Since the Bars layer is not supposed to have good statistical properties, we simply assume that the attacker can skip it with probability 1. As the maximum differential probability of the square map is 1/p, Proposition 2 and (6) immediately imply the following bound.
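The 1/p bound for the square map can be checked exhaustively for a small prime: for any fixed nonzero input difference Δ, (x + Δ)^2 − x^2 = 2xΔ + Δ^2 is a bijection in x, so every output difference occurs exactly once. A brute-force sketch (small p only):

```python
def max_dp_square(p):
    """Exhaustively compute the maximum differential probability of
    x -> x^2 mod p over all nonzero input differences."""
    best = 0
    for d_in in range(1, p):
        counts = {}
        for x in range(p):
            d_out = ((x + d_in) ** 2 - x ** 2) % p
            counts[d_out] = counts.get(d_out, 0) + 1
        best = max(best, max(counts.values()))
    return best / p
```

For every prime p, the result is exactly 1/p, matching the statement above.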
Corollary 2. Any 4-round differential characteristic for Monolith has a probability of at most p^(−9(t−1)/8).
As a result, even assuming that more characteristics can be used simultaneously in order to set up a differential attack, a differential-based collision attack on 5 rounds looks infeasible.
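As a numerical sanity check, plugging the Monolith-64 parameters used later in the paper (t = 12, p = p_Goldilocks) into the bound of Corollary 2 gives a probability of roughly 2^−792:

```python
from math import log2

p = 2**64 - 2**32 + 1   # p_Goldilocks
t = 12                  # Monolith-64 state size (sponge mode)
# log2 of the Corollary 2 bound p^(-9(t-1)/8)
bound_log2 = -(9 * (t - 1) / 8) * log2(p)
```

This is far below any probability exploitable with ≈ 2^128 work.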

Linear Attacks
Linear cryptanalysis [Mat93] exploits the existence of linear approximations. For primitives over binary fields, the attack makes use of the high correlations [DGV94] between sums of input bits and sums of output bits. The generalization of this attack over prime fields has been proposed in [BSV07, DGGK21]. We claim that our scheme is secure against this approach, due to the low correlation of the map x → x^2 (as for the case of differential attacks).

Rebound Attacks
Rebound attacks [MRST09] have been widely used to analyze the security of various types of hash functions against shortcut collision attacks since the beginning of the SHA-3 competition. A rebound attack starts by choosing internal state values in the middle of the computation, and then computes in the forward and backward directions to arrive at the inputs and outputs. It is useful to think of it as having a central (often called "inbound") part and the aforementioned "outbound" parts. In the attack, solutions to the inbound phase are first found, and are then filtered in the outbound phase. Whereas it is not possible to rigorously prove resistance against rebound attacks, we can provide some meaningful arguments to demonstrate that they are not feasible. The inbound phase deals with truncated and regular differentials. By Corollary 2 we see that a solution for a 5-round differential cannot be found, and so the inbound phase cannot cover more than 4 Bricks layers. In the outbound phase, the Concrete layers that surround these Bricks layers make all differentials diffuse to the entire state, so that the next Bricks layers destroy all of them. We hence conclude that 6 rounds of Monolith are sufficient to prevent rebound attacks. In the following, we describe a rebound collision attack on 3-round (weakened) Monolith.
Rebound Collision Attack on the 3-Round (Weakened) Monolith. The best rebound attack that we have found is a near-collision attack on the reduced 3-round permutation without the Bars layer. We show how to find a state that satisfies a differential ∆_1 → ∆_8 for certain ∆_1, ∆_8 which are equal in the last F_p word, i.e., ∆_{1,t} = ∆_{8,t}. As a concrete application, this yields a zero difference in this word for the compression function x → Tr_{t/2}(P(x) + x), which is a near-collision.
The inbound phase covers 3 layers of Bricks separated by 2 Concrete layers. To find such a state pair, we apply the following approach.
2. The inbound phase covers the expansion of ∆_2 to t words and back to the 2-word difference ∆_7 = [0, 0, ..., 0, δ_2, δ_3]. Note that we have ∆_6 = [0, 0, ..., 0, δ_2, δ_4]. We arbitrarily set δ_2, δ_3 such that ∆_{8,t} = ∆_{1,t} and then choose δ_4 such that the corresponding condition is satisfied.

3. As a result, the differential path for the full 3-round scheme is established, and we determine the state. The (δ_3, δ_4) differential determines the input word x_{t−1} of the third Bricks layer, and the equation determines the input words x_1, x_2, ..., x_{t−1} of the second Bricks layer. Note that this is a system of linear equations, and by solving it we can determine the full state.
Overall, we obtain a partial collision at a negligible cost (the cost of solving the linear system of equations can be approximated by O(t^3), which is much smaller than the cost of constructing the collision in the case of a random permutation, approximated by p^{1/2}). We are not aware of any possible extension of such an attack to more rounds and/or to the version including Bars, which is left as an open problem for future work.

Other Statistical Attacks
We claim that 6 rounds are sufficient for preventing other statistical attacks as well. Here we provide arguments to support this conclusion for one of the most powerful statistical attacks against hash functions, that is, the rebound attack. For that goal, we propose an analysis of the number of fixed points and of the truncated differential characteristics.

Fixed Points
Contrary to Reinforced Concrete, the Bars layer of Monolith has very few fixed points. Both local maps x ⊕ (x̄ ≪ 1) ⊙ (x ≪ 2) ⊙ (x ≪ 3) and x ⊕ (x̄ ≪ 1) ⊙ (x ≪ 2) have about (7/4)^n fixed points (for even and odd n, respectively) when considered over F_2^n (a bit value is preserved if the corresponding product of nearby bits is 0). However, all of them except 0 and 2^n − 1 (the all-one string) are destroyed by the final circular shift (verified experimentally).
For comparison, we recall that a Bar of Reinforced Concrete has 2^{134.5} fixed points out of 2^{254} possibilities. Hence, the probability of encountering a fixed point is approximately 2^{−119.5·3} = 2^{−358.5} for Bars. At the current state of the art, we are not aware of any attack that exploits these fixed points.

Invariant Subspace Attacks
An invariant subspace attack exploits the existence of a subspace X ⊆ F_p^t that remains invariant under the round function. (Note that we do not require that the coset of the subspace stays the same.) The attack is particularly effective either in the case of keyed ciphers instantiated with weak keys [LAAZ11, LMR15] or in the case of partial SPN schemes, in which part of the state remains unchanged after the application of the nonlinear layer. In the latter case, the linear layer and the round constants can be carefully chosen in order to break the invariant subspaces, as shown in detail in [GRS21, GSW + 21]. The wide trail bound implies that such subspaces cannot be based on high-probability differentials: any r-round differential must activate at least (t − 1)(r − 1)/3 squarings, thus making the differential probability prohibitively low for r > 3.
At the same time, while statistical attacks on our design are prevented by our arguments for Bricks and Concrete, an invariant subspace attack can also be based on truncated differentials as in [GRS21, GSW + 21]. There, primitives with a partial nonlinear layer (P-SPN) are analyzed, which may be relevant to the Feistel-based Bricks layer. Concretely, an invariant subspace would include a linear subspace where the words corresponding to the S-box inputs take all possible values and the other words are fixed. Such subspaces propagate through the nonlinear layer with probability 1 and, if they are smaller than the full domain, can be used for an attack. Again, this does not apply to our construction for the following reasons:
• As proved in [Gra23, Sect. 4.2.1], there exists no non-trivial invariant subspace for x → x^2 in F_p. Hence, the subspace can exist only in the case in which the inputs of the square functions are either constant or fully active.
• Suppose we take a linear subspace where some inputs to squarings take all possible values. Then, even when a single squaring is activated, the application of the Concrete layer results in an affine space covering the whole state due to the MDS property. This, in turn, activates all other squarings, so that the resulting subspace has maximum dimension t, which makes it trivial.
As a result, we claim that our design is not vulnerable to invariant subspace attacks. It follows that the approach used in [BCD + 20] for Poseidon, in which an attacker exploits invariant subspaces to set all inputs of Bars to known and precomputed constants, forcing the operation applied to subspace inputs to be of low degree, does not work.

Truncated Differentials
Truncated differential attacks [Knu94] are used mostly against primitives that have incomplete diffusion over a few rounds. This is not the case here, since (i) Bricks is a full nonlinear layer, and (ii) the Concrete matrix is MDS. We have not found any other attacks where a truncated differential can be used as a subroutine either.

Non-Applicable Attacks
We emphasize that we do not claim security of Monolith against zero-sum partitions [BCC11] (which can be set up via higher-order differentials [Knu94, BCD + 20] and/or integral/square attacks [DKR97]). In such an attack, the goal is to find a collection of disjoint sets of inputs and corresponding outputs for the given permutation that sum to zero (i.e., satisfy the zero-sum property). Our choice is motivated by the fact that, to the best of our knowledge, it is not possible to turn such a distinguisher into an attack on the hash and/or compression function. For example, in the case of SHA-3/Keccak [Nat15, BDPA11], while 24 rounds of Keccak-f (that is, full Keccak-f) can be distinguished from a random permutation using a zero-sum partition [BCC11], preimage/collision attacks on Keccak can only be set up for up to 6 rounds of Keccak-f [GLL + 20]. Due to this, Keccak's designers decided to reduce the security margin of Keccak by defining a 12-round version called "KangarooTwelve" [BDP + 18]. By taking these facts into account, and as already done in similar works [GKR + 21, GHR + 23], we ignore zero-sum partitions for practical applications. The same conclusion holds for other attacks, such as the impossible differential attack [BBS99] and the zero-correlation linear attack [BW12]. Focusing on the impossible differential one, the attacker exploits differentials that hold with probability zero. As for the case of zero-sum partitions, we are not aware of any collision or (second) preimage attack based on impossible differential characteristics.

Security Analysis: Algebraic Attacks
Cryptanalytic successes such as Gröbner basis attacks on Friday and Jarvis [ACG + 19], attacks on MiMC combining higher-order differential distinguishers with polynomial factorization [EGL + 20, BCP23, LP19, RAS20], or an attack on Grendel [GKRS22] leveraging polynomial factorization are a stark warning that a thorough analysis of such attack vectors is important.While the use of Bars is intuitively expected to frustrate such attacks, it is nevertheless essential to establish a sound basis for arguments against such attacks.

Degree of the Bars Polynomials
We have verified experimentally that for n ≤ 8 there exists a modular differential for the S-box in Eq. (10) with probability almost 1/4.

Proposition 3. Let n > 4 be such that gcd(n, 3) = 1. Let S be the invertible map over F_2^n given by Eq. (10), that is, x → (x ⊕ (x̄ ≪ 1) ⊙ (x ≪ 2) ⊙ (x ≪ 3)) ≪ 1. Let S̃ be the corresponding mapping over Z_{2^n}, where the elements of F_2^n are viewed as the big-endian counterparts of elements from Z_{2^n}.
We have verified experimentally that for n ≤ 8 there exists a modular differential for the S-box in Eq. (11) with probability 1/16.

Proposition 4. Let n > 4 be such that gcd(n, 2) = 1. Let S′ be the invertible map over F_2^n given by Eq. (11), that is, x → (x ⊕ (x̄ ≪ 1) ⊙ (x ≪ 2)) ≪ 1. Let S̃′ be the corresponding mapping over Z_{2^n}, where the elements of F_2^n are viewed as the big-endian counterparts of elements from Z_{2^n}.
The proof is identical to that of Lemma 4. With Lemma 2, we obtain the following bound on the degree of Bar.
In the following, we describe our practical results on toy-Bars functions defined on smaller prime fields.As expected, they show that the corresponding interpolation polynomial is dense and of high (usually, maximum or close to maximum) degree.

Degree and Density: Practical Results
Evaluating the actual density of the polynomial resulting from Bar applied to a single field element in F_p, where p ∈ {2^64 − 2^32 + 1, 2^31 − 1}, is infeasible in practice. Indeed, any enumeration and subsequent interpolation approach would take far too long. Therefore, in our experiments we focus on smaller finite fields defined by "similar" prime numbers. In particular, we focus on n-bit primes of the form 2^n − 2^η + 1 for η as close to n as possible. We then apply the S-box S_i to smaller parts of the field element, exactly as in Bar where the S-box is applied to each 8-bit part of the larger field element. We also vary the sizes of the parts to which the S_i are applied in order to get a broader picture.
The results of our evaluation are shown in Table 2. For example, in the first case, where p = 2^8 − 2^4 + 1, S_i is applied to the first 4 bits (starting from the least significant bit) and then to the next 4 bits, covering the entire field element. The size of these parts is indicated in the second column. As we can see, the maximum degree is reached for all tested primes of the form 2^n − 2^η + 1, where η > 1. Moreover, for these primes, the density is always close to 100%, mostly matching it. We also applied S_i to elements of F_{2^n − 1} directly, where n ∈ {5, 7, 13}, which resulted in almost maximum-degree polynomials of low density (specifically, only 6, 18, and 629 monomials exist in the polynomial representation, respectively). This suggests that increasing the number of S-box applications per field element (i.e., increasing the number of smaller parts to which the S_i are applied) is beneficial for the density of the resulting polynomial.
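The n = 5 experiment over F_{2^n − 1} = F_31 can be reproduced with textbook Lagrange interpolation. A sketch follows; the 2-term S-box variant is taken from Eq. (11), with the negation placement being our assumption. Since the map fixes both 0 and the all-one string 31, it permutes F_31:

```python
def rotl(x, r, n):
    """Circular left shift of an n-bit value."""
    return ((x << r) | (x >> (n - r))) & ((1 << n) - 1)

def sbox_odd(x, n):
    """2-term chi-like S-box for odd n (negation placement assumed)."""
    mask = (1 << n) - 1
    t = x ^ (rotl(x ^ mask, 1, n) & rotl(x, 2, n))
    return rotl(t, 1, n)

def interpolate(p, values):
    """Lagrange interpolation over F_p: return coefficients c[0..p-1]
    (low degree first) of the polynomial through (i, values[i])."""
    coeffs = [0] * p
    for i in range(p):
        num = [1]       # running product prod_{j != i} (x - j)
        denom = 1
        for j in range(p):
            if j == i:
                continue
            # multiply num by (x - j): new[k] = num[k-1] - j*num[k]
            num = [(a - j * b) % p for a, b in zip([0] + num, num + [0])]
            denom = denom * (i - j) % p
        inv = pow(denom, p - 2, p)  # Fermat inverse
        for k in range(len(num)):
            coeffs[k] = (coeffs[k] + values[i] * num[k] * inv) % p
    return coeffs
```

Counting the nonzero entries of the returned coefficient vector then gives the density of the interpolant.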
We also evaluated the degrees and density values resulting from the inverse S-boxes applied to the field elements, in order to get an estimation of the algebraic strength of the inverse operation. The results match those given in Table 2, where always more than 99% of the monomials are reached together with a degree close to the maximum.

Results for F_p^t. We also ran tests regarding the density over the entire state. Naturally, this task gets harder with an increased number of rounds, since the degrees are rising too quickly. In our tests we focused on p ∈ {2^8 − 2^4 + 1, 2^7 − 1} and t = 4, and we give the results together with the sizes of the smaller S-boxes in Table 3.
As can be seen, the maximum number of monomials is almost reached after a single round. We suspect that some of the monomials are not reached due to cancellations, which is reasonable when considering these small prime fields. Still, we acknowledge this fact by adding another round on top of that in order to ensure that all polynomial representations of the state are dense and of maximum degree. Thus, having 6 rounds achieves 4 rounds of security margin regarding degrees and density of polynomials.

Table 3: Degree and density of the polynomials after a single round, where t = 4 and two input variables are used (with the other two input elements being fixed). Columns: p, bit splittings, degree, density.

Security against Algebraic Attacks via Bars
Here we consider attacks that exploit the fact that several rounds of the permutation do not have maximum possible algebraic degree.For this, we interpret the output elements as polynomials of the input elements.Then we formulate a collision or a preimage attack as a system of equations and try to solve it.

Interpolation Attacks
Interpolation attacks [JK97] exploit the low degree of a component to reconstruct its polynomial representation and solve a system of equations. However, we have demonstrated that the degree of the Bar component is close to p. Therefore, after at most 2 rounds of Monolith, the degree in each variable becomes almost p, which implies that mounting the attack is infeasible.
Impact of Invariant Subspaces. Note that the Bars layer is partial, using only u Bar components. Thus, excluding the Type-3 Feistel layer, it may be possible to pass r rounds by guessing r · u intermediate variables. However, as u ≥ t/3, and due to the analysis proposed in Section 5.4.2, this is possible for at most 2 rounds (without exhausting the degrees of freedom). We conclude that it is not feasible to apply simple algebraic attacks on 4 or more rounds of Monolith.

Solving a CICO Problem with Univariate polynomials
In the CICO (constrained input/constrained output) problem, the goal is to find a solution to a system of v polynomial equations in t − v input variables (as the remaining v ones are set to zero). We formalize it in the following.
Definition 4 (CICO Security). A permutation P :

The univariate system appears if v = t − 1 or if we guess t − v − 1 variables. Note that our guess may be invalid if the number of equations exceeds the number of variables, so we may have to repeat the guess up to p^{v−1} times. Note also that p is smaller than 2^128, so p^{v−1} guesses may still be feasible.
• If v = 1 and we have guessed t − 2 variables, then we have to solve a single polynomial equation faster than in time p. The degree of the polynomial reaches p after 2 applications of the Bars layer, i.e., after 2 rounds. Therefore, solving the equation requires time ≈ p.
• If v > 1 and we have guessed t − v − 1 variables, then the probability that a CICO solution exists for a particular guess is p^{−(v−1)}, since we only solve one equation and hope for the other v − 1 to hold. A system of polynomial equations has degree close to p, so solving it costs at least time p for any guess. Multiplying by the number of guesses, we obtain that the total complexity still exceeds p · p^{v−1} = p^v.

Solving the Multivariate CICO Problem with Gröbner Bases
In the general case, we model the CICO problem as a system of multivariate polynomial equations generating a zero-dimensional ideal. The main technique for solving such systems is to compute a Gröbner basis and apply the following steps.
1. Compute a Gröbner basis for the zero-dimensional ideal of the system of polynomial equations with respect to the degrevlex term order.
3. Factor the univariate polynomial in the lex Gröbner basis and determine the solutions for the corresponding variable.Back-substitute those solutions, if needed, to determine solutions for the other variables.
The total complexity of a Gröbner basis attack is hence the sum of the respective complexities of the above steps.We argue that even step 1. is prohibitively expensive for Monolith.
The complexity of computing a Gröbner basis with (matrix-based) algorithms such as Lazard's [Laz79, Laz83], F4 [Fau99], or Matrix-F5 [BFS15] for an equation system with n_e equations in n_v variables over a field F can be bounded by

C_GB ∈ O((n_e · binom(n_v + d_solv, d_solv))^ω)    (12)

operations in F. Here, d_solv denotes the solving degree and ω denotes the linear algebra exponent. Intuitively, d_solv corresponds to the maximum degree attained during a Gröbner basis computation. Thus, the overall complexity of computing a Gröbner basis can be understood as bounded by row-reducing (full-rank) matrices of size n_e · binom(n_v + i − 1, i) × binom(n_v + i − 1, i), for i = 0, 1, ..., d_solv, eventually leading to the bound in Eq. (12). In practice, the Macaulay matrices built during a Gröbner basis computation might be sparse and have a substantial rank defect, and Eq. (12) does not account for this particular structure in the Macaulay matrices.

Rationale for our Security Arguments
As a conservative choice and to account for the structured Macaulay matrices in the algebraic model for Monolith, in Eq. (12) we drop any factors from the asymptotic O(·) notation and set n_e = ω = 1, and hence use

C_GB(n_v, d_solv) = binom(n_v + d_solv, d_solv)    (13)

for estimating the complexity of actual Gröbner basis computations. We stress that setting ω = 1 is a highly optimistic scenario from an attacker's viewpoint. Establishing concrete estimates for C_GB hence boils down to bounding the solving degree d_solv. This task is in general a difficult problem in its own regard, often as hard as actually computing a Gröbner basis. However, for the special case of (semi-)regular sequences, there exist bounds on d_solv. In particular, for regular sequences d_solv is upper-bounded by the Macaulay bound [BFS15]

d_Mac = 1 + Σ_{i=1}^{n_e} (d_i − 1),

where d_i denotes the degree of the i-th equation. Informally, the case of regular sequences can be regarded as a generic case, formalizing the notion of "random polynomial systems". Although the assumption of regular sequences often fails for algebraic models of circuit-friendly primitives, comparing a given algebraic model with this generic case can still be an informative approach and help to establish heuristic estimates for the complexity of Gröbner basis computations when practical experiments are infeasible. In our analysis, we compare the actual solving degree d_solv from our practical experiments with d_Mac. This allows us to extrapolate trends from the acquired data points to large-scale instances, which are computationally intractable.
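Under this reading of the estimates (the exact forms of Eqs. (12) and (13) are partially garbled in our copy, so the formulas below are our reconstruction), both quantities are straightforward to compute:

```python
from math import comb, log2

def macaulay_bound(degrees):
    """Macaulay bound for a regular sequence of the given equation
    degrees: d_Mac = 1 + sum_i (d_i - 1)."""
    return 1 + sum(d - 1 for d in degrees)

def gb_cost_log2(n_v, d_solv):
    """log2 of the optimistic Groebner basis cost binom(n_v + d_solv, d_solv),
    i.e., our reconstruction of Eq. (12) with n_e = omega = 1."""
    return log2(comb(n_v + d_solv, d_solv))
```

For instance, a regular system of three equations of degrees 2, 2, 3 has d_Mac = 5.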
When analyzing a given algebraic model, another problem is scalability: it is nontrivial to properly scale down the original system of equations to some small-scale variant that is solvable on a standard machine.We tackle this problem and estimate the complexity of a Gröbner basis attack on the CICO problem for full-scale Monolith as described below.We point out that we only focus on step 1. of a Gröbner basis attack and show that already the complexity of this step exceeds the generic CICO security level.
• We consider a small-scale, weakened version of one round of Monolith, denoted SmallWeak1R, with a small state of only t = 4 elements and u = 2 Bar functions in the Bars layer. We have SmallWeak1R = Concrete′ ∘ Bricks ∘ Bars ∘ Concrete′, where for Concrete′ we use the circulant matrix M = circ(2, 1, 1, 1), which is not MDS and thus weaker than the MDS matrix used in Monolith. For Bricks, we use the same Bricks as described in Section 4.4, with t = 4. The Bars function is the same function described in Section 4.3, with t = 4 and a decomposition into m = 2 buckets for all small primes for which we run actual computations; see also Table 4.
For the S-Box functions inside Bar, we use suitable functions from [Dae95, Table A.1].
• We use the following CICO problem, called SmallWeak1R-CICO, in our analysis: find an input satisfying the CICO constraints for SmallWeak1R.
• We suggest an arguably optimal model for SmallWeak1R-CICO, denoted by the same name, as a system of polynomial equations.
• For various small primes, we run actual Gröbner basis computations on the model SmallWeak1R-CICO and observe the behavior for these small-scale instances.
• Extrapolating heuristically, we argue that the complexity of computing a Gröbner basis for SmallWeak1R-CICO, also for larger primes, is around the estimate given in Eq. (15).

For the original, full-sized primes p_Goldilocks and p_Mersenne, this yields a complexity estimate for solving SmallWeak1R-CICO via Gröbner basis techniques of 2^154 operations in F_p for p = p_Goldilocks, and 2^93 operations for p = p_Mersenne. Compared to the generic CICO-security level of 2^64 and 2^31 function calls for p_Goldilocks and p_Mersenne, respectively, our analysis suggests ample security margin against Gröbner basis attacks on SmallWeak1R-CICO.
Impact of Invariant Subspaces. As for the interpolation attack, we recall that it is not possible to force all Bars to be constant, since
• the degrees of freedom are not sufficient, and
• there exists no invariant subspace that results in such a scenario, as shown in Section 5.4.2.

Optimal Model for SmallWeak1R-CICO
Algebraic Model for Bar. We suggest the following algebraic model for Bar for a decomposition of a prime field element into m buckets with sizes 2^{s_1}, 2^{s_2}, ..., 2^{s_m}. Here, b_1 = 1 and b_i := 2^{s_1 + ··· + s_{i−1}} for 2 ≤ i ≤ m, and L_i : F_p → F_p is the interpolation polynomial over F_p of degree 2^{s_i} − 1 for the S-box S_i. The resulting system consists of m + 2 equations, namely m equations of respective degrees 2^{s_1}, ..., 2^{s_m}, one equation of degree max_i(2^{s_i} − 1), and one equation of degree 1. The m + 2 variables are x_1, ..., x_m, x, y.
The function SmallWeak1R = Concrete′ ∘ Bricks ∘ Bars ∘ Concrete′ is a small-scale and weakened version of one round of Monolith defined on t = 4 words and u = 2 Bar functions in the Bars layer. For Concrete′, we use the circulant matrix M = circ(2, 1, 1, 1), which is not MDS and thus weaker than the MDS matrix in Monolith. Our algebraic model for SmallWeak1R-CICO, denoted by the same name, is given by a system of equations in which H(·)_i denotes the i-th element of the output of the function H for i ∈ {1, 2, 3, 4}. We note that each Bar function decomposes a prime field element into m = 2 buckets; hence, w_i = Bar(u_i) denotes the above algebraic model for Bar with a decomposition into m = 2 buckets. The resulting equation system consists of 10 equations with
• 4 equations for each Bar system w_i = Bar(u_i), i = 1, 2, and
• 2 equations for modelling the CICO constraint at the input and the output.
In total, we have 10 variables, namely u_1, u_2, u_3, u_4, w_1, w_2, and 2 internal variables for each Bar system.

Results of our Gröbner Basis Experiments
Based on the (heuristic) estimate presented in Eq. (15), we argue that one round of full Monolith, given by 1R := Concrete ∘ Bricks ∘ Bars ∘ Concrete, provides ample security against Gröbner basis attacks as well. Intuitively, it is reasonable to assume that an increased state size and/or an increased field size do not make the attacks more efficient (given the same ratio of CICO constraints and Bar applications).
In more detail, let 1R-CICO denote the following CICO problem for 1R: find inputs of the form (0^v, x) such that the first v output words of 1R are also zero, where 0^v denotes a v-tuple with all entries being zero. For p_Goldilocks we have t = 12, v = 4, and for p_Mersenne we have t = 24, v = 8. This amounts to a generic CICO-security level of 2^256 and 2^248 function calls, respectively. Extrapolating Eq. (15), we arrive at an estimated Gröbner basis complexity for 1R-CICO of 2^334 operations in F_p for p = p_Goldilocks, and 2^420 operations for p = p_Mersenne. We summarize the results of our Gröbner basis analysis in Table 4.
Discussion of Gröbner Basis Experiments. The results of our Gröbner basis experiments on small-scale instances of SmallWeak1R-CICO, described in Eq. (14), are depicted in Table 4. We conducted our experiments on a machine with an Intel Xeon E5-2630 v3 @ 2.40GHz (32 cores) and 378 GB RAM under Debian 11, using Magma V2.26-2.
For the maximum degree d_solv reached during a Gröbner basis computation, we see that the ratio d_Mac : d_solv is higher than 4. Moreover, C = C_GB(n_v, d_Mac/4) can be seen as a lower bound for the actual computation time T.

Alternative Representations
Several constructions have been cryptanalyzed using specially crafted algebraic representations of internal components [BBLP22, BBL + 24]. As this is a heuristic process, the absence of such representations is difficult to guarantee. In this section, we discuss the most effective representations we have found.
The exact attacks from those papers do not seem applicable here. In [BBLP22], the authors find a strategy to skip 2 rounds of some SPN-like permutations by making use of statistical properties rather than algebraic ones. In more detail, one round is skipped due to the fact that the considered permutations do not start with a matrix multiplication, but rather with an S-box layer. An extra round is skipped by exploiting the fact that the scheme is an SPN instantiated with S-boxes that are power maps, so that (a · x)^d = a^d · x^d. In our case, the permutation starts with a matrix multiplication, the Bars layer is not defined via power maps, and Bricks is a Feistel network, so the result does not apply.
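The multiplicativity of power maps exploited in [BBLP22] is elementary to check; the Feistel-based Bricks and the χ-like Bars do not satisfy such an identity:

```python
# Power maps commute with scalar multiplication: (a*x)^d = a^d * x^d mod p.
# Example values are arbitrary; p is the Mersenne prime used by Monolith-31.
p = 2**31 - 1
a, x, d = 123456789, 987654321, 7
lhs = pow(a * x % p, d, p)
rhs = pow(a, d, p) * pow(x, d, p) % p
```

This structural property is what allows an attacker to pull constants through a power-map S-box layer.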
In [BBL + 24], the attacker can exploit a similar trick to skip some initial rounds of Griffin. Roughly speaking, a proper choice of the inputs makes some rounds linear, due to the particular Horst structure used there. In our case, the Bars layer is only partial; hence, it is possible to skip some of the Bar functions by choosing an appropriate subspace.

Algebraic Attacks over F 2
We also consider algebraic attacks working over the binary field F 2 , due to the low degree of Bars in this setting.Here we demonstrate that the squaring operation of Bricks has a high degree as a multivariate polynomial over F 2 .
Since 2 d−0.5 is odd for d = 15 and d = 30, Lemma 1 implies the following bound on the degree of the squaring function over F_2.

Proposition 6. Let p ∈ {p_Mersenne, p_Goldilocks} as in Eq. (8). Let F_sq be an interpolant over F_2^{⌈log_2 p⌉} of the squaring operation F(x) = x^2 over F_p. Then F_sq has degree (multivariate over F_2) at least d, where (i) d = 30 for p = 2^64 − 2^32 + 1, and (ii) d = 15 for p = 2^31 − 1.
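For small parameters, the multivariate degree over F_2 of x → x^2 mod p can be measured directly by computing the algebraic normal form of every output bit. A brute-force sketch (the full-size primes are out of reach this way):

```python
def anf_degree(p, n):
    """Multivariate degree over F_2 of x -> x^2 mod p on n input bits,
    via the Moebius transform of each output coordinate (small p only).
    Inputs x >= p are mapped to 0 (for Mersenne p, only x = 2^n - 1,
    which is congruent to 0 anyway)."""
    deg = 0
    for bit in range(n):
        f = [(((x * x) % p) >> bit) & 1 if x < p else 0
             for x in range(1 << n)]
        step = 1
        while step < len(f):  # Moebius transform: truth table -> ANF
            for i in range(0, len(f), 2 * step):
                for j in range(i, i + step):
                    f[j + step] ^= f[j]
            step *= 2
        for u, c in enumerate(f):
            if c:
                deg = max(deg, bin(u).count("1"))
    return deg
```

Already for p = 2^7 − 1 the function is visibly nonlinear over F_2, in line with the proposition above.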
Since Bars is of degree 2 over F 2 , and since Concrete is a nonlinear function over a binary field, we claim that Monolith is secure against algebraic attacks instantiated over the binary field.

Native Performance
We compare the performance of Monolith and its competitors in Table 5. All benchmarks were taken on an AMD Ryzen 9 7900X CPU (single-threaded, 4.7 GHz).
We included implementations of Monolith in the framework of [IAI21], and also added instantiations of the widely popular Poseidon [GKR + 21], its modification Poseidon2 [GKS23], and Griffin [GHR + 23] with p = 2^64 − 2^32 + 1, following their original instance generation scripts. We benchmark these hash functions with a state size of t = 8 for the compression mode and of t = 12 for the sponge mode in order to have a fair comparison. We also compare against Tip5 with its fixed state size of t = 16 using the implementation from [SLS + 23], and against Tip4′, a faster instance of Tip5 with a fixed state size of t = 12, using the implementation from [Sal23]. We also compare against Reinforced Concrete instantiated with the scalar field of the BN254 curve, and against SHA3-256/SHA-256 as implemented in RustCrypto. The constant-time versions of Tip5 and Reinforced Concrete are our modifications of the original code, which may not be optimized; thus, they are given as estimates.
Finally, we compare Monolith-31 with Poseidon and Poseidon2 over the p_Mersenne prime field and state sizes of t = 16 and t = 24 (again for sponge and compression mode), as well as for a constant-time implementation (constant-time F_p operations and no lookup tables).

Most interestingly, the performance gap between arithmetization-friendly hash functions and traditional ones is now closed, with SHA3-256 being slower than Monolith-64 with t = 8 and only 21 ns faster than Monolith-64 in the sponge mode with t = 12. While we acknowledge that SHA3 achieves a higher throughput when measured in cycles per byte (cpb) due to its larger rate of 1088 bits, the somewhat lower rate of Monolith-64 is chosen to best fit the ZK use cases described in this paper. For completeness, we mention that SHA3-256, as given in Table 5, achieves a performance of 6.56 cpb, while Monolith-64 achieves 9.53 cpb and 15.45 cpb in compression and sponge mode, respectively.
Regarding Monolith-31 for the 31-bit Mersenne prime field, we observe that we still get a fast native performance with 210 ns for t = 16. This is significantly faster than Tip5, which has the same state size but is implemented over the larger 64-bit prime field. Only for t = 24 do we observe a slower native performance, which is due to the usage of a 32 × 32 circulant MDS matrix in the Concrete layer, which we use in order to be able to implement it via a radix-2 FFT (see Note on MDS Matrices below). However, competing designs such as Tip5 also rely on MDS matrices and thus will either suffer from the same performance loss, or, if they come up with better matrices/implementations, these can be used in Monolith-31 as well. Nonetheless, one can observe that Monolith-31 is still faster than the closest competitor for the same field and state size, i.e., Poseidon2, by 300 ns.
Unlike lookup-based designs, Monolith does not rely on lookup tables, and its structure allows for constant-time implementations without significant performance loss. The binary χ-like layer can be efficiently implemented in a vectorized fashion that does not require an explicit (de-)composition, while unrolling the lookup tables containing repeated power maps in Reinforced Concrete, Tip5, and Tip4′ adds considerable workload to the computation. Thus, the overhead of going to a constant-time implementation of Monolith consists only of supporting constant-time prime field arithmetic, which can help in efficiently preventing side-channel attacks such as the ones proposed in [TBP20]. Using a constant-time reduction leads to a slight slowdown in our comparison. However, the resulting runtimes are still significantly faster than the non-constant-time runtimes of other circuit-friendly hash functions, such as Poseidon, Griffin, and Tip4′ for t = 8 and t = 12. Moreover, a constant-time Monolith-64 in compression mode is still faster than SHA3-256 for t = 8 (although we acknowledge the different sponge rates and security margins of the two constructions).
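To illustrate what a constant-time reduction means here, the following is a minimal sketch of a branchless conditional subtraction modulo the Goldilocks prime, emulating 64-bit unsigned semantics. It is illustrative only: Python integers are not themselves constant-time, and this is not the paper's actual implementation, just the standard mask-and-select idiom a constant-time field backend would use.

```python
P = 2**64 - 2**32 + 1  # the Goldilocks prime p = 2^64 - 2^32 + 1
MASK64 = 2**64 - 1

def ct_reduce(x: int) -> int:
    """Branchless x mod P for 0 <= x < 2*P.

    Instead of the data-dependent branch `if x >= P: x -= P`, we detect the
    borrow of x - P and select between x and x - P with masks only, so the
    sequence of operations is independent of the secret value.
    """
    t = x - P
    borrow = (t >> 64) & 1          # 1 iff x < P (subtraction underflowed)
    mask = borrow * MASK64          # all-ones when x < P, else zero
    return (x & mask) | (t & ~mask & MASK64)
```

The same idiom carries over to the final reduction after multiplication in a real 64-bit implementation, where `borrow` comes from the carry flag rather than a shift.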
Finally, for completeness, we give the runtime of each part of the Monolith permutation for both a constant- and a variable-time version in Appendix C.
Note on MDS Matrices. We use matrix multiplications based on fast Fourier transforms and circulant matrices for the linear layer of Monolith. For t ∈ {8, 12, 16} we use matrices whose dimensions correspond to the state size. However, for t = 24, we use a circulant matrix of dimension 32 × 32 [HS24]. This allows us to efficiently employ a radix-2 algorithm. In more detail, if the input to the linear layer is (x_1, ..., x_24), the output is defined by (y_1, ..., y_24)^T = Trunc_24(M · (x_1, ..., x_24, 0, ..., 0)^T) with 8 appended zeroes, where M ∈ F^{32×32} and Trunc_n(·) yields the first n elements of its input. While the multiplication uses a 32 × 32 MDS matrix, the final output is the result of a multiplication by a 24 × 24 (non-circulant) MDS matrix, since every submatrix of an MDS matrix is also MDS. This approach leads to an advantage of around 15% compared to the naive multiplication with a generic 24 × 24 matrix.
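The zero-pad-and-truncate trick can be checked on a toy example. The following sketch uses small dimensions (8 → 6 instead of 32 → 24), the prime 97, and an arbitrary first row, none of which are Monolith's actual parameters; it only demonstrates that truncating the padded circulant product equals multiplying by the top-left submatrix.

```python
# Toy illustration: Trunc_n(M . (x, 0, ..., 0)) equals multiplication by the
# top-left n x n submatrix of the N x N circulant matrix M.
p = 97                               # small illustrative prime (not Monolith's)
first_row = [3, 1, 4, 1, 5, 9, 2, 6] # arbitrary first row, defines M
N, n = 8, 6                          # full and truncated dimensions

# Circulant matrix: row i is the first row rotated right by i positions.
M = [[first_row[(j - i) % N] for j in range(N)] for i in range(N)]

x = [7, 0, 42, 13, 5, 88]            # n input elements
padded = x + [0] * (N - n)           # append N - n zeroes

full = [sum(M[i][j] * padded[j] for j in range(N)) % p for i in range(N)]
truncated = full[:n]

# Same result via the n x n submatrix (which is itself MDS if M is MDS).
sub = [sum(M[i][j] * x[j] for j in range(n)) % p for i in range(n)]
assert truncated == sub
```

In the real implementation the padded product is computed via a radix-2 FFT of length 32 rather than the naive triple loop shown here.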

Performance in Proof Systems
A modern zero-knowledge proof system defines arithmetization rules for the circuit it attempts to prove. Most new proof systems support the Plonkish arithmetization, where all input, output, and intermediate variables are put into a witness matrix W with a fixed number of rows and columns. The data in each row is restricted by polynomial equations determining the values and computations used. One of these generic equations of degree 2 is a_i x_1 x_2 + b_i x_3 + c_i x_4 + d_i = 0, where a_i, b_i, c_i, d_i are public constants for the i-th row [GWC19]. The arithmetization allows for different tradeoffs w.r.t. the number of columns, variables being used, and the final degrees. Additionally, various tuples within a row may be constrained to a set of values in a predefined table T.
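As a concrete reading of this degree-2 gate, the following sketch checks one witness row against the selector constants; the selector values shown (a multiplication gate x_1 · x_2 = x_4, encoded with c_i = −1 mod p) are a hypothetical example, not taken from any particular circuit.

```python
def plonkish_row_ok(row, selectors, p):
    """Check the generic degree-2 gate a*x1*x2 + b*x3 + c*x4 + d = 0 (mod p)."""
    x1, x2, x3, x4 = row
    a, b, c, d = selectors
    return (a * x1 * x2 + b * x3 + c * x4 + d) % p == 0

p = 2**64 - 2**32 + 1  # Goldilocks prime

# Multiplication gate x1 * x2 = x4, i.e., a = 1, b = 0, c = -1 mod p, d = 0.
mul_gate = (1, 0, p - 1, 0)
assert plonkish_row_ok((3, 5, 0, 15), mul_gate, p)       # 3 * 5 == 15
assert not plonkish_row_ok((3, 5, 0, 16), mul_gate, p)   # 3 * 5 != 16
```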
A precise comparison of different arithmetizations is hard without implementing and testing. However, a significant part of the work is to construct s degree-ρ polynomials for the witness columns and to prove that they satisfy the polynomial equations. The total work is then estimated as an element in O(d · ρ · s), where d is the maximum degree of a row polynomial. The cost of using table lookups in FRI-based schemes is currently equivalent to the use of a single polynomial of degree max{ρ, |T|}.
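As a toy illustration of this area-degree-product heuristic (with entirely hypothetical polynomial counts and degrees, not the values from Table 6), a lower constraint degree can outweigh a larger witness:

```python
def area_degree_product(d, s, rho):
    """O(d * rho * s) prover-work heuristic: max row-constraint degree d,
    s witness polynomials, each of degree rho."""
    return d * s * rho

rho = 2**10  # hypothetical trace length

# Design A: high-degree round function, few witness columns.
# Design B: degree-2 constraints, but twice as many witness columns.
cost_a = area_degree_product(d=7, s=50, rho=rho)
cost_b = area_degree_product(d=2, s=100, rho=rho)
assert cost_a / cost_b == 1.75  # B is estimated ~1.75x cheaper despite more columns
```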
In this section we give possible arithmetizations for translating Monolith into a set of Plonkish constraints and refer to Appendix B.1 for R1CS constraints. Our Plonkish arithmetization is designed to accommodate lookup constraints capable of efficiently looking up 8-bit values. If the proof system is able to use larger tables (e.g., 16-bit ones), then multiple lookup constraints can be combined into just one larger constraint, reducing the total number of constraints.

Plonkish Arithmetization. Each composition in Bar can be written as x_i = Σ_{j<s} 2^{8j} x'_{i,j} and y_i = Σ_{j<s} 2^{8j} y'_{i,j}, while also making sure that the limbs in the decomposition correspond to field elements. For p_Goldilocks, this means enforcing that either the least significant 32 bits of Bar's input are 0 or the most significant 32 bits are not all 1. For p_Mersenne = 0x7fffffff we need to make sure that the combined values are ≠ p, which is equivalent to the limbs not all being maximal, i.e., not all being 2^8 − 1 (for the three 8-bit limbs) resp. 2^7 − 1 (for the single 7-bit limb). We describe the application of the s individual S-boxes with s lookup constraints (x'_1, y'_1), (x'_2, y'_2), ..., (x'_s, y'_s). These also include the range checks for each input, which are necessary for the correctness of the constraints above.
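The Goldilocks validity condition on a limb decomposition can be sketched directly (8-bit limbs, least significant first; the helper names are ours, not from the paper's implementation):

```python
P = 2**64 - 2**32 + 1  # p_Goldilocks

def decompose(x, s=8):
    """Split a 64-bit value into s 8-bit limbs, least significant first."""
    return [(x >> (8 * j)) & 0xFF for j in range(s)]

def is_valid_goldilocks(limbs):
    """x < P iff the low 32 bits are zero OR the high 32 bits are not all 1,
    since P - 1 = 0xFFFFFFFF00000000."""
    lo = sum(l << (8 * j) for j, l in enumerate(limbs[:4]))
    hi = sum(l << (8 * j) for j, l in enumerate(limbs[4:]))
    return lo == 0 or hi != 0xFFFFFFFF

assert is_valid_goldilocks(decompose(P - 1))      # largest field element
assert not is_valid_goldilocks(decompose(P))      # P itself is out of range
assert not is_valid_goldilocks(decompose(2**64 - 1))
```

In the circuit this disjunction is of course expressed as polynomial constraints over the limbs rather than as a Python branch.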
Apart from the 2s lookup variables per Bar, we define u variables at the output of the first Concrete layer (these are the inputs to the Bars layer) and t variables at the output of each of the following Concrete layers (except for the last one). The reason is that the variables after the first Concrete layer store linear relations in the input, and only the u variables entering the Bars layer are needed. For the last layer, the output variables can be used directly. In total, we have 6 · (2us + u) + 5t + u variables, where {u = 4, s = 8} for the p_Goldilocks case and {u = 8, s = 4} for the p_Mersenne case (considering S-boxes of ≈ 8 bits).
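The variable count follows directly from this formula; the state sizes plugged in below (t = 12 for Monolith-64 in sponge mode, t = 16 for Monolith-31) are example choices from the benchmark section, and 6 is the round number:

```python
def witness_count(u, s, t, rounds=6):
    """Total Plonkish variables: rounds * (2*u*s + u) + 5*t + u, per the text.
    u Bars per round, each decomposed into s limbs (2*s lookup variables)."""
    return rounds * (2 * u * s + u) + 5 * t + u

goldilocks_vars = witness_count(u=4, s=8, t=12)  # Monolith-64, sponge, t = 12
mersenne_vars = witness_count(u=8, s=4, t=16)    # Monolith-31, t = 16
```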
In Table 6 we compare the (non-optimized) arithmetization of Monolith with those of other 64-bit designs (see Appendix B.2 for details). To achieve a fair comparison, we do not apply any constraint or witness optimization, but try to follow the same approach for all designs. We see that both the number of lookups and the number of constraints in Monolith are slightly larger than in Tip5 and Tip4′, but the constraint degree is smaller by a factor of 3.5, which should result in an overall decrease of the prover time by a factor of at least 2 (estimated as the area-degree product). This is reasonable, since Tip5 and Tip4′ are able to process more field elements per permutation call. Poseidon, Poseidon2, and Rescue-Prime, due to their comparably small witness sizes and absence of lookup tables, are estimated to still provide faster proving performance, closely followed by Monolith-64 with its low-degree nonlinear layers. Again, we stress that these numbers are derived from non-optimized arithmetizations and are subject to change. For example, one can leverage the low degree of Monolith to reduce the witness size by trading it for a larger-degree round function. We refer to Appendix B.3 for details. Furthermore, these estimates are based on a simplified performance metric (the area-degree product) which does not consider every aspect of prover performance, and benchmarks in real proof systems might differ.
Benchmarks in Plonky2. We implemented Monolith-64 in the Plonky2 proof system to verify the estimations of Table 6. Plonky2 uses FRI commitments and hence works well with small prime fields. Since it already comes with a custom gate for Poseidon in sponge mode (t = 12), where the entire gate is put into just one row of the trace, we implement Monolith-64 in sponge mode with the same parameters. To highlight the main advantage of Monolith-64, namely its fast native performance, we benchmark proving a Monolith-64 permutation while using Monolith-64 as the hash function to build the Merkle trees. Similarly, we benchmark Poseidon when using Poseidon as the hash function (which is the default setting in Plonky2). The results can be seen in Table 7.
One can observe that since Monolith requires more witnesses than Poseidon and both gates use just one row in the trace, the resulting proof is larger. However, the combination of proving Monolith-64 while using it as the Plonky2 hash function leads to half the prover and verifier runtime compared to Poseidon.

Figure 1: Comparison of hash functions in various settings (logarithmic scale). The native benchmarks ("Time", "Const Time") are from Table 5; the numbers for Monolith-64, Poseidon, and Poseidon2 are taken for the 64-bit prime field and a state size of t = 12. Proof (IVC) timings are benchmarks for a proof of preimage knowledge (Table 7). Numbers for SHA3-256 and SHA-256 are extrapolated from a circom implementation using R1CS [Bal23].

Figure 3: Chunk and an aligned bucket decomposition of the number 52860. Here ξ = 3 and s = 7.

Figure 4: One round of the Monolith construction, where x_i, y_i ∈ F_p.

Table 2: Degree and density of the polynomials resulting from Bar applied to various field elements.

Table 4: Results of Gröbner basis computations on several instances of SmallWeak1R-CICO, described in Eq. (14), for various small primes p, decomposition into m = 2 buckets with bucket sizes 2^{s_1}, 2^{s_2}, and extrapolation to 1R-CICO. Here, n_e and n_v denote the number of equations and variables, respectively. The degree d_solv denotes the maximum degree reached during a GB computation with Magma. T is the runtime in microseconds (10^{-6} s). For the complexity C we use the estimate C = C_GB(n_v, d_Mac/4). Extrapolated estimates are in italics.

Table 5: Native performance in nanoseconds (ns) of different hash functions for variable- and constant-time implementations. Benchmarks are given for one permutation call, i.e., hashing ≈ 500 bits for all but the SHA functions. Estimates are in italics.

Tip4′, the fastest lookup-table-based design, is also slower by a factor of 1.9 compared to Monolith in compression mode, and slower by 36 ns compared to Monolith with the same state size t = 12.

Table 6: Plonkish arithmetization comparison for various schemes. The numbers are for a single permutation.

Table 7: Proving performance in Plonky2 using Goldilocks and sponge mode.