
Universal and Succinct Source Coding of Deep Neural Networks
Sourya Basu and Lav R. Varshney, Senior Member, IEEE
Abstract-Deep neural networks have shown incredible performance for inference tasks in a variety of domains, but require significant storage space, which limits scaling and use for on-device intelligence. This paper is concerned with finding universal lossless compressed representations of deep feedforward networks with synaptic weights drawn from discrete sets, and directly performing inference without full decompression. The basic insight that allows less rate than naïve approaches is recognizing that the bipartite graph layers of feedforward networks have a kind of permutation invariance to the labeling of nodes, in terms of inferential operation. We provide efficient algorithms to dissipate this irrelevant uncertainty and then use arithmetic coding to nearly achieve the entropy bound in a universal manner. We also provide experimental results of our approach on several standard datasets.
Index Terms-Neural network compression, source coding, entropy coding, artificial neural networks.

I. INTRODUCTION
Deep learning has achieved incredible performance for inference tasks such as speech recognition, image recognition, and natural language processing. Deep neural networks, however, are enormous cloud-based structures that are often too large and too complex to perform fast, energy-efficient inference on device. Even in the cloud, the size and complexity of neural networks limit deployment. Compression, with the further capability of inference without full decompression, is therefore important. Here, we develop new universal source coding techniques for feedforward deep networks having synaptic weights drawn from finite sets that essentially achieve the entropy lower bound; we also compute the entropy bound. Further, we provide an algorithm to use these compressed representations for inference tasks without complete decompression. Structures that can represent information near the entropy bound while also allowing efficient operations on them are called succinct structures [6], [7], [8], [9]. Thus, we provide a succinct structure for feedforward neural networks.
Neural networks are composed of nodes connected by directed edges. Feedforward networks (multilayer perceptrons) have connections in one direction, arranged in layers. An edge from node i to node j propagates an activation value a_i from i to j, and each edge has a synaptic weight w_ij that determines the sign/strength of the connection. Each node j computes an activation function g(·) applied to the weighted sum of its inputs, which we can note is a permutation-invariant function:

g( Σ_i w_ij a_i ) = g( Σ_i w_π(i)j a_π(i) )    (1)

for any permutation π of the inputs. Nodes in the second layer are therefore indistinguishable from one another in terms of inferential operation.
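To see this invariance concretely, the following small Python sketch (illustrative only; the tanh activation and ternary weights are assumptions, not choices made in the paper) checks that relabeling the hidden nodes of a two-layer network, i.e., permuting the columns of the first weight matrix together with the rows of the second, leaves the network output unchanged.

import numpy as np

# Relabeling hidden nodes (with the adjacent weight matrices permuted
# consistently) does not change the function computed by the network.
rng = np.random.default_rng(0)
N = 5
g = np.tanh                                       # any elementwise activation

W1 = rng.choice([-1.0, 0.0, 1.0], size=(N, N))    # input -> hidden weights
W2 = rng.choice([-1.0, 0.0, 1.0], size=(N, N))    # hidden -> output weights
x = rng.normal(size=N)

perm = rng.permutation(N)                         # relabel the hidden nodes
W1_p = W1[:, perm]                                # column j feeds hidden node j
W2_p = W2[perm, :]                                # row j carries hidden node j's output

y = g(g(x @ W1) @ W2)
y_p = g(g(x @ W1_p) @ W2_p)
assert np.allclose(y, y_p)                        # identical for every permutation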
The first main contribution of this paper is determining the entropy limits, once the appropriate invariances are recognized. The second is the design of an appropriate "sorting" of synaptic weights that puts them into a canonical order where irrelevant uncertainty due to invariance is removed; a form of arithmetic coding is then used to represent the weights [30], [31]. Note that the coding algorithm essentially achieves the entropy bound. The third main contribution is an efficient inference algorithm that uses the compressed form of the feedforward neural network to calculate its output without completely decoding it, taking only O(N) additional dynamic space for a network with N nodes in the layer with the maximum number of nodes. Finally, the paper provides experimental results of our compression and inference algorithms on feedforward neural networks trained to perform classification tasks on the standard MNIST, IMDB, and Reuters datasets.
A preliminary version of this work only dealt with universal compression and not succinctness [1].

A. Organization
The remainder of the paper is organized as follows. Section II discusses the basic structure and invariant properties of a feedforward neural network (multilayer perceptron), and how it can be decomposed into substructures that we call partially labeled bipartite graphs and unlabeled bipartite graphs. Sections III and IV provide entropy bounds, universal compression algorithms, and inference algorithms that need not require full decompression for partially labeled bipartite graphs and unlabeled bipartite graphs, respectively. Section V provides two different compression algorithms for full neural networks, based on algorithms from Sections III and IV. Section V also provides an efficient inference algorithm for the full network based on the inference algorithm in Section III. Section VI provides experimental results for the compression algorithms and Section VII concludes.

II. FEEDFORWARD NEURAL NETWORK STRUCTURE
Consider a K-layer feedforward neural network with each layer (for notational convenience) having N nodes, such that nodes in the first layer are labeled and all nodes in each of the remaining (K − 1) layers are indistinguishable from each other (when edges are ignored) due to the inferential invariance in (1). Suppose there are m possible colorings of edges (corresponding to synaptic weights), and that the connection from each node in a layer to any given node in the next layer takes color i with probability p_i, i = 0, . . . , m, where p_0 is the probability of no edge. The goal is to universally find an efficient representation of this neural network structure. We will first consider optimal representation for two smaller substructures that form the layers of feedforward neural networks (after recognizing the invariance), and then return to the problem of optimally representing the full network.
Let us define the two aforementioned substructures: partially-labeled bipartite graphs and unlabeled bipartite graphs, see Fig. 1.
Definition 1: A partially-labeled bipartite graph consists of two sets of vertices, U and V. The set U contains N labeled vertices, whereas the set V contains N unlabeled vertices. For any pair of vertices with one vertex from each set, there is a connecting edge of color i with probability p i , i = 0, . . . , m, with p 0 as the probability the two nodes are disconnected. Multiple edges between nodes are not allowed.
Definition 2: An unlabeled bipartite graph is a variation of a partially-labeled bipartite graph where both sets U and V consist of unlabeled vertices. For simplicity in the sequel, we assume there is only a single edge color in unlabeled bipartite graphs and that any two nodes from the two different sets are connected with probability p.
To construct the K-layer neural network from the two substructures, think of it as made of a partially-labeled bipartite graph for the first and last layers and a cascade of K −2 layers of unlabeled bipartite graphs in between. An alternative construction is: the first two layers are still a partially-labeled bipartite graph but then each time the nodes of an unlabeled layer are connected, we treat it as a labeled layer, based on its connection to the previous labeled layer (i.e., we label the unlabeled nodes based on the nodes of the previous layer it is connected to), and iteratively complete the K-layer neural network.

III. PARTIALLY-LABELED BIPARTITE GRAPHS: ENTROPY, REPRESENTATION, INFERENCE
We first compute the entropy bound for representing partially-labeled bipartite graphs, then introduce a universal algorithm for approaching the bound, and finally an inference algorithm that need not fully decompress to operate.

A. Entropy Bound
Consider a matrix representing the edges in a partially-labeled bipartite graph, such that each row represents an unlabeled node from V and each column represents a node from U. A non-zero matrix element i indicates there is an edge of color i between the corresponding two nodes, whereas a 0 indicates they are disconnected. Observe that if the order of the rows of this matrix is permuted (preserving the order of the columns), then the corresponding bipartite graph remains the same. That is, to represent the matrix, the order of rows does not matter. Hence the matrix can be viewed as a multiset of vectors, where each vector corresponds to a row of the matrix. Using these facts, we calculate the entropy of a partially-labeled bipartite graph. To that end, we define the following terms.
Definition 3: Let B(N, p) be a fully-labeled random bipartite graph model in which graphs are randomly generated on two sets of vertices, U and V, having N labeled vertices each. The edges in B(N, p) are chosen independently between any two vertices belonging to different sets with probability p.
Definition 4: Let B p (N, p) be a partially-labeled random bipartite graph model generating graphs in the same way as a random bipartite graph model, except that the vertices in the set V in the generated graphs are unlabeled.
Definition 5: We say a bipartite graph, b, is isomorphic to a partially labeled bipartite graph b p if b p is obtained by removing labels from all vertices in set V of b, keeping all edge connections the same. The set of all bipartite graphs, b, isomorphic to a partially-labeled bipartite graph, b p , is denoted by I(b p ).
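Definition 5 can be checked mechanically: graphs in I(b_p) differ only by a relabeling of V, i.e., by a row permutation of the adjacency matrix described above. The short Python sketch below (an illustration, not the paper's code) maps row-permuted adjacency matrices to one canonical sorted form.

import numpy as np

def canonical_form(A):
    # Canonical representative of a partially-labeled bipartite graph:
    # the rows (unlabeled V-side nodes) are kept only as a sorted multiset.
    return tuple(sorted(map(tuple, A)))

rng = np.random.default_rng(1)
A = rng.integers(0, 3, size=(4, 4))        # colors 0..2, with 0 meaning "no edge"
B = A[rng.permutation(4), :]               # an isomorphic graph: rows relabeled

assert canonical_form(A) == canonical_form(B)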
Definition 6: The set of automorphisms of a graph b ∈ B(N, p), denoted Aut(b), is the set of adjacency-preserving permutations of the vertices of the graph; |Aut(b)| denotes the number of automorphisms of a graph b. Here an adjacency-preserving permutation is simply a relabeling of the nodes that does not change the edges between nodes.
Definition 7: A graph g is called asymmetric if |Aut(g)| = 1; otherwise it is called symmetric.
Our entropy characterizations of random bipartite graphs follow [29] on entropy of random graphs. The notation a ≫ b means b = o(a). We will make use of Lemma 17, stated and proved in the Appendix in the supplementary material, on symmetry of random bipartite graphs.

Theorem 1: For p such that the conditions of Lemma 17 hold, i.e., such that a graph drawn from B(N, p) is symmetric with probability O(N^{-w}) for some constant w > 1, the entropy of a partially-labeled random bipartite graph is

H_{B_p} = N^2 H(p) − log_2(N!) + o(1).

Proof: For a randomly generated bipartite graph b ∈ B(N, p) with k edges, we have

P(b) = p^k (1 − p)^{N^2 − k}.

Considering only the permutations of vertices in the set V, we have a total of N! permutations. Given that each partially-labeled graph b_p corresponds to |I(b_p)| bipartite graphs, and each bipartite graph b ∈ B(N, p) corresponds to |Aut(b)| (which is equal to |Aut(b_p)|) adjacency-preserving permutations of vertices in the graph, from [32], [33] one gets

|I(b_p)| · |Aut(b_p)| = N!.

By definition, the entropy of a random bipartite graph is H_B = Σ_b P(b) log_2(1/P(b)) = N^2 H(p). The entropy of a partially-labeled graph is

H_{B_p} = Σ_{b_p} P(b_p) log_2(1/P(b_p)) = Σ_{b_p} |I(b_p)| P(b) log_2( 1 / (|I(b_p)| P(b)) ),

where b is a bipartite graph isomorphic to the partially labeled graph b_p. Now [34] shows that for all p satisfying the conditions in this theorem, a random graph G(N, p) on N vertices with edges occurring between any two vertices with probability p is symmetric with probability O(N^{−w}) for some positive constant w. Lemma 17 in the Appendix, in the supplementary material, provides a similar result on the symmetry of random bipartite graphs, which will be used to compute the entropy.

Note that |Aut(b_p)| = 1 for asymmetric graphs, so the expression above expands to

H_{B_p} = H_B − log_2(N!) + Σ_{b_p} P(b_p) log_2 |Aut(b_p)|.

Therefore, by Lemma 17 the symmetric graphs contribute at most O(N^{−w} log_2(N!)) to the last sum. Hence, for any constant w > 1,

H_{B_p} = N^2 H(p) − log_2(N!) + O(N^{1−w} log_2 N) = N^2 H(p) − log_2(N!) + o(1).

This completes the proof.

We can also provide an alternate expression for the entropy of partially-labeled graphs with m possible colors that will be amenable to comparison with the rate of a universal coding scheme.
Lemma 1: The entropy of a partially-labeled bipartite graph, with each set containing N nodes and edges colored with m possibilities, is

H_{B_p} = N^2 Σ_{i=0}^{m} p_i log_2(1/p_i) − log_2(N!) + E[ Σ_i log_2(K_i!) ],

where the K_i are non-negative integers that sum to N (the multiplicities defined in the proof below).
Proof: Recall the adjacency matrix of a partially-labeled bipartite graph is nothing but a multiset of vectors. We know
the empirical frequency of all elements of a multiset completely describes it [26]. Each cell of a vector can be filled in (m + 1) ways corresponding to m colors or no connection (color 0), hence there can be (m + 1)^N possible vectors in total. Let us enumerate these (m + 1)^N vectors using a lexicographical order based on the colors of these vectors. Let K_i be the random variable denoting the number of appearances of the ith vector in this order, noting that Σ_i K_i = N. Here, π_i is the probability of occurrence of the ith possible vector. In the ith vector, let the number of edges with color j be n_ij. Then, π_i = Π_{j=0}^{m} p_j^{n_ij}. Since the multiset is completely described by the vector (K_1, . . . , K_{(m+1)^N}), which is multinomially distributed with parameters N and (π_1, . . . , π_{(m+1)^N}), the entropy of the multiset is

H_{B_p} = E[ log_2( 1 / P(K_1, . . . , K_{(m+1)^N}) ) ] = −log_2(N!) + E[ Σ_i log_2(K_i!) ] + E[ Σ_i K_i log_2(1/π_i) ],

where the last term can be grouped by edge-color profiles (n_0, n_1, . . . , n_m), with n_(n_0,n_1,...,n_m) denoting the random variable counting the number of vectors having n_j edges of color j. By linearity of expectation and rearranging terms, we get

E[ Σ_i K_i log_2(1/π_i) ] = N^2 Σ_{j=0}^{m} p_j log_2(1/p_j),

which gives the claimed expression.

Algorithm 1 Compressing a Partially-Labeled Bipartite Graph
1: Set d = 0
2: Initialize an (m + 1)-ary tree of depth N with no node values
3: Set the root node value as IntC(N)
4: while d ≤ N do
5:   for each of the nodes at the current depth do
6:     Form m + 1 child nodes of the current node
7:     Set the node value of the child node of color i as AC(α_i)
8:   end for
9:   Set d = d + 1
10: end while

B. Universal Lossless Compression Algorithm
Let us introduce Algorithm 1, a universal algorithm for compressing a partially-labeled bipartite graph based on arithmetic coding, and analyze its performance. The high-level idea is to parse the bipartite graph into a tree-structured form that only represents the required invariant properties-reminiscent of sorting a sequence into a multiset and only storing the multiplicity of colors rather than their specific order. This is equivalent to representing the vector (k 1 , . . . , k (m+1) N ), which was introduced in the proof of Lemma 1. The tree essentially holds the multiplicity values of the discrete weights of the edges of the bipartite graph, cf. [28]. Once the tree structure is formed (and its properties are represented using an integer code), arithmetic coding is used to store the multiplicity values in the tree. Notations and the overview for Algorithm 1 follow next. Pseudocode for the algorithm itself is given in the Algorithm 1 environment.
1) Notations for Algorithm 1: Let the output of an integer code and an arithmetic code be denoted by IntC(·) and AC(·), respectively. Each node of the (m + 1)-ary tree has a node value that corresponds to stored information about the compressed graph. Let d be the variable in the algorithm that corresponds to the depth of the tree we are currently encoding. Let α represent the node value of the current node being processed. If the current node lies at depth d of the tree, then α_i represents the number of vectors whose (d + 1)th coordinate has color i and whose first d coordinates have the same colors, in the same order, as along the path of ancestor nodes of the child node starting from the root node.
2) Overview of Algorithm 1: We first initialize an (m + 1)-ary tree of depth N with no node values. The (m + 1)-ary structure of the tree corresponds to the m possible colors of the edges and the case with no edge. We encode the number of vectors, N, as the root node value using an integer code and set the variable d to 0. We encode the node values of this tree in a breadth-first manner. For each depth, we process all the nodes and encode the node values of their child nodes.
At any given time, call the node being processed the current node. Suppose the current node is encoded with the number α. We use an arithmetic code to encode the child node of color i with the number α_i. Here, Σ_{i=0}^{m} α_i = α. The arithmetic coding of these nodes is done sequentially, in the natural order obtained while processing the nodes in a breadth-first manner and encoding the child nodes with colors in ascending order.
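To make the tree construction concrete, the following Python sketch (an illustration; the integer code for the root value and the arithmetic coder that would consume the counts are omitted, and the function and variable names are ours) builds the multiplicity tree for a small graph and prints, depth by depth, the child counts (α_0, . . . , α_m) of every non-zero node in breadth-first order.

def multiplicity_tree(rows, m):
    # rows: one tuple per unlabeled node, entries in {0, ..., m} (0 = no edge).
    # Returns, per depth, the child counts (alpha_0, ..., alpha_m) of every
    # non-zero node in breadth-first order; Algorithm 1 would pass these
    # counts to the arithmetic coder.
    N = len(rows[0]) if rows else 0
    groups = [list(rows)]                    # non-zero nodes at the current depth
    levels = []
    for d in range(N):
        counts_at_depth, next_groups = [], []
        for group in groups:                 # a node whose value is len(group)
            buckets = [[] for _ in range(m + 1)]
            for vec in group:
                buckets[vec[d]].append(vec)  # split by the color in coordinate d
            counts_at_depth.append(tuple(len(b) for b in buckets))
            next_groups.extend(b for b in buckets if b)
        levels.append(counts_at_depth)
        groups = next_groups
    return levels

# toy example: N = 3 nodes per side, m = 2 colors
rows = [(1, 0, 2), (0, 0, 1), (1, 0, 2)]
for depth, counts in enumerate(multiplicity_tree(rows, m=2)):
    print(depth, counts)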
Lemma 2: Let L be the number of bits Algorithm 1 takes to represent a partially-labeled bipartite graph; then

E[L] ≤ N^2 Σ_{i=0}^{m} p_i log_2(1/p_i) − log_2(N!) + E[ Σ_i log_2(K_i!) ] + 2.

Proof: We know, for any node encoded with α and with the encodings of its child nodes (α_0, α_1, . . . , α_m), that (α_0, α_1, . . . , α_m) is distributed as a multinomial distribution, M(α_0, α_1, . . . , α_m; α, P). So, using arithmetic coding to encode all the nodes, the expected number of bits required is

E[L] ≤ Σ_{nodes} E[ log_2( 1 / P(α_0, . . . , α_m | α) ) ] + 2,    (2)

where P(α_0, . . . , α_m | α) = (α! / Π_i α_i!) Π_i p_i^{α_i} is the probability of observing child nodes with node values α_i given that the parent node has node value α. Thus, (2) is simply the expected value of the sum of bits needed for encoding each node of the tree obtained from Algorithm 1; this is because it is well known that arithmetic coding takes Σ_{i∈{0,...,m}} p_i log_2(1/p_i) expected bits for encoding a random variable taking values in (0, . . . , m) with probabilities (p_0, . . . , p_m).

Here, the summation is over all non-zero nodes of the (m + 1)-ary tree. Hence (2) simplifies as

E[L] ≤ Σ_{nodes} E[ log_2( Π_i α_i! / α! ) + Σ_i α_i log_2(1/p_i) ] + 2.

When log_2(Π_i α_i!/α!) is summed over all nodes, the sum telescopes: each child's factorial term reappears as that child's own parent term one level deeper, so all terms except those corresponding to the nodes of depths 0 and N + 1 cancel, i.e., the sum equals E[Σ_i log_2(K_i!)] − log_2(N!). For the remaining term, Σ_{nodes} E[Σ_i α_i log_2(1/p_i)] = N^2 Σ_i p_i log_2(1/p_i), since in the adjacency matrix of the graph each cell can have color i with probability p_i, and for each color i the expected number of cells having color i is N^2 p_i. Thus, we find

E[L] ≤ N^2 Σ_{i=0}^{m} p_i log_2(1/p_i) − log_2(N!) + E[ Σ_i log_2(K_i!) ] + 2.

Since we are using an arithmetic coder, it takes at most 2 extra bits [35, Ch. 13.3].

Theorem 2: The expected compressed length generated by Algorithm 1 is within 2 bits of the entropy bound.
Proof: The result follows from Lemma 1 and Lemma 2 by comparing the entropy expression of a partially-labeled random bipartite graph with the expected length when using Algorithm 1.
As per Theorem 2, the space saving from the algorithm is close to the theoretical limit, which depends on N. Hence the entropy bound directly gives us the obtained space saving, which can be as much as N log N bits for large N for partially-labeled bipartite graphs with each layer having N nodes. Since the size of the graph is O(N^2), the fraction of bits saved reduces as N increases. On the other hand, for small values of N, the theoretical limit does not allow us to save around N log N bits. Hence there is a tradeoff between the amount of bits and the fraction of bits saved: for small N, a greater fraction of bits is saved, whereas as N increases, the fraction of bits saved decreases but the amount of bits saved increases.

Algorithm 2 (fragment)
...
while i ≤ m and f > 0 do
10:   Decode the child node of f corresponding to color i as c, where c, L = AD(L)
11:   Encode the value of c back into L_1 using arithmetic coding as L_1 = AC(c, L_1)
12:   Enqueue c in Q
13:   if j equals 0 and at least one non-zero node has been processed at the current depth then
...
16: end while
...
20: end while
21: Update the Y vector using the required activation function

C. Inference Algorithm
Algorithm 1 achieves near-optimal compression, but we also wish to use partially-labeled bipartite graphs as two-layered neural networks without fully decompressing. Algorithm 2 directly uses compressed graphs for inference in two-layered neural networks. The key idea of the algorithm is that arithmetic decoding operates sequentially and that the permutation-invariant weighted summation operation of neural network inference (1) can be done in the same way. Since the encoding was breadth-first, the decoding/inference also accumulates the weighted sum in a breadth-first manner.
Structures that take space essentially equal to the information-theoretic minimum while also supporting various relevant operations on them are called succinct structures [8] as defined next. Algorithm 2 performs inference directly on the compressed representation, so we achieve succinctness.
Definition 8: If L is the information-theoretic minimum number of bits required to store some data, then a structure is succinct if it represents the data in L + o(L) bits, while allowing relevant operations on the compressed data.
Notations for Algorithm 2 and an overview are given below. Pseudocode for the algorithm itself is given in the Algorithm 2 environment.
1) Notations for Algorithm 2: Let us denote the input and output vectors of the neural network as X = [x_0, x_1, . . . , x_{N−1}] and Y = [y_0, y_1, . . . , y_{N−1}], respectively. Let L represent the compressed representation of the partially-labeled bipartite graph from Algorithm 1. The depth of the current node being processed is d, and the number of neurons processed at the current depth is j. Let Q represent a queue of nodes being processed in the algorithm. Let the weight corresponding to color i be w_i. While the compressed representation L of the network is being decoded and used for inference, it is compressed back into another string L_1. The child nodes of any given node are processed in ascending order of their colors; the color of the current child node being processed is tracked using a variable i.
In arithmetic/integer coding, symbols are often compressed and decompressed sequentially. Given a compressed string L that represents a sequence of compressed variables, we use AD(·)/IntD(·) to represent the function that decodes the first symbol from L using arithmetic/integer decoding and also returns the remaining compressed string. Similarly, given a compressed string L that represents a sequence of compressed variables, we use AC(·, L)/IntC(·, L) to represent the function that compresses any given value using arithmetic/integer coding and returns the string obtained by appending the compressed value to the compressed representation of the previous values L.
2) Overview of Algorithm 2: Algorithm 2 is a breadth-first search algorithm, which traverses the compressed tree representation of the two-layered neural network and updates the output of the neural network, say Y, simultaneously. Initialize Y = [0, 0, . . . , 0], d = 0, j = 0, an empty queue Q, and an empty string L 1 . Then, enqueue Q with N, decoded from L using integer decoding. Then, while queue Q is non-empty and the depth d ≤ N − 1, process all the child nodes of the node obtained by popping Q. The child nodes are decoded from L, used for updating Y, and then encoded back to L 1 and enqueued to Q.
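The following Python sketch mirrors this breadth-first update (a simplified illustration: the multiplicity-tree counts are given as an in-memory list, whereas the real algorithm arithmetic-decodes each count from L on the fly and immediately re-encodes it into L_1; the function names are ours and tanh is a stand-in activation).

import numpy as np

def inference_from_tree(levels, x, weights, g=np.tanh):
    # levels[d] lists, in breadth-first order, the child counts
    # (alpha_0, ..., alpha_m) of every non-zero node at depth d.
    # weights[i] is the synaptic weight w_i for color i (weights[0] = 0).
    num_hidden = sum(levels[0][0])            # the root's children sum to N
    y = np.zeros(num_hidden)
    for d, counts_at_depth in enumerate(levels):
        j = 0                                 # position within the canonical order
        for counts in counts_at_depth:        # one decoded node
            for i, c in enumerate(counts):
                if c > 0:
                    y[j:j + c] += x[d] * weights[i]   # add x_d * w_i to y_j .. y_{j+c-1}
                    j += c
    return g(y)

# counts for the toy graph used in the sketch after Algorithm 1's overview
levels = [[(1, 2, 0)], [(1, 0, 0), (2, 0, 0)], [(0, 1, 0), (0, 0, 2)]]
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.0, 1.0, -1.0])                # w_0 = 0 (no edge)
print(inference_from_tree(levels, x, w))      # a permutation of the usual output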
Note that the Y vector from Algorithm 2 is a permutation of the output from the original uncompressed network, Ỹ. Observe that each element of Ỹ has a corresponding vector indicating its connection with the input to the neural network, X; when all these elements are sorted in a decreasing manner based on these connection vectors, it gives Y. This is because Algorithm 2 is designed to give the same Y vector for any ordering in Ỹ.

Proposition 1: Inference output Y obtained from Algorithm 2 is a permutation of Ỹ, the output from the uncompressed neural network representation.
Proof: We need to show the Y obtained from Algorithm 2 is a permutation of the N × 1 output vector Ỹ obtained by directly multiplying the N × N weight matrix W, having weights belonging to the set {w_0, . . . , w_m}, with the N × 1 input vector X and passing the result through the activation function. For Ỹ = W^T X, the pth element is Ỹ_p = Σ_{q=1}^{N} W^T_{p,q} x_q. In Algorithm 2, while traversing a particular depth q, we update all the Y_p's with the contribution x_q W_{q,p}, and so when we reach depth N, we get the Y vector as required. The permutation of Ỹ with respect to Y arises because while compressing W we do not encode the permutation of the columns, only retaining the row permutation.
Note that Ỹ is computed in Algorithm 2 by a different method than traditional matrix multiplication, W^T X. Rather than computing each element Ỹ_p as Σ_{q=1}^{N} W^T_{p,q} x_q, Algorithm 2 updates all components of Ỹ at the same time based on a single index of X, as in the statement "Add x_d × w_i to each of y_j to y_{j+c−1}". That is, given an input vector X, the impact of an index q on all indices of Ỹ is accounted for at the same time by covering all the connections x_q makes with indices of Ỹ. When all indices of X are taken into account, the resulting output Ỹ is the same as from traditional matrix multiplication.
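This column-at-a-time accumulation is just a reordering of the usual sum, as the following small Python check illustrates (illustrative values only).

import numpy as np

# Updating every output with the contribution of one input index q at a
# time yields exactly the same pre-activation vector as W^T x.
rng = np.random.default_rng(2)
N = 4
W = rng.choice([-1.0, 0.0, 1.0], size=(N, N))
x = rng.normal(size=N)

y = np.zeros(N)
for q in range(N):
    y += x[q] * W[q, :]          # impact of input q on all outputs at once

assert np.allclose(y, W.T @ x)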
Definition 9: The dynamic space requirement of an inference algorithm is the total amount of space (in bits) required by the inference algorithm while performing inference from a compressed network.
Additional dynamic space requirement is simply the additional space required beyond the space required for storing the compressed representation of the network, L.
Proposition 2: The additional dynamic space requirement of Algorithm 2 is O(N).
Proof: Notice Algorithm 2 does use some space in addition to the compressed data. The symbols decoded from L are encoded into L 1 , hence, the combined space taken by both of them at any point in time remains almost the same as the space taken by L at the beginning of the algorithm. However, the main dynamic space requirement is from decoding individual nodes, and the queue, Q. Clearly, the space required for Q, storing up to two depths of nodes in the tree, is much more than the space required for decoding a single node.
Let us show the space required for Q is at most 2(m + 1)N(1 + 2 log_2((m + 2)/(m + 1))) bits using Elias-Gamma integer codes (modified to also encode 0) for each entry in Q. Since Q has nodes from at most two consecutive depths, since only the child nodes of non-zero nodes are encoded, and since the number of non-zero nodes at any depth is less than N, we have a maximum of 2(m + 1)N nodes encoded in Q. Let α_0, . . . , α_k be the values stored in the child nodes of non-zero tree nodes at some depth d of the tree, where k ≤ (m + 1)N. If k < (m + 1)N, let α_{k+1}, . . . , α_{(m+1)N} be all zeros. Let S be the total space required to store Q. Using integer codes, we can encode any positive number x in 2⌊log_2(x)⌋ + 1 bits, and to also allow 0, we need 2⌊log_2(x + 1)⌋ + 1 bits [36]. Since the node values at any single depth sum to at most N, the arithmetic-geometric mean inequality (equivalently, concavity of the logarithm) gives

Σ_{i=0}^{(m+1)N} log_2(α_i + 1) ≤ ((m + 1)N + 1) log_2( ((m + 2)N + 1) / ((m + 1)N + 1) ),

and hence S ≤ 2(m + 1)N(1 + 2 log_2((m + 2)/(m + 1))) + O(1) = O(N) bits for the at most two depths held in Q.
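The modified Elias-Gamma code referred to above is easy to state: shift the value by one so that 0 becomes encodable, giving codeword length 2⌊log_2(x + 1)⌋ + 1 for value x. A minimal Python sketch (ours, for illustration):

def elias_gamma_nonneg(x: int) -> str:
    # Elias-Gamma code of x + 1, so that x = 0 is also encodable; the
    # codeword length is 2*floor(log2(x + 1)) + 1 bits, as used in the
    # space bound above.
    n = x + 1
    binary = bin(n)[2:]                      # binary representation of n >= 1
    return "0" * (len(binary) - 1) + binary

for x in range(6):
    cw = elias_gamma_nonneg(x)
    print(x, cw, len(cw))                    # lengths: 1, 3, 3, 5, 5, 5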
Theorem 3: The compressed representation formed in Algorithm 1 is succinct.

Proof: From Proposition 1 and Proposition 2 we know that the additional dynamic space required for Algorithm 2 is O(N), while the entropy of a partially-labeled bipartite graph is O(N^2). Thus, from the definition of succinctness, it follows that the structure is succinct.
Next, we find the time complexity of Algorithm 2.

Proposition 3: The time complexity of Algorithm 2 is O(mN^2).

Proof: At each of the N depths, the algorithm decodes, re-encodes, and applies at most (m + 1) child counts for each of the at most N non-zero nodes at that depth.

IV. UNLABELED BIPARTITE GRAPHS
Next we consider an unlabeled bipartite graph for which we construct the adjacency matrix similar to before, but now the possible entries in each cell will be binary corresponding to whether or not there is an edge. We first compute the entropy bound for representing unlabeled bipartite graphs and then introduce a universal algorithm approaching the bound.

A. Entropy Bound
Although the structure is slightly different from partially-labeled bipartite graphs, it also has interesting invariance properties. Notably, the connectivity pattern is independent of the order of the row vectors and column vectors in this bipartite adjacency matrix.
We say a matrix has undergone a row permutation if the order of the rows of the matrix is changed while keeping the order of cells in each row unchanged. Similarly, a matrix has undergone a column permutation if the order of the columns of the matrix is changed while keeping the order of cells in each column unchanged. We say a matrix has undergone a valid rearrangement if it has undergone a sequence of row and column permutations. Under any valid rearrangement, the unlabeled bipartite graph remains unchanged. Now we define row blocks and column blocks. Let A be the adjacency matrix of a bipartite graph and a_ij be the cell at row i and column j. Suppose a valid rearrangement of A transforms it into a matrix A', moving the cell at row i and column j of A to row k and column l of A'. Then the set of cells in row i of A is the same as the set of cells in row k of A'. We call this set of cells at row i the row block corresponding to the cell a_ij, since this set of cells corresponding to a_ij does not change under any valid rearrangement. Similarly, we call the set of cells at column j the column block corresponding to the cell a_ij.
We show the entropy of an unlabeled random bipartite graph is N^2 H(p) − 2 log_2(N!) + o(1). To that end, we need the following definitions.

Definition 10: Let B_u(N, p) be an unlabeled random bipartite graph model generating graphs in the same way as a random bipartite graph model, except that the vertices in both the sets, U and V, are unlabeled, while the sets U and V themselves remain labeled, i.e., two sets of unlabeled vertices having the same edge connections as those of a random bipartite graph.

Theorem 4: For p such that the conditions of Lemma 18 hold, the entropy of an unlabeled random bipartite graph is

H_{B_u} = N^2 H(p) − 2 log_2(N!) + o(1).

Proof: Considering the permutations of vertices within the sets V and U themselves, we have a total of (N!)^2 permutations. Here I(b_u) denotes the set of all bipartite graphs b isomorphic to the unlabeled bipartite graph b_u, i.e., those that yield b_u when the labels of the vertices in both sets are removed while keeping all edge connections the same. Given that each unlabeled graph b_u corresponds to |I(b_u)| bipartite graphs, and each bipartite graph b ∈ B(N, p) corresponds to |Aut(b)| (which is equal to |Aut(b_u)|) adjacency-preserving permutations of vertices in the graph, from [32], [33] we get

|I(b_u)| · |Aut(b_u)| = (N!)^2.

We also know the entropy of a random bipartite graph, H_B, is N^2 H(p). The entropy of an unlabeled graph is

H_{B_u} = Σ_{b_u} P(b_u) log_2(1/P(b_u)) = H_B − 2 log_2(N!) + Σ_{b_u} P(b_u) log_2 |Aut(b_u)|.

Let us now use Lemma 18 in the Appendix, in the supplementary material, on symmetry of random bipartite graphs to compute this entropy.

Note that |Aut(b_u)| = 1 for asymmetric graphs, so by Lemma 18 the symmetric graphs contribute at most O(N^{−w} log_2((N!)^2)) to the last sum. Further, note that H_B = N^2 H(p). Hence, for any constant w > 1,

H_{B_u} = N^2 H(p) − 2 log_2(N!) + O(N^{1−w} log_2 N) = N^2 H(p) − 2 log_2(N!) + o(1).

B. Universal Lossless Compression Algorithm
In this subsection, we provide a lossless compression algorithm for unlabeled bipartite graphs that is optimal up to the second-order term. Algorithm 3 takes the adjacency matrix of an unlabeled bipartite graph as input and outputs two tree structures which are invariant to any valid rearrangement of the graph. We provide an overview of Algorithm 3 next. It is based on the same principles as Algorithm 1, but uses two trees, needed for row and column invariance, rather than the single tree that essentially appeared in [29].
1) Overview of Algorithm 3: Unlike Algorithm 1, here neither the rows nor the columns of the adjacency matrix have labels. In Algorithm 1, the depth of the tree corresponded to the labeled inputs, whereas the node values of the tree captured the invariance in the output nodes. To ensure invariance in both the input and the output nodes of the weight matrix, we use two trees t_1 and t_2, and we encode the matrix by growing the depths of these trees in an alternating fashion such that both invariances are captured. In particular, t_1 captures the invariance in the columns, whereas t_2 captures the invariance in the rows. Detailed descriptions of these tree constructions are provided in the pseudocode of the Algorithm 3 environment.
Once constructed, the trees are compressed as follows: we perform a breadth-first search on each of the trees, the child nodes of a node with value, say N_x, are first stored using ⌈log_2(N_x + 1)⌉ bits, and then the bit-stream produced after the completion of the breadth-first search is compressed using an arithmetic encoder. Note that the binomial distribution is used for arithmetic coding, with p as the probability of existence of an edge between any two nodes of the bipartite graph and q = 1 − p.

Algorithm 3 (fragment)
...
4: Divide every non-empty leaf node at the current depth of tree t_1 into two child nodes. The left child denotes the number of 1-cells that are unmarked in the column block containing the parent cell; similarly, the right child denotes the remaining 0-cells that are unmarked.
5: Mark all unmarked cells in the column block containing the parent cell.
6: Remove an element from the leftmost node of the tree t_2.
7: Choose any cell from the newly formed leftmost child of the tree t_1 as the parent cell.
8: Divide all the leaf nodes at the current depth of the tree t_2 into two child nodes. The left child denotes the number of unmarked 1-cells in the row block containing the parent cell; similarly, the right child denotes the remaining 0-cells that are unmarked.
9: Choose any cell from the newly formed leftmost child of the tree t_2 as the parent cell.
10: Mark all the unmarked cells in the row block containing the parent cell.
11: Remove an element from the leftmost node of the tree t_1.
12: Increase the depth of t_1 and t_2 by 1.
13: end while
The structure of the trees formed in Algorithm 3 is the same as in [29] except there are two trees in our algorithm and the first tree does not lose an element from the root node on its first division. Let us now define a tree structure which will be useful for performance analysis of the algorithm.
Definition 12: Let T_{n,d,p} be a class of random binary trees such that any tree T_{n,d,p} ∈ T_{n,d,p} has depth (n − 1) and is generated in the following way: 1) The root node is assigned the value n and placed at depth 0. 2) If d = 0, skip this step. Otherwise, if d > 0, then starting from depth t = 0 to t = d − 1, divide each of the nodes with non-zero values at the current depth into two child nodes such that the sum of the values assigned to the child nodes is equal to that of the parent node (say N), and the left child node has value N_1 distributed as a binomial distribution, N_1 ∼ Binomial(N, p). 3) Starting from depth t = d to t = n − 2, subtract 1 from the value of the leftmost node with a non-zero value and divide each of the non-zero nodes at the current depth into two child nodes in the same way as in the previous step, using the updated node values after subtraction. That is, the sum of the values of the child nodes is equal to the updated value of the parent node, and the left child node has its value assigned using the binomial distribution. We write T_{n,d,p} as T_{n,d} when p is clear from context, and we use the notations T_{n,0} and T_n interchangeably.
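A small simulation of Definition 12 can help build intuition; the following Python sketch (ours, with the subtraction step interpreted as removing a single element, matching the removal steps of Algorithm 3) samples the node values of a tree from T_{n,d,p} level by level.

import numpy as np

def sample_T(n, d, p, rng=None):
    # Sample node values of a tree from the class T_{n,d,p} of Definition 12.
    # Returns a list of levels; levels[t] holds the node values at depth t.
    rng = rng or np.random.default_rng()
    levels = [[n]]
    for t in range(n - 1):                   # depths 0 .. n-2 are divided
        current = list(levels[-1])
        if t >= d:                           # remove one element first
            for idx, v in enumerate(current):
                if v > 0:
                    current[idx] = v - 1
                    break
        nxt = []
        for v in current:
            if v == 0:
                continue                     # zero-valued nodes are not divided
            left = rng.binomial(v, p)        # left child ~ Binomial(v, p)
            nxt.extend([left, v - left])
        levels.append(nxt)
    return levels

print(sample_T(n=5, d=1, p=0.4, rng=np.random.default_rng(3)))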
Let N_x be the number of elements in some node x of either of the trees formed in Algorithm 3, say T, where T can be t_1 or t_2 formed in the algorithm. Then the total number of bits required to encode the tree before using arithmetic coding is Σ_{x ∈ T, N_x ≥ 1} ⌈log_2(N_x + 1)⌉. Define L_1 = Σ_{x ∈ T, N_x > 1} ⌈log_2(N_x + 1)⌉ and L_2 = Σ_{x ∈ T, N_x = 1} ⌈log_2(N_x + 1)⌉, and let L̃_1 and L̃_2 be the lengths of the bit-streams corresponding to L_1 and L_2, respectively, after arithmetic coding. So, the total expected bit length is E[L_1] + E[L_2] before using arithmetic coding, and E[L̃_1] + E[L̃_2] after using arithmetic coding. Now define

a_{n,d} = E[ Σ_{x ∈ T_{n,d}, N_x > 1} ⌈log_2(N_x + 1)⌉ ]  and  b_{n,d} = E[ Σ_{x ∈ T_{n,d}, N_x = 1} ⌈log_2(N_x + 1)⌉ ].

Now we bound the compression performance of Algorithm 3. The proof is based on a theorem for compression of graphical structures [29]; before stating and proving our result, we recall two lemmas from there.
Lemma 3: For all integers n ≥ 0 and d ≥ 0, a_{n,d} ≤ x_n, where x_n satisfies x_0 = x_1 = 0 and, for n ≥ 2,

x_n = ⌈log_2(n + 1)⌉ + Σ_{k=0}^{n} C(n, k) p^k q^{n−k} (x_k + x_{n−k}).
Lemma 4: For all n ≥ 0 and d ≥ 0, b_{n,d} ≥ y_n, where y_n satisfies y_0 = 0 and, for n ≥ 0,

y_{n+1} = n + Σ_{k=0}^{n} C(n, k) p^k q^{n−k} (y_k + y_{n−k}).
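These recursions are easy to evaluate numerically; the short Python utility below (ours, for illustration) computes x_n and y_n for a given p. Note that x_n appears on both sides of its recursion (through the k = 0 and k = n terms), so the code solves for it explicitly.

import math

def xy_recursions(n_max, p):
    # Evaluate the recursions of Lemmas 3 and 4 up to n_max.
    q = 1.0 - p
    x = [0.0] * (n_max + 1)
    y = [0.0] * (n_max + 1)
    for n in range(2, n_max + 1):
        pmf = [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
        rhs = math.ceil(math.log2(n + 1)) + sum(
            pmf[k] * (x[k] + x[n - k]) for k in range(1, n))
        x[n] = rhs / (1.0 - pmf[0] - pmf[n])     # solve for x_n
    for n in range(n_max):
        pmf = [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
        y[n + 1] = n + sum(pmf[k] * (y[k] + y[n - k]) for k in range(n + 1))
    return x, y

x, y = xy_recursions(20, p=0.3)
print(x[20], y[20])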
Theorem 5: Let L be the number of bits Algorithm 3 takes to represent an unlabeled bipartite graph; then

E[L] ≤ N^2 H(p) − 2N log_2(N) + 2(c + Φ(log_2(N + 1)))(N + 1) + o(N),

where c is an explicitly computable constant and Φ(log_2(N + 1)) is a fluctuating function with a small amplitude, independent of N.
Proof: We need to find the expected value of the sum of all the encoding lengths over all nodes of both trees. The expected encoding length of each tree can be upper-bounded by an expression from [29]; let us formally show that both encodings are bounded by this expression. Let E[L_{t_1}] and E[L_{t_2}] be the expected numbers of bits required to represent trees t_1 and t_2 before arithmetic coding, and E[L̃_{t_1}] and E[L̃_{t_2}] the corresponding numbers after arithmetic coding. One of the trees belongs to the class T_{N,0} and the other to T_{N,1}, so in the notation above

E[L_{t_1}] + E[L_{t_2}] ≤ a_{N,0} + b_{N,0} + a_{N,1} + b_{N,1}.

Using Lemma 3 and Lemma 4, which hold for any d ≥ 0, together with the bounds on x_N and y_N from [29], arithmetic coding of the two bit-streams yields

E[L] = E[L̃_{t_1}] + E[L̃_{t_2}] ≤ N^2 H(p) − 2N log_2(N) + 2(c + Φ(log_2(N + 1)))(N + 1) + o(N),

where c is an explicitly computable constant and Φ(log_2(N + 1)) is a fluctuating function with a small amplitude, independent of N. This completes the proof.
Using Algorithm 3 for unlabeled bipartite graphs, we save roughly N log 2 N bits compared to compressing partiallylabeled bipartite graphs using Algorithm 1.

V. DEEP NEURAL NETWORKS
Let us return to the K-layer neural network model from Section II. First we extend the algorithm for unlabeled bipartite graphs to K-layered unlabeled graphs, and then store the permutation of the first and last layers. This yields an efficient compression algorithm for a K-layered neural network, saving around (K −2)×N log 2 N bits compared to standard arithmetic coding of weight matrices.
Algorithm 4 takes the feedforward neural network in the form of its weight matrices as input and outputs K tree structures that are invariant to any valid rearrangement of the weight matrices. These trees are then compressed in a similar manner to the unlabeled bipartite graphs of Section IV. We first perform a breadth-first search on each of the trees, the child nodes of a node with value, say N_x, are stored using ⌈log_2(N_x + 1)⌉ bits, and then the bit-stream produced after the completion of the breadth-first search is compressed using an arithmetic encoder. The binomial distribution is used for arithmetic coding, with p as the probability of existence of an edge between any two nodes in adjacent layers and q = 1 − p.

A. Universal Lossless Compression Algorithm Using Unlabeled Bipartite Graphs
Let Γ(j) denote the set of indices of trees corresponding to the layers neighboring the jth layer of the neural network.

Algorithm 4 Compressing a K-Layer Unlabeled Graph
1: Form root nodes of K binary trees t_1, t_2, . . . , t_K corresponding to the K layers of the neural network, and store N in the root node of each tree, corresponding to the N neural network nodes in each of the layers.
2: Initialize the iteration number, i = 1, and the layer number, j = 1.
3: while depth i ≤ N do
4:   while layer j ≤ K do
5:     Selection: Select a node of the neural network from layer j that corresponds to one of the neural network nodes in the leftmost non-zero node of t_j, and subtract 1 from the leftmost non-zero node of t_j.
6:     Division: Divide every non-empty leaf node of the trees t_k for k ∈ Γ(j) into two child nodes based on the connections of the neural network nodes corresponding to the leaf nodes with the node selected in the previous step. The left child denotes the number of neural network nodes not connected to the selected node; similarly, the right child denotes the neural network nodes connected to the selected node.
...

Theorem 6: Let L be the number of bits Algorithm 4 takes to represent a K-layer unlabeled graph, together with the stored permutations of the input and output layers. Then

E[L] ≤ (K − 1)N^2 H(p) − (K − 2)N log_2(N) + K(c + Φ(log_2(N + 1)))(N + 1) + o(KN),

where c is an explicitly computable constant and Φ(log_2(N + 1)) is a fluctuating function with a small amplitude, independent of N.

Proof: The encoding of Algorithm 4 is similar to the encoding of Algorithm 3. For all trees, the child nodes of any node with non-zero value N_x are stored using ⌈log_2(N_x + 1)⌉ bits. Let the number of bits required to encode the jth layer be L_j. These bits are further compressed using an arithmetic coder, which gives us, say, L̃_j bits for the jth layer. Observe that the trees for the first and Kth layers belong to T_{N,0} and T_{N,1}, respectively. Hence, based on the results of the previous sections,

E[L̃_1] + E[L̃_K] ≤ N^2 H(p) − 2N log_2(N) + 2(c + Φ(log_2(N + 1)))(N + 1) + o(N).

But the binary trees formed for layers 2 to K − 1 are different. Instead of a subtraction from the leftmost non-zero node at each division after the first d divisions, as in a T_{n,d} type of tree, in these trees, which we call T^2_{n,d} trees, subtraction takes place in every alternate division after the first d divisions. We follow the same procedure for compression of t_2 to t_{K−1} as for t_1 and t_K, i.e., we encode the child nodes of a node with value N_x with ⌈log_2(N_x + 1)⌉ bits followed by arithmetic coding. Now define

a^2_{n,d} = E[ Σ_{x ∈ T^2_{n,d}, N_x > 1} ⌈log_2(N_x + 1)⌉ ]  and  b^2_{n,d} = E[ Σ_{x ∈ T^2_{n,d}, N_x = 1} ⌈log_2(N_x + 1)⌉ ].

We show that a^2_{n,d} ≤ x_n and b^2_{n,d} ≥ y_n − n/2, for x_n and y_n as defined in Lemma 3 and Lemma 4, respectively. These claims are stated and proved as Lemma 19 and Lemma 20 in the Appendix, in the supplementary material.

Returning to the proof, since the trees t_i for i ∈ {2, . . . , K − 1} are all of the same type, we have the same expected coded length for each of them. Let the expected encoding length for a tree t_i, i ∈ {2, . . . , K − 1}, before using arithmetic coding be E[L_i], and that after using arithmetic coding be E[L̃_i]. Using the bounds in Lemma 19 and Lemma 20, in the Appendix, in the supplementary material, we know from [29] that

E[L̃_i] ≤ N^2 H(p) − N log_2(N) + (c + Φ(log_2(N + 1)))(N + 1) + o(N),

where c is an explicitly computable constant and Φ(log_2(N + 1)) is a fluctuating function with a small amplitude, independent of N. Further, since we need to store the permutation of the input and output layers, we need to store another 2N log_2(N) bits. Summing the contributions of all K trees and adding this permutation cost completes the proof.

B. Universal Lossless Compression Algorithm Using Partially-Labeled Bipartite Graphs
Now consider an alternative method to compress a deep neural network, using Algorithm 1 iteratively to achieve efficient compression.
Theorem 7: Let L be the number of bits required to represent a K-layer neural network model through iterative use of Algorithm 1. Then

E[L] ≤ (K − 1)[ N^2 Σ_{i=0}^{m} p_i log_2(1/p_i) − log_2(N!) + E[ Σ_i log_2(K_i!) ] + c ] + log_2(N!),

where c is the additional number of bits that an arithmetic coder takes to start and finish encoding.

Proof: If we focus only on the first two layers of the neural network model, then by Lemma 1 they can be compressed in less than

N^2 Σ_{i=0}^{m} p_i log_2(1/p_i) − log_2(N!) + E[ Σ_i log_2(K_i!) ] + c

bits. Once the first two layers are encoded, one can label the nodes of the second layer based on the relationship of its connectivity with the nodes of the first layer, and treat the second layer as a labeled layer. The third layer is unlabeled, and hence Algorithm 1 can be used again to compress the second and third layers using less than the same number of bits. This can be repeated until all layers are encoded. Further, we must store the permutation of the outer layer of the neural network, which takes an additional log_2(N!) bits. Hence, iteratively encoding the K layers gives the bound above, where c is the additional number of bits that an arithmetic coder takes to start and finish encoding.

We gave two different compression algorithms for feedforward neural networks. The algorithm based on partially-labeled graphs appears to be less efficient than the one based on unlabeled bipartite graphs since, after removing invariances from each layer, it treats the hidden layer as a labeled layer for compressing the next hidden layer, introducing some redundancy. However, both algorithms are asymptotically optimal up to the second-order term. Further, the algorithm based on the partially-labeled graph is easier to implement and also enables easy updates to the compressed structure. Hence, in the next subsection, we provide an inference algorithm that makes use of the compressed representation of a feedforward neural network generated using the iterative algorithm introduced in this subsection.

C. Inference Algorithm
Inference for a K-layered neural network is just an extension of Algorithm 2. In particular, the output of Algorithm 2 becomes the input for the next layers. However, one important point to consider in compression, so as to ensure the inference algorithm of the K-layered neural network still works, is to appropriately rearrange the weight matrices. Note that Algorithm 2 outputs the Y in a specific pattern, i.e., the output Y is sorted based on the connections of output nodes with the input nodes. Thus for the algorithm to work, we need to sort the weight matrix corresponding to the next layer accordingly before compressing. Also, note that the last weight matrix connecting to the output layer need not be compressed so as to preserve the ordering of the output layer nodes.
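A minimal Python sketch of the rearrangement step described above follows (ours, for illustration): it reorders the rows of the next weight matrix to match the canonical order in which the previous layer's outputs are emitted. Here that order is assumed to be the lexicographic order of the columns of the previous weight matrix (column j holding the connection colors of hidden node j); the exact ordering and tie-breaking must match the encoder's tree traversal.

import numpy as np

def rearrange_next_layer(W_prev, W_next):
    # Reorder the rows of W_next so that row j corresponds to the hidden node
    # that Algorithm 2 emits in position j (canonical order of W_prev's columns).
    order = sorted(range(W_prev.shape[1]), key=lambda j: tuple(W_prev[:, j]))
    return W_next[order, :]

W1 = np.array([[2, 1, 0],
               [0, 1, 2],
               [1, 0, 1]])                  # input -> hidden edge colors
W2 = np.array([[1, 0, 2],
               [0, 2, 1],
               [2, 1, 0]])                  # hidden -> next-layer edge colors
print(rearrange_next_layer(W1, W2))         # rows of W2 in canonical hidden order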
Theorem 8: The compressed structure obtained by the iterative use of Algorithm 1 is succinct.
Proof: Since each layer is computed one at a time during inference and the extra space required during the inference task of a 2-layered neural network is stored only temporarily, the extra dynamic space requirement for a K-layered neural network remains the same as for the 2-layered neural network described in Algorithm 2. Hence, the compressed representation for the K-layered neural network is succinct.
Next we provide the time complexity for inference using Algorithm 2 iteratively and compare it with inference on an uncompressed neural network.
Proposition 4: The time complexity of Algorithm 2 used iteratively on a K-layered neural network for inference is O(mKN 2 ). The time complexity for inference on an uncompressed neural network is O(KN 2 ).
Proof: From Proposition 3, we already know the time complexity of Algorithm 2 is O(mN 2 ). Clearly, iteratively using Algorithm 2 K times takes O(mKN 2 ) time. Further, each layer of an uncompressed neural network requires O(N 2 ) computation due to matrix multiplication of a vector of size 1 × N with a weight matrix of size N × N. Hence, K such layers take O(KN 2 ) time.

VI. EXPERIMENTS
To validate and assess our neural network compression scheme, we trained feedforward neural networks using stochastic gradient descent on three datasets, and quantized them using different quantization schemes before applying our lossless compression scheme. The three datasets are MNIST, IMDB, and Reuters. The weight matrices from the second to the last layer were rearranged based on the weight matrices corresponding to the previous layers, as needed for Algorithm 2. These matrices, except the last matrix connected to the output, were compressed using Algorithm 1 to get the compressed network, and arithmetic coding was implemented by modifying an existing implementation. The compressed network performed exactly as the original quantized network (as it should have) since our compression is lossless. We observe that the extra memory required for inference is negligible when compared to the size of the compressed network. Detailed results from the experiments and dynamic space requirements are described in Tab. II, Tab. III, and Tab. IV for the MNIST, IMDB, and Reuters datasets, respectively, where H(p) is the empirical entropy calculated from the weight matrices. In these tables, the term MNH(p) − N log_2 N (with M × N the dimensions of a weight matrix) represents an approximation to the theoretical bounds in Thms. 6 and 7, since computing the exact bounds is difficult. The parameters "Avg. queue length" and "Max. queue length" represent the average and maximum dynamic space requirements for Algorithm 2, respectively. The fact that these two parameters have small values compared to the size of the network implies that inference without full decompression of the network takes marginal additional dynamic space.
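As a rough sense of what these table entries measure, the following Python snippet (illustrative only; random ternary weights, not a trained network) computes the empirical entropy H(p) of a quantized weight matrix and the corresponding MNH(p) − N log_2 N approximation.

import numpy as np

def empirical_entropy_bits(W):
    # Empirical entropy H(p), in bits per weight, of a quantized weight matrix.
    _, counts = np.unique(W, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

rng = np.random.default_rng(4)
M, N = 784, 128                                    # an MNIST-sized layer (assumed)
W = rng.choice([-1, 0, 1], size=(M, N), p=[0.2, 0.6, 0.2])

H = empirical_entropy_bits(W)
naive_bits = M * N * H                             # plain arithmetic coding
approx_bound = M * N * H - N * np.log2(N)          # approximation used in the tables
print(f"H(p) = {H:.3f} bits/weight, naive = {naive_bits:.0f} bits, bound ~ {approx_bound:.0f} bits")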
Tab. V and Tab. VI measure the time needed for Algorithm 2. Tab. V gives a comparison between the time taken for inference using compressed and uncompressed neural networks. The experiments were run using a naive Python implementation on a system with 12 GB RAM and an Intel Xeon CPU @ 2.20 GHz. Note that in Tab. V and VI, the neural networks are named after the data they were trained on and their quantization levels for conciseness, and that the number of parameters is the number of weights in a network. Tab. VI provides the distribution of time taken by different components of Algorithm 2. In particular, in Tab. VI, '% pmf computation' and '% arithmetic decoding + re-encoding' denote the percentage of time taken for computation of the pmf for the arithmetic coder, and for decoding and re-encoding, respectively. Results show that the time taken for inference using compressed networks is considerably higher than for the corresponding uncompressed neural networks, but seemingly not impractical on an absolute scale. We further investigate the time taken by different components of Algorithm 2 in Tab. VI. Observe that roughly 90% of the time taken in Algorithm 2 is due to arithmetic encoding/decoding and pmf computation. Arithmetic coding is an essential component of our inference algorithm and so computational performance is also governed by efficient implementations of arithmetic coding. Efficient high-throughput implementations of arithmetic coding/decoding have been developed for video, e.g., as part of the H.264/AVC and HEVC standards [2], [3]. Such efficient implementations would likely improve the time required for our algorithms considerably.

VII. CONCLUSION
Information stored in memory and used for computation is often no longer of conventional type such as sequential texts or images, but rather includes structural data such as artificial neural networks, connectomes, phylogenetic trees, and social networks [29], [40]. Moreover there is growing interest in using neural network models for on-device intelligence and for scaling cloud-based intelligence, but high-performing deep neural networks are very large in size. To ameliorate this storage bottleneck, we have developed lossless compression algorithms for feedforward deep neural networks that make use of their particular structural invariances in inference and can act as a final stage for other lossy techniques [21].
Given there may be limited prior knowledge on the statistics of synaptic weights and structure, our compression schemes are universal and yet asymptotically achieve novel entropy bounds. Further, we show the proposed compressed representations are succinct and can be used for inference without complete decompression. These compression algorithms can also be directly used in the fully-connected layers of other variants of neural networks, such as convolutional neural networks or recurrent neural networks.
In future work, we plan to investigate optimal quantization of real-valued synaptic weights using ideas from functional quantization [41], but taking into account our novel form of entropy coding.