Visualizing bivariate long-tailed data

: Variables in large data sets in biology or e-commerce often have a head, made up of very frequent values and a long tail of ever rarer values. Models such as the Zipf or Zipf–Mandelbrot provide a good description. The problem we address here is the visualization of two such long-tailed variables, as one might see in a bivariate Zipf context. We introduce a cop- ula plot to display the joint behavior of such variables. The plot uses an empirical ordering of the data; we prove that this ordering is asymptotically accurate in a Zipf–Mandelbrot–Poisson model. We often see an association between entities at the head of one variable with those from the tail of the other. We present two generative models (saturation and bipartite preferential attachment) that show such qualitative behavior and we characterize the power law behavior of the marginal distributions in these models.


Introduction
It is increasingly common to see data sets in which two or more categorical variables each have a long-tailed distribution, of which the Zipf distribution is the best known example. In the Netflix data, some movies were much more popular than others, and some users were much more active than others. In large social or biological networks, both the in-degree of nodes and the outdegree of nodes may be long-tailed. For electronic commerce there can be many more than two such variables. For example, an online bookstore may keep track of customers' IP addresses, books' ISBNs, credit card numbers, query strings and book review identifiers.
In commercial settings, there is considerable interest in prediction, as exemplified by the Netflix prize (Bennett and Lanning, 2007), which centered on predicting the ratings that a user would give a movie. The prize was eventually won by a method that combined hundreds of more basic prediction rules.
Ensemble methods and other black box predictors have performance that is hard to beat, but they do not easily supply qualitative insights about the phenomena being modeled. Qualitative insights are useful for understanding the mechanisms underlying the data, especially when one contemplates changing the mechanism.
Our goal in this paper is to develop some methods for exploring multiple long-tailed variables. With ordinary pairs of variables one can easily form a scatterplot to explore their joint behavior. For a single long-tailed variable, a Lorenz curve shows, for example the fraction of wealth held by the poorest α% of the population, as a function of α. Our plot shows a bivariate display of this type.
Earlier work on assortative and disassortative mixing by degree (Newman, 2003) focused on a correlation coefficient. Social networks were predominantly positively assortative by degree while diverse kinds of biological network were negatively assortative. Positive association between highly connected elements is also called the rich-club ordering (Colizza et al., 2006).
In this paper, we look at graphical displays of the entire joint distribution. We see several different patterns, that can then be interpreted in terms of the original data. A given correlation coefficient could be consistent with many different data patterns. The patterns we see are often concentrated at the extreme ranges, head or tail, of the data. We also see some clusters at unexpected locations in the middle of the data ranges.
A famous example of ratings data comes from the Netflix prize (Bennett and Lanning, 2007) with just over 100 million movie ratings made by 480,189 customers on 17,770 movies. Inspecting the Netflix data it becomes clear that the busy raters tend to rate the less popular movies and that the popular movies tend to attract the less active raters. A similar phenomenon happens in other data sets. But the strength and nature of these affinities differ from data set to data set. Furthermore the affinities don't have to be symmetric: popular movies are less strongly associated with rare raters than busy raters are with unpopular movies.
An outline of the paper is as follows. Section 2 presents a gray scale copula display to show the joint distribution of two long-tailed quantities. The displays show the shape, size and sometimes surprising location of the affinities. Section 3 shows examples of the copula display on some large data sets typical of ecommerce applications.
In making the copula displays we have implicitly assumed that sorting entities by their observed size in the data set puts them into the correct order that we would see in an infinite sample. Section 4 gives results showing that most of the data are placed near where they should be, for large sample sizes.
The ratings data we looked at typically showed head-to-tail affinities. Headto-tail affinities for raters and rated items can be explained in terms of experienced raters having more varied and sophisticated tastes than beginners. Section 5 proposes two mechanical baseline models that provide alternative explanations. One is a saturation model in which raters and items are independently sampled but subject to a limit such that no pair is counted more than once. For the bipartite case, saturation creates a head-to-tail affinity but we find that it generates unusual marginal distributions. A second model invokes bipartite preferential attachment. This model provides reasonable marginal distributions and we find head-to-tail affinities. Our conclusions are in Section 6 along with a discussion of additional similar plots that one could make. Theorem proofs are in an Appendix. m j=1 X ij and X •j = n i=1 X ij . We assume that the row entities have been sorted so that X 1• ≥ X 2• ≥ · · · ≥ X n• and similarly X •j ≥ X •j+1 . It is convenient to refer to large and small entities, where the size of an entity is simply its marginal sum. Our display is a gray level plot in the unit square [0, 1] 2 . The row entities are on the horizontal axis arranged from smallest at 0 to largest at 1. The amount of horizontal space given to row entity i is proportional to X i• . The column entities are similarly arranged on the vertical axis, again with smallest at 0, largest at 1 and with length proportional to X •j . If one then selects a data point X ij uniformly from those in the sample, it corresponds to a point in [0, 1] 2 with uniformly distributed horizontal and vertical coordinates.
For our plot we split the unit square into rectangles and shade them in gray. The gray level of a rectangle is proportional to the sum of the observed X ij values in it, divided by the area of that rectangle. We call this a copula plot because the gray level is proportional to a bivariate density with uniform margins.
With this convention, a dark upper right corner means that the large row entities are more strongly associated with the large column entities than they would be under independence, while a light upper right corner means that the heads of the two distributions avoid each other. Similar interpretations apply to the other three corners. Table 1 shows a small example. It has seven nonzero numbers totalling 10. The data include four different row entities I, II, III and IV. There are three different column entities A, B and C. The first row indicates that the combination A-IV was observed one time. Table 1 This table shows a small example with 7 nonzero counts summing to 10. One variable takes values A, B, C, · · · while the other takes values I, II, III, · · · . The copula plot for this data is shown in Figure 1 Column variable Row variable X ij   A  IV  1  B  III  1  B  IV  2  C  I  1  C  II  2  C  III  2  C  IV  1 Visualizing bivariate long-tailed data  The row entities, sorted from least to most frequent are I, II, III and IV with relative frequencies 0.1, 0.2, 0.3 and 0.4. The column entities in increasing order are A, B and C with relative frequencies 0.1, 0.3 and 0.6.
The raw counts and marginal totals are shown in the left panel of Figure 1. Each entity is given horizontal or vertical space proportional to its frequency. As a result the upper right corner, C-IV, has the largest rectangle. That rectangle has area 0.6×0.4 or 24% of the unit square. But it only has 1 of 10 counts or 10% of the data. Therefore we plot it with a gray level proportional to 10/24 . = 0.42. The combination A-IV was also observed 1 time in 10 but it has smaller area 0.1 × 0.4 = .04 and so it is plotted with a gray level proportional to 0.1/0.04 = 2.5. The image was scaled so that the gray scale from 0 to 1 corresponds to a ratio from 0 to the maximum observed, in this case 2.5.
For large data sets it is not effective to plot one row of rectangles for each column entity and vice versa. Instead we aggregate the data to reduce the display to a manageable number of cells. In our large examples, we have aggregated so that there are 100 cells along each dimension. Sometimes there are many small entities with the same marginal total, such as 2 total links, that in aggregate comprise a large proportion of the image. In such cases, our plot does not split the rectangle corresponding to such a popular level.

Copulas and discrepancies
By construction, the data being plotted are nonnegative, and have line integrals equal to 1 if one component of [0, 1] 2 is fixed and the other is the variable of integration. Therefore the gray level plot depicts a bivariate copula density (Nelsen, 2006).
The gray level plot allows us to compare the observed copula to a uniform one. A uniform copula would be uniformly gray. Darker and lighter gray indicate regions with higher (respectively lower) joint density than they would have under uniformity. Metrics comparing distributions to uniformity on a hypercube are called discrepancies. See Niederreiter (1992) for a discussion of discrepancies in quasi-Monte Carlo sampling. Our data display reveals a local discrepancy, with near black pixels showing positive discrepancy, and near white ones showing a negative discrepancy.
The random variables whose copula we plot require a few remarks. They are ordered categorical variables derived from the original categories by sorting according to relative abundance. For example, movies are an unordered categorical variable. They can be sorted into an ordered categorical variable based on the number of ratings that they got. In practice, we sort them on their sample popularity, which may deviate from their popularity in the process from which they are sampled.

Maslov and Sneppen's display
Our display superficially resembles one used by Maslov and Sneppen (2002) to show patterns in protein interaction networks. They plot a normalized empirical probability that two proteins of given degree are connected. For a bipartite graph joining prey proteins to bait proteins, the probability is the fraction of all edges that connect a prey protein with degree K 0 to a bait protein of degree K 1 . The normalization is the same probability, for a network constructed with random edges where all nodes retain their original degrees.
Whereas their normalization is based on randomly rewiring the edges in the graph, ours is based on a deterministic respacing of the axes. The Netflix data set has about 100,000,000 edges and the Yahoo! song ratings data set is even larger. For such large data sets, it is computationally burdensome to randomly resample the graph, but quite simple to rescale the coordinate axes.
We find it interesting that Maslov and Sneppen (2002) procedure is not equivalent to ours. To show that they are different, it is enough to use a very tiny example. Consider data of the form: A1, B2, C1, C2, C3, C4, C5, C6, as illustrated by the graph in Figure 2.
Each step in their random graph algorithm selects two edges, say uv and xy. Then if uy and xv are not edges in the graph, the selected edges are removed and replaced by uy and xv.
Consider the tiny graph of Figure 2. Our measure takes the value 0 for connections between the letters of lowest degree (A and B) and numbers (1,2,3,4) of lowest degree, because there are no such connections in the graph. Inspecting that graph, we see that none of the edges connected to C can be removed. As a result, the ensemble of random graphs has only two members, the one shown and the one obtained by rewiring A1 and B2 to A2 and B1. That rewiring leaves all degrees unchanged and so their ratio takes the value 1 everywhere, and hence does not show the lack of affinity between {A, B} and {1, 2, 3, 4}.
Their measure displays connectivity relative to that under random rewiring. Ours is relative to independence of row and column entities. This allows us to use the copula interpretation, described in Section 2.2.

Examples
Here we show some examples of the copula plots. The first set of examples come from bipartite graphs showing relations between two types of entity. The second set show directed graphs where the two quantities being displayed are in-degree and out-degree. The third group of examples are for symmetric matrices. Figure 3 shows the discrepancy plot for the Netflix data. The raw data take the form X ij if user i rated movie j. We are visualizing the rating events themselves, not the ratings. We comment briefly below on how the ratings themselves look.

Bipartite graphs
The left panel of Figure 3 shows that the Netflix data has a strong head-to-tail affinity. Users who rate few movies are over-represented at movies that received many ratings. Conversely, movies that are rarely rated get the majority of their ratings from users who rate many movies. This finding suggests a taste-based explanation: novices primarily rate blockbusters, while the cognescenti have also searched out some rare gems.
The head-to-tail spikes so dominate the plot that we cannot easily see other smaller structures. In the right panel of Figure 3, the data are replotted with a different gray scale. We have used 256 gray levels chosen so that each level is used (within rounding error) for the same number of pixels. As a result, the histogram of gray levels from the image is uniform. The left panel shows the discrepancy plot for the Netflix data. Dark spots in the upper left and lower right corners show a head-to-tail affinity for this data. The right panel using a uniform gray level reveals finer structure: some affinity among small movies and users, depletion for large movies and users, horizontal stripes and some weaker vertical stripes.
The uniform histogram view of the Netflix data reveals that there is a small tail-to-tail affinity but no head-to-head affinity at all, in the Netflix data. We see that the copula contours in the two head-to-tail corners have different shapes, more rectangular for big movies but rounder for big raters. There are also prominent horizontal stripes corresponding to consecutive blocks of movies that go against the grain compared to the bulk of the data. Similar vertical stripes are fainter. This may be because there are more raters than movies. Because these features are small compared to the head-to-tail affinities, they do not show up in the left panel. Combining both views shows more about the data than either on its own. Figure 3 displays where the ratings come from, but not how high or low those ratings are. To do that, we could define X ij to be some increasing function of {1, 2, 3, 4, 5} and redo the plot. We have done this, with some extreme scores. The first score takes X ij = 1 if the rating was 5, and 0 otherwise. It just looks at particularly fortuitous movie-customer combinations, the kind that a recommender would most like to have made. A second score took X ij = 1 if the rating was 1 and X ij = 0 otherwise. This score looks at the combinations that a recommender would most regret. Those plots (data not shown) both had strong affinities between inactive users and frequently rated movies, but much less affinity between busy users and rarely rated movies. Figure 4 shows the Yahoo! song ratings data. Here X ij is 1 if user i rated song j. Once again we see head-to-tail affinity, but it takes a different shape than it did for the Netflix data. The busy users dominate the bottom 10% or more of songs, much more than the busy movie raters dominated the least viewed movies.
Further differences between the two data sets show up in comparing the figures. While the movie data has some tail-to-tail overlap, the song data has very little. Also, the shape of the contours in the lower right corner (active raters  on unpopular items) is more rectangular for the songs than for the movies. For movies there is a progression where ever more rare movies get their ratings from ever more busy raters. For songs, the contour is much straighter.
We can extract numerical assessments of head-to-tail affinities from this data. The busiest raters rate the popular movies only about 22.5% as often as they would under independence. This and other summaries appear in Table 2. From that data we also see that the corner associations are stronger for movies than for songs. Also the two head-to-tail affinities are roughly equally strong in the song data, but, for movies, the affinity between busy users and rare movies is stronger than the reverse. Figure 5 shows data from the Internet Movie Database (IMDB). The variable X ij = 1 if and only if actor i was in movie j. Actors who worked rarely (e.g. only once) are overrepresented in movies with the largest casts and nearly absent from movies with smaller casts. On the other hand, the busiest actors are overrepresented in movies with small casts, but not the very smallest casts.
This display also provides an example where a single level accounts for a large proportion of the data: actors appearing in only one movie account for about 16% of the data. This explains the wide column of cells on the left hand margin of the display which has not been split due to ties.

Directed graphs
In a directed graph, each entity has both an in-degree and an out-degree. We are interested in visualizing the joint distribution of these two quantities.
Copula plots for some publicly available directed graphs appear in Figure 6. The upper left panel shows trust relations at the consumer review site Epinions. Here X ij = 1 if user i trusts user j. We see two dark clusters. At the lower right we see that there are lots of edges from users who trust many people to users who are trusted only by a few others. We might see that pattern in users with computer generated trust out-links. If many of the links are of that type, then it might be wise to discount such potentially spammy links. Near the lower left we see that there are lots of edges from users who trust only a few people to users trusted only by a few others. Such links might arise from infrequent users trusting their friends, though there are other possibilities (they could also be computer generated).
The upper right panel of Figure 6 shows a snapshot of the Wikipedia graph as of February 2007. Here X ij = 1 if the page for topic i links to the one for topic j. Each topic included had at least one in-link and at least one out-link. We might expect to see hubs and authorities (Kleinberg, 1999) in this graph. There is a cluster of topics with large out-degree and small in-degree. It included many lists, as we might expect for hubs. One striking mode represents pages with a medium-small number of out-links and a high, but not maximal, number of inlinks. Upon inspection, this hotspot included many topics that are either years or locations-in particular, countries. This is roughly what we might expect for authorities but they did not quite land in the upper left hand corner where one might have expected. Some pages have both small in-degree and small outdegree. Many of those are stubs. It is exceedingly rare for a topic to have both low out-degree and high in-degree.
The lower left panel of Figure 6 shows some data from Hewlett-Packard's Snapfish photo sharing service. This is a directed graph where X ij = 1 if user i shares a photo album with user j. Curiously, there is a fairly strong, though asymmetric, head-to-head affinity where very active sharers tend to share with the most active sharees. A large proportion of people that share photos do so only once, and, as indicated by the somewhat darker lower left corner of the plot, they tend to share with those that also accept very few sharing requests. The normalized view of this data in Figure 7 brings out some of these features. The lower right panel of Figure 6 shows a snapshot of the arXiv hep-th paper citation network. The normalized version of this plot is in Figure 7. Here X ij = 1 if paper i cites paper j. We see affinities among papers with few citations. Only citations within hep-th are counted so the affinity at the low end is not just among papers with few total citations or references.

Symmetric matrices
Undirected graphs have symmetric incidence matrices, so X ij = X ji . In graphs based on social network links we have seen positive associations: members with the most out-links get the most in-links and conversely. This is the opposite of the head-to-tail affinities we saw in ratings data. Such a pattern is not preordained by the symmetry. It is possible for graphs to contain a strong hub and spoke pattern, as for example protein networks do. Figure 8 shows two publicly available data sets. For the Enron email data, X ij is 1 if one or more emails were exchanged between addresses i and j. This version of the data is symmetric by construction. There are some head-to-tail affinities that we would expect if some email is broadcast to all or most accounts. There are also some affinities among smaller participants of equal size, giving dark squares along the main diagonal.
The second data set depicted in Figure 8 is based on a network of roads in California, where X ij = 1 if intersection i connects to intersection j. That is, vertices correspond to intersections and edges to road segments. The structure of this network is quite different from those involving people as entities. There is a very strong head-to-head affinity, as major intersections connect to each other. Road segments with few connections tend to connect to other such roads, with one important exception. Intersections composed of only one connection do not connect to each other. The graph is nearly planar (as would be expected) and the maximum degree is only 12.

Numerical summaries
To compare the plots we make a numerical summary. For any rectangle, we can sum the values of X ij in it and divide the sum by X •• times the area of the rectangle. This ratio gives us a lift statistic with a value of 1 corresponding to neutral affinity. To study corner affinities we have found it useful to measure the lift over small squares in the corners.

Proper ordering
The gray scale images we present depict bivariate copula densities. In forming the copula, we have replaced a categorical variable on many levels, such as a movie customer by one single number, that customer's rank. It is an enormous convenience to replace categorical variables such as customer, query string, credit card number, IP address and so on by a single nonnegative integer. But it is possible that some of those variable levels will be given the wrong rank, because the amount of data for each entity is random. Dyer and Owen (2010) study the extent to which the biggest entities are placed in the correct order in a random sample. They suppose that entity i appears X i times where X i ∼ Poi(N θ i ) and θ i are decreasing values. Entities i and i + 1 are in the correct order if X i > X i+1 . From their Lemma 2, we find that They show that for the Zipf-Poisson ensemble in which θ i = i −α for α > 1 the top (BN/ log(N )) 1/(2+α) are correctly ordered with probability tending to 1 as N → ∞, when B < α 2 (α + 2)/4. Because α > 1, we can take B = 3/4 and so we anticipate getting the top (3N/(4 log(N ))) 1/(2+α) entities in the proper order. Asymptotically those entities will account for all but a fraction o(N −(α−1)/(α+2)+ǫ ) of the total data i X i , for any ǫ > 0.
While the top entities are properly ordered, our eye is also drawn to the corners defined by small entities. Small entities are not properly ordered because for them, the sampling fluctuations are relatively large. In the Zipf-Poisson ensemble, the top CN 1/(2+α) entities for C > 0 will include some ordering errors as N → ∞.
Here we extend the analysis from Dyer and Owen (2010) to show that most of the entities of any size are nearly correctly placed.
Suppose that movie i appears X i times and customer j appears Y j times. That one rating contributes to the darkness of the pixel at a point given by the relative size of movie i among the movie data and customer j among the customer data. If both the customer and the movie are located near where they should be, then the data from that rating is near its proper location. If the bulk of the movies and customers, weighted by their data size, are properly located then the copula plot is descriptive of the process generating the data, not just the one data set at hand.
We will suppose that, marginally, one of the entity types (e.g. the movies) comes from a Zipf-Mandelbrot ensemble defined as follows. The number of times entity i is observed is X i ∼ Poi(N θ i ), independently, where θ i = (i + k) −α for i ≥ 1, k > −1 and α > 1. The Zipf-Mandelbrot form is more flexible than the plain Zipf law (k = 0) and it provides a qualitatively better description of the data we study.
We want to show that very little of the data will appear at any great distance from where it should. For τ > σ ≥ 0, let The quantity J represents a normalized total of all data from entities with mean below N σ that have sample values at or above N τ . These entities have jumped ahead of their true location. Let T = ∞ i=1 X i be the total sample size. Then E(T ) = N ∞ i=1 θ i (and V(T ) = E(T )) and so the proportion of data jumping from below σ to over τ is nearly J(τ, σ)/C, for large N Similarly, we define The quantity S represents a normalized amount of data from entities that have slipped from a true location above N τ to a position at or below N σ. Our graphic is based on a fixed partition, such as 100 bins, into which the X i are placed. Suppose that the bin boundaries are β k for k = 1, . . . , 99. If all of J(β k , β k − ǫ k ) and S(β k + ǫ k , β k ) are small, for small ǫ k > 0, then the data within each bin are representative of the entities that belong in that bin.
Proof. We prove this in the Appendix.
Theorem 1 shows that very little of the total data counts can have jumped from below expectation N σ to N τ or above. From Markov's inequality, the jump fraction satisfies P(J(τ, σ) > ǫ) = O 1 N as N → ∞ for any ǫ > 0, and τ > σ > 0.
Proof. We prove this in the Appendix.

Data from the deep tail
In the Zipf-Mandelbrot-Poisson model there are an infinite number of entities. The very small entities comprise the deep tail of the ensemble. We can use Theorem 1 to get an estimate of the total amount of data that could jump from the deep tail into the near tail. We define the near tail by excluding the largest entities which comprise a proportion 1 − ǫ of the expected data. The deep tail is similarly defined through η where 0 < η < ǫ ≪ 1. The bound in Theorem 1 does not depend on the parameter k of the Zipf-Mandelbrot distribution, and so we illustrate it with k = 0, corresponding to the Zipf distribution. To simplify the formulas, we approximate the Zipf distribution with probability mass function proportional to i −α by the Pareto density αx −α on 1 < x < ∞. The u'th quantile of the Pareto distribution is x u = (1 − u) −1/α and the Pareto density there is αx −α u = α(1 − u) 1+1/α . Therefore the relevant thresholds are τ = αǫ 1+1/α and σ = αη 1+1/α .
The expected fraction of mass jumping from below σ to over τ is asymptotic to For illustration, taking ǫ = 0.01 and η = 0.005 and α = 2 we get 1 N 10 2 5/2 − 1 . = 2.14 N so very little of the data for sample entities making up the top 99% will have come from population entities not among the top 99.5%. It is not necessary to scale the thresholds proportionally to N . We can take N τ = 1 and N σ = η < 1 and find by equation (7) (a non-asymptotic expression used in the proof of Theorem 1 in the Appendix) that Thus variables X i with E(X i ) < η < 1 contribute a vanishing fraction of the total data, though there are infinitely many of them.

Affinity models
In this section we present some simple generative models for networks in which head-to-tail affinities arise. The first model is a saturation model in which a head-to-tail affinity appears as a simple consequence of no rater being allowed to rate any item more than once. The second model is a bipartite preferential attachment model in which each entity type exhibits a size based preference for the other entity type. Every entity starts out with a single edge and a preference for the pre-existing large entities of the opposite type. These models show that head-to-tail affinities can arise from very simple processes. The models have one or two parameters each. Thus they provide only crude approximations to the empirical copulas we have seen which may require several degrees of freedom in each of four corners to describe well. Figure 9 shows example copulas from each of these models.

A saturation model
In ratings data we often see strong head-to-tail affinities between raters and items. This may be explained through the idea that sophisticated raters have branched out to the less well known items, while the neophytes mostly stay with the items of massive popularity.
Before adopting such a taste-based explanation we should at least consider a much simpler one. A rater is very unlikely to rate the same item twice. Even if this happens, the system may well retain only the last rating that was made.  There are just not enough popular items for a busy rater to rate. These saturation effects alone would induce negative dependence. A large saturation effect has already been noted by Maslov et al. (2004) for symmetric networks, such as the Internet, where there are greatly diminished connections among the most connected nodes. We introduce a model that has independence apart from the limitation of one rating per rater-item pair. We let Y ij ∼ Poi(N ci −a j −b ), and, where c = c a c b = 1/(ζ(a)ζ(b)) and ζ(x) denotes the Riemann-zeta function. The random variables Y ij comprise a bivariate Zipf-Poisson ensemble. The Y ij are latent and we only observe the truncated values X ij . In the latent model, row and column entities are generated independently. The truncation that turns Y ij into X ij is more likely to deplete the head-to-head combinations than any others and this produces head-to-tail affinity. Figure 9(a) shows an example simulated with a = 1.5, b = 2.5 and N = 10 7 . We do indeed see an asymmetric negative dependence in the corners, though the affinity extends farther towards the center of the square than we have seen in real data. Figure 10(a) shows the marginal Zipf plot of the columns (the plot for the rows is similar). Instead of sorting the entities by observed counts, we have kept the original ordering. We see that these expected counts have curvature. Also the slope near the origin is not a, but has instead been altered by the sampling process. The bounds shown are those of the following theorem.  Theorem 3. Let X ij be sampled as the saturated bivariate Zipf-Poisson ensemble (3). Let X i• and X •j be the marginal sums. Then and, as i → ∞, By symmetry, analogous results hold for X •j where we swap the roles of i and j, and a and b, respectively.
Proof. We prove this in the Appendix.
From Theorem 3, the marginal distribution of the row entity behaves as a power law that starts at slope a/b for the largest entities and transitions to slope a for the small ones. Conversely the column entities have slope starting at b/a and transitioning to b. As a consequence, the large entities of one type follow a power law with rate ≤ 1 while large entities of the other type have a rate ≥ 1. We have not seen that pattern in any of the data sets we've investigated. As a result we believe that the head-to-tail affinity often seen in ratings data is not simply due to saturation. Models invoking taste therefore seem more plausible.

Bipartite preferential attachment
A second model for these data is a bipartite preferential attachment model. The model constructs a bipartite graph via a simple extension of the Barabási and Albert (1999) model.
There are several generative models for bipartite graphs. Bipartite graphs in which each node type have the same degree distribution can have edges randomly assigned as in Newman et al. (2002). Graphs in which one kind of node has a prescribed degree distribution and the other kind is sampled by preferential attachment have also been considered Guillaume and Latapy (2006).
We investigate a preferential attachment model that generates the degree distributions along with the edge connectivity. Bipartite preferential attachment describes a random graph with two parameters, an integer valued time t ≥ 1 and a probability p ∈ (0, 1). There are two node sets, M and N , corresponding to row and column entities respectively. At time t the graph has nodes i = 1, . . . , m(t) from M and nodes j = 1, . . . , n(t) from N . We will assume that the entity sets are distinct. The graph is represented by an ∞×∞ matrix with elements X ij = X ij (t) ∈ {0, 1}, of which only t elements are nonzero.
The process starts at t = 1 with m(1) = n(1) = 1 and a single edge connecting node 1 of M with node 1 of N . That is X 11 = 1 and X ij = 0 if i > 1 or j > 1. At each time t ≥ 2, we sample U t ∼ U (0, 1). If U t ≤ p, then we add a new node i = m(t) = m(t − 1) + 1 to M and connect it at random to one of the nodes j ∈ {1, . . . , n(t − 1)} in N , thereby setting X ij = 1. If U t > p, then we add a new node to N and connect it at random to one of the nodes 1, . . . , m(t − 1) of M. The random connections are always made by preferential attachment. A new node of one type is connected to a particular old node of the other type with probability equal to the degree of that old node at time t − 1, divided by the total number t − 1 of edges.
In Barabási and Albert (1999) it is argued that the degree distribution of the vertices in (unipartite) preferential attachment graphs decays as a power law with exponent 3. That is, if p k is the proportion of vertices of degree k, then p k = Θ(k −3 ) as the number of vertices goes to infinity. This was further formalized in Bollobás et al. (2001). Degree distributions in bipartite preferential attachment are fundamentally different from those in the unipartite case.
Theorem 4. Let X ij (t) be sampled from the bivariate preferential attachment model with p ∈ (0, 1) and q = 1 − p. Let X i• (t) and X •j (t) be the marginal sums and let M (k, t) = i 1 Xi•(t)=k and N (k, t) = j 1 X•j (t)=k be the number of vertices of degree k in M and N , respectively at time t. Then where the arrows denote both convergence of the mean and convergence in probability as t → ∞, and the asymptotic equivalence holds as k → ∞.
Proof. We prove this in the Appendix.
Theorem 4 shows that both marginal distributions follow power laws. Unlike the Barabási-Albert model, the degree distributions for the bipartite model do not always have a scaling coefficient of 3. One margin has a coefficient in the range (2, 3] and the other coefficient is in the range [3, ∞). In real-life networks, it has been often observed that scaling coefficients tend to fall between 2 and 4. Some extensions to the basic Barabási-Albert model do generate scaling laws with coefficients in (2, ∞) (Durrett, 2006, Chapter 4).
Like the saturation model, the bipartite model generates head-to-tail affinities, but they do not concentrate in the corners. This can be seen in Figure 9(b). In the appendix, we provide bounds in addition to the asymptotic statements in Theorem 4. Figure 10(b) shows the degree distribution of M for a simulation of one million (total) vertices with p = 4/9.

Discussion
We have investigated head-to-tail affinities for bivariate heavy tailed data arising from bipartite networks and directed networks. A graphical display reveals the locations and strengths of these affinities, along with other affinities that are initially surprising. Our display depends on ordering the entities by sample values of their magnitudes. Results on the bulk ordering of values show that for large sample sizes as we see in Internet applications, the graphs are indicative of properties of the underlying entities, not just the data at hand.
The copulas we have seen in real data rarely resemble classical parametric copula densities. As a result we advocate plotting the actual copula estimates instead of fitting parametric models. The Wikipedia plot in particular is distinct from all of the usual parametric copula densities.
Graphical displays cannot compete with sophisticated machine learning algorithms when the goal is to predict something like the rating a user will give an item. In those settings the best performing algorithms may be uninterpretable combinations of hundreds of predictions. The strength of graphical displays is that they can bring qualitative information to the attention of domain experts who may then interpret them and perhaps change the system somehow. We have found that people quickly start thinking about what the dense spots and voids in our copula density plots might mean in terms of the underlying entities.
No single display can capture all of the structure in enormous data sets of this kind. The graphs we present do not show explicit community structure, as, for example, those of Newman and Girvan (2004) and Palla et al. (2005) do. By grouping row and column entities by size we merge members from different communities revealing patterns that hold across communities and not necessarily within communities.
The images we present can easily be generalized. The row entities can be sorted by one variable, the column entities by another, and a third quantity can be used to define the gray level. We have used the same quantity in all three roles to simplify exposition and interpretation. ful comments. We are grateful to those who shared data with us. Yahoo! Inc. provided the music ratings data under the Yahoo! Webscope program. Fereydoon Safai of Hewlett-Packard Labs facilitated our use of the Snapfish data. The Wikipedia data came from David Gleich. The other data came from Jure Leskovec.
This work was supported by the U.S. National Science Foundation under grant DMS-0906056 and by a National Science Foundation Graduate Research Fellowship.

Appendix: Proofs
Section 7.1 proves our two results on entities that either jump ahead or slip behind their proper place in the data ensemble. Section 7.2 proves Theorem 3 on the saturation model. Finally, Section 7.3 proves Theorem 4 on properties of our bipartite preferential attachment model.

Accuracy of the bulk ordering
We begin with a lemma that generalizes a bound of Shorack and Wellner (1986, page 485) on the Poisson tail probability, to certain tail factorial moments. For integers p ≥ 1, define X (p) = X(X −1) · · · (X −p+1). The p'th factorial moment of X is E(X (p) ).
Lemma 1. Let X ∼ Poi(λ) and let p ≥ 1 be an integer. Then for integers t > λ, Proof. For t ≥ p, Now t!/(t + ℓ − p)! ≤ t p−ℓ holds for integer ℓ ≥ 0, trivially for ℓ = p and by simple direct arguments in cases ℓ > p and ℓ < p. Therefore If t < p, then E(X (p) 1 X≥t ) = E(X (p) 1 X≥p ) and the result follows as before. Shorack and Wellner (1986) give the bound P(X ≥ t) ≤ (1 − λ/t) −1 P(X = t) which can be interpreted as the p = 0 version of Lemma 1. The case with p = 2 is useful for bounding the variance of J(τ, σ) defined below. We omit that computation for reasons of space. The case of most interest to us has p = 1. Then t > λ ≥ 0 implies t ≥ p and so equation (4) becomes We will also use Gautschi's inequality (Gautschi, 1959) on the Gamma function, which holds for x > 0 and 0 < s < 1. Now we are ready to examine the amount of data jumping over thresholds. Let τ > 0 be a threshold. Let σ < τ be a second threshold. Then is a normalized version of the total count from entities with mean below N σ that have 'jumped' up to the threshold N τ > 0. The indices of the jumping observations are i ≥ n(σ) where n(σ) = min{m ≥ 1 | θ m < σ}. We may assume that n(σ) ≥ 2 because X 1 has no other entity to jump ahead of. Now by equation (5).
For the Zipf-Mandelbrot-Poisson ensemble, after making the substitution y = N (x + k) −α .
Applying the bound for I we get Gautschi's inequality (6) gives Γ(N τ − 1/α)/Γ(N τ ) < (N τ − 1) −1/α , once N > 1/τ , and then Now we turn to the amount of data from entities that have slipped below their proper places. Here is a normalized version of the total count from entities with mean above N τ that have 'slipped' below the threshold N σ, where once again τ > σ ≥ 0. We will use the following result which is equation (9) of Shorack and Wellner (1986, page 485). It then easily follows that E(X p 1 X≤t ) ≤ t p P(X ≤ t) ≤ (1 − t/λ) −1 P(X = t)t p .

Proof of Theorem 3
We will make use of the following integral.
Lemma 2. Let α > 0, β > 1. Then For the lower bound we will use the following elementary inequality, valid for all real x, An application of (12) to E(X i• ) yields Each term on the right-hand side decreases as j increases, and so as desired.

Proof of Theorem 4
The results will mostly be established via application of two lemmas in Durrett (2006) along with arguments adapted from Bollobás et al. (2001). Lemma 4 (Azuma-Hoeffding inequality). Let X t be a martingale with uniformly bounded increments. Then where c is the bound on the martingale increments.
We begin by observing that at any time t, there are exactly t edges in the bipartite graph. We will focus on the analysis of M (k, t), keeping in mind that the results for N (k, t) are entirely analogous, except that we replace the sampling probability p with 1 − p.
First consider M (1, t), i.e., the number of vertices in M at time t with a single edge. At time t + 1, we either add a new vertex of unit degree with probability p, or preferential attachment is performed on M with probability q = 1 − p. Hence E(M (1, t + 1) − M (1, t)) = p − q t E (M (1, t)).