Experiments as Instruments: Heterogeneous Position Effects in Sponsored Search Auctions ∗

Google and Bing employ generalized second price (GSP) auctions to allocate billions of dollars of sponsored search advertising. The theoretical work establishing the VCG equivalence of the GSP hinges on strong, and largely untested, assumptions about the causal impact of ad position on user click probability and the value of resulting clicks. We re-purpose internal business experimentation to test these assumptions using a broad cross-section of advertisers. We find a minimal, homogeneous impact of position on average click quality, providing a valuable assurance in that domain. In contrast, we find substantial heterogeneity of the impact on click-through rates. For brand queries, off-brand competitors show much steeper click curves than the advertiser matching the query. For generic queries, higher quality and less popular websites benefit more from position. These findings strongly reject the conventional multiplicatively-separable model and the associated VCG equivalence, raising serious questions about the efficiency properties of the GSP in sponsored search.

∗ We would like to thank the Microsoft Corporation for allowing us to publish this work and Jim Andreoni, Susan Athey, Dean Foster, Sebastién Lahaie, R. Preston McAfee, Andrey Simonov and Vasilis Syrgkanis for helpful discussion and comments.
† Much of this work was done while Goldman was an intern at Microsoft Research.


Introduction
Sponsored search links, the paid advertisements on a search engine results page, generate over forty billion dollars annually for search engines. Google and Bing/Yahoo!1 use a generalized second price auction (GSP) to allocate and price ad slots using a "pay per click" bidding model. Since payment is contingent on a user clicking, ranking is not based simply on the contingent bids, but also on an advertisement's "clickability" and relevance to the query. These parameters directly impact efficiency and must be estimated by the platform. Despite these substantial departures from classical auctions, the seminal works of Edelman et al. (2007) and Varian (2007) show that under certain assumptions the GSP is payoff equivalent to the dominant strategy Vickrey-Clarke-Groves (VCG) mechanism. It is hard to overstate the importance of this result in both academia, where the papers have been extensively cited, and industry, where the authors hold, or held, prominent positions at search engines.2 By linking payoffs in the potentially problematic GSP, which evolved out of a simple rank-by-bid first price auction designed in the late 1990s, to the VCG, a virtual gold standard in mechanism design, the authors provided valuable reassurance about a mechanism that was in place largely due to historical accident. Yet this result did not come without some fairly strong assumptions such as complete information, which is challenged in Athey and Nekipelov (2010), and the "envy-free" equilibrium refinement, which is reexamined in Gomes and Sweeney (2012) and Lucier et al. (2012). Proponents of the GSP have countered that these assumptions usefully approximate long-run play, and the VCG-equivalent equilibrium has remained popular.

In this paper, we empirically assess another assumption required for these results to hold, namely the separability of the click curve: the impact of moving up the page (towards the top slot) is assumed to scale click-through-rate (CTR) by the same multiplicative factor for all advertisers.3 In other words, the search engine estimates the differential value of each slot via a single, multiplicative click curve. This curve is used to convert observed CTRs into "baseline clickability," which is, in turn, used to predict the CTR of ad-position pairs in any proposed allocation. It is clear that the accuracy of the estimated click curve is essential for the GSP to achieve an efficient allocation. If a single click curve does not apply to all advertisers: 1) the notion of a scalar-valued "baseline clickability" loses meaning and efficient allocation of multiple ad positions by a GSP-style mechanism is impossible, 2) for any chosen definition, "baseline clickability" will be misestimated, and 3) there are no known equilibrium properties of the GSP and certainly no VCG equivalence.

However, estimating advertiser-specific position effects is difficult because position is an endogenous outcome of the auction. Ads are ranked higher when they are expected to get more clicks by the search engine (because of pay-per-click pricing) or advertiser (due to events like sales or geographic targeting). Our approach uses experiments conducted as a part of normal business practice as instrumental variables. These experiments randomize users into conditions to test new algorithms, such as those applying to click prediction or relevancy score estimation. Since the experiments were not designed to induce variation in ranking per se, we have to address a variant of the weak instruments problem. Intuitively, an experiment is only "turned on" when it induces a different ranking of ads from the "control" for a given query. This shuffling only occurs when an advertiser's bid and clickability metric drift into a region where the employed ranking algorithms "disagree." A variety of regularization approaches exist to select a set of instruments based on first stage fit, but most are designed for a large, fixed class of candidate instruments (Belloni et al., 2012; Okui, 2011). In our case, the strength of each instrument is moderated by a continuous covariate, time, so a natural solution is the regularization approach presented in Carrasco (2012), which nonparametrically constructs optimal instruments that apply more weight to time periods when an experimental condition has higher relevance. The approach is computationally efficient and allows us to estimate position effect curves for many of the top 20,000 (by revenue) advertiser-query pairs on Bing. In contrast, nearly all empirical work on sponsored search uses a handful of queries or works with one focal advertiser.
Our results are unequivocal: there is substantial heterogeneity in position effects across advertisers. The heterogeneity is not only highly statistically significant, but the magnitudes of the differences are large. The impact of position varies by more than 100% across three key mainline types we identify: 1) on-brand ads appearing on brand queries (e.g., a Samsung ad on "Samsung smart phone"), 2) off-brand ads on brand queries (e.g., a Nokia ad on "Samsung smart phone"), and 3) ads on non-brand, product queries (e.g., any ad on "smart phone"). Within each segment, position effects vary as well. For product queries, high quality (based on independently gathered user engagement metrics) and less well-known websites benefit more from a higher position. In other words, quality is a complement to position whereas popularity is a substitute. For brand queries, the features of the on-brand advertiser are not systematically related to position effects. For the off-brand advertisers, less well-known firms benefit more from position, but the magnitude is smaller than for product queries.
We also investigate the homogeneity of click value assumption that is maintained in the foundational papers and subsequent work, namely that the value of a click is assumed not to depend on an ad's position. Three recent papers examining the impact of position on the probability of a sale ("conversion rate"), all focusing on a single advertiser, reach different conclusions. One finds a positive impact of moving up the page (Ghose and Yang, 2009), one negative (Agarwal et al., 2011) and one zero (Narayanan and Kalyanam, 2014). Further complicating matters, theories of search and the psychology of choice can produce arguments for all three patterns (Lana, 1963; Brunel and Nelson, 2003; Joachims et al., 2005; Kempe and Mahdian, 2008), as does work on equilibrium allocation under different models of consumer search (Jerath et al., 2011; Athey and Ellison, 2011).
Using the same methodology as described above, we estimate a statistically significant, positive impact of moving up the page on conversion rate in aggregate, but the size of the effect is less than two percent of the baseline mean. This may be interpreted as a (less than) 2% change to the value of a click. Since our standard errors are smaller than one percent and we do not find significant heterogeneity, we conclude that the homogeneous click value assumption is not problematic for GSP efficiency. Moreover, the scale of our analysis allows us to resolve an outstanding inconsistency in the literature.
So while the homogeneity of click value assumption holds approximately, the separability of the click curve fails decisively. This calls into serious question the efficiency of one of the most celebrated economic mechanisms of recent times. Billions of dollars are likely being lost every year as compared to VCG, which is perfectly capable of incorporating heterogeneous position effects. Making such a serious claim requires the scale and precision our study was designed specifically to deliver.
Stepping back a bit, our analysis is relevant to a broader class of contingent-bid mechanisms used online. Content providers like Facebook, Twitter, and LinkedIn and search platforms like Kayak and Yelp all sell advertising on a per-action basis (usually the action is a click). The popularity of this bidding format over a per-impression framework, where each advertiser must determine how impressions map to actions, implies that platforms have a better understanding of this mapping than advertisers would in a decentralized setting. This could be because the platform can use data from all advertisers to inform estimates for any given advertiser or due to simple economies of scale in estimation (or both). Both factors improve the value proposition for advertisers compared with traditional media, such as television, where action-contingent payments are not currently possible.
But the efficiency of the "centralized solution" relies on the platform choosing the right mechanism and accurately estimating the relevant quantities. And while the GSP remains a popular choice, Facebook famously opted for VCG, indicating the situation is certainly fluid. New providers face an important mechanism design choice and should carefully consider empirical evidence, such as the estimates contained in this paper, in making this choice.4

Shifting to a methodological perspective, we show that OLS, even with advertiser-specific controls for time-related confounds, is biased. Both of our IV methodologies estimate click curves that are significantly flatter on average, as would be expected based on the endogeneity of position. Additionally, Dummy TSLS produces standard errors about three-fold larger than our preferred Nonparametric IV estimator, and the first stage regression was not viable most of the time. The regularization approach was thus crucial for our ability to make useful generalizations.

More broadly, "experiments as instruments" has many potential applications since active business experimentation is becoming increasingly common.5 A company that experiments can easily tally thousands of randomized control trials per year. We show how this pool can be usefully re-purposed for causal inference on applications where direct experimentation is infeasible or, as in this case, financially prohibitive. Since these experiments only impact ad ranking for some values of time, we relied on a recent advance that was ideally suited to regularizing over this continuous underlying dimension moderating instrument strength. Related approaches, such as Belloni et al. (2012) and Okui (2011), would be natural to apply in discrete settings.
The remainder of this paper proceeds as follows: in Section 2 we explain the rules of sponsored search auctions in detail, in Section 3 we describe the data and platform experimentation, in Section 4 we discuss our methods of estimation, results are presented in Sections 5 and 6, and a discussion and conclusion follow in Section 7.

The Generalized Second Price Auction
A sample search engine results page for a popular commercial query is displayed in Figure 1. It is divided into three key areas: "north" or "mainline" sponsored listings (ads), algorithmic results (the information retrieved for the query by the search engine) and "east" or "sidebar" ads, which are less prominent and displayed further down the page. Mainline ads, if shown, are located directly above the algorithmic results and constitute the vast majority of revenue for the search engine. During the time period of study, the major U.S. search engines displayed 4 mainline ads (and never more than 4) on most popular commercial queries.
These four ad slots can be thought of as consecutively diminishing prizes. In this spirit, they may be referred to as (in descending order) ML1, ML2, ML3, and ML4. These are allocated according to a generalized second price auction (GSP) as follows. For a given query, each ad $i$ is assigned a "rank score" ($s_i$) according to

$$s_i = S(b_i, q_i; \alpha),$$

where $b_i$ is the bid, $q_i$ is an index of ad quality (generally taken to be an advertiser's "baseline clickability" as estimated by the search engine and discussed in the previous section) and $\alpha$ gives ranking parameters set by the search engine.6 A common formulation is given by

$$s_i = b_i \, q_i^{\alpha}.$$

Here, if $\alpha$ is set to 0, the auction is rank-by-bid. If $\alpha$ is set to 1, the ranking is by value per search. The revenue optimal $\alpha$ will depend on the joint distribution of $q$ and $b$, but usually lies somewhere in between these extremes (Lahaie, 2006; Athey and Nekipelov, 2010). Ads are ranked by $s_i$ and the top four 7 are shown. The advertiser in slot $k$ is assigned the smallest per click payment that maintains her superior rank score: $s_k > s_{k+1}$. That is, her cost-per-click (CPC) is given by

$$\mathrm{CPC}_k = S^{-1}\big(\max\{s_{k+1}, r^*\};\, q_k, \alpha\big),$$

where $S^{-1}$ gives the bid necessary for an advertiser of given quality to reach a certain rank score and $r^*$ is a reserve value set to anchor the auction and ensure that the advertiser in the final mainline [...]

Typically $p_{i,k}$, the click probability of ad $i$ in slot $k$, is assumed to be a function only of the quality index ($q_i$) and a multiplicatively separable slot effect ($\mu_k$). That is, for any advertiser $i$,

$$p_{i,k} = q_i \, \mu_k,$$

and the values of $\mu$ are said to give the shape of a global click curve, an example of which is shown in Figure 2.

7 More generally, the number of ads may range from 0-4 depending on reserve prices, auctioneer policies, and the number of qualifying bidders.
This formulation is maintained in the theoretical literature and is part of standard industrial practice for estimating $q_i$ from observed data. It is referred to herein as the assumed separability of the click curve. The theoretical literature also typically assumes bidders have a linear preference for clicks given by $v_i$. Notable here is that the value of a click is not allowed to depend on the position $k$ in which the click occurs. We refer to this assumption as the homogeneity of click value. Both the separability of the click curve and the homogeneity of click value are core theoretical properties of these auctions that underlie their VCG equivalence (Edelman et al., 2007; Varian, 2007).
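The ranking and pricing rules of this section can be sketched in a few lines of Python. This is a minimal illustration only, assuming the common rank score formulation (bid times quality raised to the power α) and assuming the reserve is expressed as a rank score. The `Ad` class, the function names, and the example numbers are hypothetical and are not taken from the platform.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float      # b_i: maximum payment per click
    quality: float  # q_i: baseline clickability estimated by the platform


def rank_score(ad, alpha):
    """Assumed rank score s_i = b_i * q_i**alpha."""
    return ad.bid * ad.quality ** alpha


def run_gsp(ads, alpha=1.0, reserve_score=0.0, slots=4):
    """Return (name, slot, cost_per_click) for each ad that wins a slot."""
    eligible = [a for a in ads if rank_score(a, alpha) >= reserve_score]
    ranked = sorted(eligible, key=lambda a: rank_score(a, alpha), reverse=True)
    outcome = []
    for k, ad in enumerate(ranked[:slots]):
        # Score the winner must beat: the next ad's score, or the reserve.
        if k + 1 < len(ranked):
            threshold = max(rank_score(ranked[k + 1], alpha), reserve_score)
        else:
            threshold = reserve_score
        # Invert the score function: smallest bid that reaches the threshold.
        cpc = threshold / ad.quality ** alpha
        outcome.append((ad.name, k + 1, cpc))
    return outcome


# Hypothetical bidders: (name, bid, quality).
ads = [Ad("a", 2.0, 0.10), Ad("b", 1.0, 0.30), Ad("c", 1.5, 0.10)]
by_value = run_gsp(ads, alpha=1.0)   # ranks by bid times clickability
by_bid = run_gsp(ads, alpha=0.0)     # pure rank-by-bid
```

With α = 1 the high-quality ad `b` takes the top slot despite having the lowest bid, and its cost-per-click is discounted by its quality; with α = 0 the ordering follows bids alone.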

Data
Our data cover the outcomes of billions of commercial searches on bing.com by English-speaking Americans from April through July of 2013. From this, we draw the 20,000 unique mainline combinations from the top of the revenue distribution for that period.8 For each search, we have the participating bidders, quality and relevance parameters attached to each bidder, the ranking and pricing parameters chosen by the auctioneer, serving logs, click logs and conversion tracking (for a subset of advertisers). Importantly, we do not have access to any geographic or individual-level information about where the query was served. These latent variables may impact both ad position and click probability and present a potentially problematic source of endogeneity. However, each user is randomly assigned to an experimental traffic cell in which the method of ad ranking and display may be altered. As discussed in Section 3.2, many of these experiments will meet the criteria of being valid instrumental variables.
Each ad that is displayed on a results page is referred to as an impression. An ad click occurs when a user selects one of these advertisements (as opposed to one of the unpaid search results) and an ad's click-through-rate (CTR) refers to its probability of getting a click per impression. After a consumer makes a click, dwell time measures the amount of time she spends on the resulting page before hitting the back button. If the consumer never returns to the results page, then dwell time is not recorded and will be said to be "infinite." Additionally, a subset of advertisers choose to report conversions, which indicate whether or not consumers met some advertiser-defined standard of participation at the advertiser's website. The definition of a conversion varies by advertiser, and the resulting conversion rate varies with it, so our main analysis will use a normalized version of this metric. Pooled averages of these metrics for ads in each position are displayed in Table 1.9

Ads displayed in the first mainline spot are clicked on much more often and tend to have much more engaged clicks (higher dwell time). Conversion rate is not higher in the top slot, but dwell time is, indicating that differing conversion standards are likely playing a role here. The first mainline slot generates a majority of revenue in our sample. These advertisers are very popular and get roughly two-thirds of all ad clicks. However, given the pricing rule used in the GSP, a higher clickability not only improves ranking but also reduces cost-per-click. The net effect, which may seem surprising at first, is that clicks in the first position generate the least revenue per click for the queries in our sample.

Measures of User Engagement
We use two measures as proxies for the value of a click to an advertiser: conversions and dwell time.
Conversions are only tracked for a subset of advertisers and provide binary information about user engagement. For some advertisers, a conversion may indicate only an expensive purchase, whereas for others it may indicate merely that a user completed a free registration or signed up for an email list. As such, baseline conversion rates can vary dramatically across advertisers. However, for a given advertiser, changes in conversion rate are meaningful and we will interpret an x% increase in conversion rate as representing a corresponding x% increase in average click value.
Dwell times are available for all advertisers and are a continuous measure of the time between a user's initial click on a sponsored link and her eventual return to the results page. If a user never returns, the dwell time is labeled as "infinite." Intuitively, long and infinite dwell times are desirable as they indicate better matches and more engaged users, but they are not directly interpretable. We thus calibrate them using conversion data by estimating the relationship between conversions and dwell time according to

$$\Pr(\text{conversion}_{it} = 1 \mid d_{it}) = \alpha_i \, \phi(d_{it})$$

for a click that has occurred on ad $i$ on impression $t$. Here $\alpha_i$ is an ad-specific effect allowing for heterogeneity in baseline conversion rates and the nonparametric function $\phi$ gives a multiplicative shock to conversion probability based on an observed dwell time $d$. This model lacks some generality, but [...] We define mapped dwell time as the additional regressor, $md \equiv \phi(d)$. This can be thought of as a normalized conversion probability; that is, we interpret an x% increase in the average value of $md$ as representing an x% increase in any given advertiser's conversion probability which we, in turn, interpret as an x% increase in click value.
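One way to make this calibration concrete is sketched below. The text does not spell out how φ is estimated, so the following is an illustration under stated assumptions rather than the authors' procedure: conversion probability is taken to be an ad effect times φ(dwell time), dwell time is discretized into bins as a crude stand-in for a nonparametric φ, and the two sets of parameters are fit by alternating updates with the first bin normalized to one. The function name, the binning, and the toy counts are all hypothetical.

```python
def fit_multiplicative(clicks, conversions, iterations=200):
    """Fit conversions[i][b] ~ clicks[i][b] * alpha[i] * phi[b].

    clicks[i][b] and conversions[i][b] are totals for ad i in dwell-time bin b.
    Returns (alpha, phi) with phi[0] normalized to 1.
    """
    n_ads, n_bins = len(clicks), len(clicks[0])
    alpha = [1.0] * n_ads
    phi = [1.0] * n_bins
    for _ in range(iterations):
        # Update each ad effect holding the bin multipliers fixed.
        for i in range(n_ads):
            exposure = sum(clicks[i][b] * phi[b] for b in range(n_bins))
            alpha[i] = sum(conversions[i]) / exposure
        # Update each bin multiplier holding the ad effects fixed.
        for b in range(n_bins):
            exposure = sum(clicks[i][b] * alpha[i] for i in range(n_ads))
            phi[b] = sum(conversions[i][b] for i in range(n_ads)) / exposure
    # Resolve the scale indeterminacy: phi[0] = 1, absorb the scale into alpha.
    scale = phi[0]
    phi = [p / scale for p in phi]
    alpha = [a * scale for a in alpha]
    return alpha, phi


# Toy data generated exactly from alpha = (0.02, 0.10) and phi = (1, 2, 4).
clicks = [[500, 300, 200], [100, 400, 500]]
true_alpha, true_phi = [0.02, 0.10], [1.0, 2.0, 4.0]
conversions = [[clicks[i][b] * true_alpha[i] * true_phi[b] for b in range(3)]
               for i in range(2)]
alpha_hat, phi_hat = fit_multiplicative(clicks, conversions)
```

Under this sketch, the mapped dwell time of a click would be `phi_hat[b]` for the bin `b` containing its dwell time, so two ads with very different baseline conversion rates share the same dwell-time scale.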

Experimentation
As is evident from equation (2), search engine revenue is directly affected by the parameters of the rank score function (Lahaie, 2006; Athey and Ellison, 2011). This motivates constant experimentation with new click prediction, relevance, and ranking algorithms. At a given time, Microsoft's Ad Center is operating many experiments on the Bing platform, each of which alters one or more parameters that govern ad ranking or display. Search results pages are randomized into these experiments at either the user or search level. Randomization is based upon a mathematical operation on a user's browser cookie, which assures that searchers are randomized identically, regardless of any geographic or individual-level factors. Additionally, advertisers have no control over which experiments they are involved in and cannot bid differently by experimental condition. The probabilistic weights that determine how many users go into each experiment are altered over time as older experiments expire or are expanded and new ones are introduced. Thus, experimental assignment is exogenous to a host of potential advertiser-related confounds, but only after conditioning on time.
These experiments have a wide range of goals, such as improving user experience through more relevant ads, improving ad quality estimation, or changing the ranking of ads to maximize revenue. Although the experiments were designed for many proximate reasons, many have the effect of shuffling the ranking of ads. This is intuitive; the ranking of ads is the primary impact the auctioneer has on the marketplace. We label the largest experimental cell as our control and use the other experiments as instrumental variables for causal inference. A challenge we have to address is that, in addition to experimenting with algorithmic components of ad serving that impact ranking, the search engine also tries out different visual displays of ads, which can directly affect user behavior. These experiments produce endogenous IVs in our causal model and must be carefully pruned. Prior to any estimation, these experiments can be detected if they produce a statistically distinct click probability (relative to the control) while an ad is held in a fixed position. Figure 4 gives such an example where an invalid instrument is clearly identified. This advertisement was exclusively listed in the first mainline position in all three experimental conditions. The green (labeled as valid) experiment gives statistically identical click probabilities to the control, indicating that it may be a valid instrument. However, the red experiment gives significantly larger click probabilities, indicating that the ad was displayed differently in that condition. This pattern of results would lead us to conclude that data from the red experiment should be removed from our analysis. If an experiment is shown to form an invalid instrument for a statistically significant subset of our ad-query pairs, it demonstrates the invalidity of that experiment globally. Scaling the analysis presented in Figure 4 gives us much greater power to prune invalid experiments.10 After pruning invalid experiments and aggregating experiments into relevance clusters, each of which produces identical distributions of ad position across all 20,000 ad-query combinations, we are left with 40 unique experiments. A histogram of this distribution can be found in the Appendix.
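The pruning check for a single ad held in a fixed position can be sketched as a two-proportion test of the experiment's CTR against the control's. This is only an illustration of the idea: the specific test, the critical value of 3, and the counts below are assumptions, and the paper pools evidence across many ad-query pairs rather than relying on one comparison.

```python
import math


def ctr_z_statistic(clicks_exp, imps_exp, clicks_ctl, imps_ctl):
    """Two-proportion z statistic for experiment CTR minus control CTR."""
    p_exp = clicks_exp / imps_exp
    p_ctl = clicks_ctl / imps_ctl
    pooled = (clicks_exp + clicks_ctl) / (imps_exp + imps_ctl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / imps_exp + 1 / imps_ctl))
    return (p_exp - p_ctl) / se


def looks_invalid(clicks_exp, imps_exp, clicks_ctl, imps_ctl, z_crit=3.0):
    """Flag an experiment whose fixed-position CTR differs from control."""
    z = ctr_z_statistic(clicks_exp, imps_exp, clicks_ctl, imps_ctl)
    return abs(z) > z_crit


# Hypothetical (clicks, impressions) for one ad shown only in ML1.
control = (1000, 10000)
ranking_experiment = (1010, 10000)   # CTR indistinguishable from control
display_experiment = (1300, 10000)   # CTR shifted with position unchanged
```

Here `looks_invalid(*display_experiment, *control)` flags the second experiment, since a change in click probability with position held fixed suggests the ad was displayed differently.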

Auxiliary Regressors
In order to understand heterogeneity in position effects, we collect a number of additional regressors for our 20,000 mainline combinations. Most importantly, we label each query as either "brand" or "product." Brand queries are specific to one particular company or manufacturer, such as "Samsung smart phone," while product queries are more generic, such as simply "smart phone." Within the class of brand queries, we further differentiate between an "on-brand" advertiser that represents the brand in the query and "off-brand" competitors. On-brand advertisers are identified by matching the brand term in the search query to the web address an ad points to.11 We also use the destination URL that each ad points to in order to obtain additional regressors through Alexa.com.12 These additional regressors are intended to characterize the overall popularity and quality of these websites. Of particular interest are the regressors US Rank, which gives a rank in overall popularity in terms of the number of unique visitors per day, and Bounce Rate, which gives an idea of how frequently visitors to a given website are dissatisfied and leave very quickly.
Table 2 provides a summary of our auxiliary regressors and their correlation matrix is given in the appendix.
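The brand/product labeling described in this section can be sketched as a simple string match between the query and the ad's destination host. This is a hypothetical illustration: the brand dictionary, the function name, and the substring rule are assumptions, and the actual matching procedure may differ.

```python
from urllib.parse import urlparse


def classify_pair(query, destination_url, brand_terms):
    """Label an ad-query pair as 'product', 'on-brand', or 'off-brand'."""
    tokens = query.lower().split()
    # Sorted so the result is deterministic if several brands appear.
    brand = next((b for b in sorted(brand_terms) if b in tokens), None)
    if brand is None:
        return "product"              # generic query with no brand term
    host = urlparse(destination_url).hostname or ""
    return "on-brand" if brand in host else "off-brand"


brands = {"samsung", "nokia"}         # hypothetical brand dictionary
label = classify_pair("Samsung smart phone", "https://www.samsung.com/us/", brands)
```

A Nokia ad on the same query would be labeled "off-brand", and any ad on "smart phone" would be labeled "product".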

Estimation
Our estimation procedure is designed to produce estimates at the mainline pair level. For a given query, each ad could be displayed in any of four mainline positions or in the sidebar. Our parameters of interest are the impact on click probability of moving an advertisement up one slot to position $j$, given by $\beta_j$, and the baseline click-through rate for an ad in position $j+1$, given by $\alpha_{j+1}$. In the next section, we also estimate these measures for click value. In order to allow for an unrestricted shape of the click curve, we will present separate estimates for the impact of each of our four positional goods. For example, for $j = 4$, we estimate the impact of moving an ad from the sidebar to ML4.
We will consider four estimation strategies, starting with OLS, which is likely biased. For OLS, the estimating equation is

OLS:
$$Y_t = \alpha_{j+1} + \beta_j X_t + U_t, \tag{3}$$

where $Y_t$ indicates a click on impression $t$ and $X_t$ indicates that the ad was shown in position $j$ rather than $j+1$. This estimator simply pools all the data for a particular mainline combination. It improves on the summary statistics of the previous section by removing confounds that vary across advertisers, regardless of time, but it is still vulnerable to any time-varying shocks to click probability ($U$) that also correlate with ad position. In particular, one may expect that advertisers are placed in better ad positions during periods of high clickability. This could happen mechanically through the search engine's estimation of quality or if $U$ is positively correlated with changes to an advertiser's bid. In both cases we expect to overestimate $\beta_j$, which we call a steepness bias of the click curve.
To help control for these confounds, we consider a second specification that includes fixed effects at the four-hour level.13 The corresponding equation is

Time FE:
$$Y_t = \alpha_{j+1} + \beta_j X_t + \gamma_{h(t)} + U_t, \tag{4}$$

where $Y_t$ and $X_t$ are the click and position indicators for impression $t$ and $\gamma_{h(t)}$ is a fixed effect for the four-hour window containing $t$. This specification should effectively control for any shocks to time-varying clickability and to an extent mitigate the steepness bias in (3). However, we may still be confounded by endogenous shocks at the geographic level (location of the user is inferred from the IP address). This could come in two forms: (1) competitors geo-target their ads to some locales but not others in an endogenous way or (2) geo-specific estimation of ad quality leads ads to be ranked higher where they are most popular. Intuitively, both of these factors would lead to steepness bias. Thus, we may expect a smaller, but still positive, steepness bias in estimates of equation (4).
While we cannot directly observe these potential geographic confounds, we can still take advantage of our experiments to do robust causal estimation. The simplest approach is to create a vector of dummy variables for the presence of a given impression in each of the 40 relevance clusters 14 in our data. This vector $Z$ is assigned randomly and guaranteed to be orthogonal to all possible confounds at a given point in time. Thus, we consider a Dummy TSLS approach given by a standard TSLS procedure with the estimating equations in (5).

Dummy TSLS:
$$X_t = Z_t'\pi + V_t, \qquad Y_t = \alpha_{j+1} + \beta_j \hat{X}_t + U_t, \tag{5}$$

where $X_t$ indicates that the ad appears in position $j$ rather than $j+1$ on impression $t$, $Y_t$ indicates a click, and $\hat{X}_t$ is the first stage fitted value.

Crucial to the implementation of this approach is that experimental conditions can only be used during time periods for which they place every single impression of a given ad in either position $j$ or $j+1$. That is to say, the population of users is held constant; the ad location varies by one slot at most. In equations (3) and (4) we assumed that the contemporaneous distribution of $U$ would not vary with ad position and it was sufficient to simply eliminate those impressions in which the ad was not placed in position $j$ or $j+1$. Doing so here could compromise the validity of our instruments if the selected sample differs in some systematic way.15 Given these concerns, we eliminate experiments that do not respect this inclusion restriction.
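The contrast between a pooled OLS comparison and an experiment-based IV estimate can be illustrated with a small simulation. This is a stylized sketch, not the paper's data or estimator: there is a single binary experimental flag (so TSLS reduces to the Wald ratio), a single unobserved shock, and every probability below is an invented parameter.

```python
import random


def simulate(n=200_000, beta=0.05, seed=0):
    """Simulate impressions with a latent shock u and a randomized flag z."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5      # unobserved high-clickability state
        z = rng.random() < 0.5      # randomized experimental condition
        # The ad is promoted to the higher slot more often when u or z is on.
        x = rng.random() < 0.1 + 0.3 * u + 0.5 * z
        # Clicks depend on position (causal effect beta) and on the shock.
        y = rng.random() < 0.05 + beta * x + 0.05 * u
        rows.append((int(z), int(x), int(y)))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def ols_slope(rows):
    """With a binary regressor, OLS equals a difference in mean outcomes."""
    return (mean(y for _, x, y in rows if x == 1)
            - mean(y for _, x, y in rows if x == 0))


def wald_iv(rows):
    """TSLS with one binary instrument: reduced form over first stage."""
    reduced = (mean(y for z, _, y in rows if z == 1)
               - mean(y for z, _, y in rows if z == 0))
    first = (mean(x for z, x, _ in rows if z == 1)
             - mean(x for z, x, _ in rows if z == 0))
    return reduced / first


rows = simulate()
ols, iv = ols_slope(rows), wald_iv(rows)
```

Because the shock raises both the chance of a high position and the chance of a click, the OLS slope in this simulation sits above the true effect of 0.05 (a steepness bias), while the randomized flag recovers an estimate close to it.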
Cutting down our data like this, combined with the weak instrument problem, made it very difficult to obtain sufficient first stage relevance using the Dummy TSLS approach.16 To increase power, it is necessary to allow more flexibility in the first stage fit, thereby increasing the weight on instruments when they are "turned on." Given the time-varying strength of our instruments, a logical way to accomplish this is to consider interactions between experimental condition and time as additional first stage regressors.

13 Due to computation issues related to the volume of data, we were not able to perform a fixed effect analysis at any narrower intervals.

14 Recall that each relevance cluster, as defined in Section 3.2, is a collection of experiments that produce identical ad rankings.

15 An alternative approach involves using our instruments to estimate a multidimensional IV model so that experiments that place ads in more than two positions can be used. However, relevance in such settings is a much steeper challenge for our instruments and can complicate identification. See the Appendix for a brief discussion. Various bounds on parameter values and the distribution of unobservables can provide sharper inference; efficiently incorporating these may be a useful direction for future work.
A variety of regularization approaches exist to select a set of instruments for a good first stage fit, but most are designed for a large, but fixed, class of candidate instruments (Belloni et al., 2012;Okui, 2011) and become computationally intensive because the discrete interaction terms make the data set very wide.An ideally suited solution for our setting is to instead generate instruments via a nonparametric smoothing of the conditional distribution of ad position for each experimental condition.These nonparametric IVs effectively "turn on" only for time periods where experiments have a contemporaneous impact on the distribution of ad position.This procedure is developed by Carrasco (2012), who shows it to be optimal for econometric models based on conditional moment restrictions with continuous covariates.
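The idea of instruments that "turn on" only when an experiment moves ad position can be sketched with a simple kernel smoother. The code below is not the estimator of Carrasco (2012); it is a simplified stand-in that smooths, over time, the gap in position probability between one experimental condition and the control, and uses that gap to weight the randomized assignment. The simulated data, the Gaussian kernel, and all parameter values are assumptions made for illustration.

```python
import math
import random


def simulate_days(n_days=60, per_day=5000, on=(20, 40), beta=0.05, seed=0):
    """Experiment shifts ad position only on days in [on[0], on[1])."""
    rng = random.Random(seed)
    rows = []
    for day in range(n_days):
        u = 0.05 * rng.random()                 # daily clickability shock
        shift = 0.5 if on[0] <= day < on[1] else 0.0
        for _ in range(per_day):
            z = int(rng.random() < 0.5)         # randomized condition
            x = int(rng.random() < 0.2 + 4 * u + shift * z)
            y = int(rng.random() < 0.05 + beta * x + u)
            rows.append((day, z, x, y))
    return rows


def smoothed_gap(rows, n_days, bandwidth):
    """Kernel-smoothed difference in P(X=1) between experiment and control."""
    total = [[0.0, 0.0] for _ in range(n_days)]
    count = [[0.0, 0.0] for _ in range(n_days)]
    for day, z, x, _ in rows:
        total[day][z] += x
        count[day][z] += 1
    gap = []
    for t in range(n_days):
        num, den = [0.0, 0.0], [0.0, 0.0]
        for s in range(n_days):
            k = math.exp(-0.5 * ((s - t) / bandwidth) ** 2)  # Gaussian kernel
            for z in (0, 1):
                num[z] += k * total[s][z]
                den[z] += k * count[s][z]
        gap.append(num[1] / den[1] - num[0] / den[0])
    return gap


def iv_estimate(rows, gap, n_days):
    """IV with instrument (z - daily share of z) * gap[day]."""
    share = [0.0] * n_days
    size = [0] * n_days
    for day, z, _, _ in rows:
        share[day] += z
        size[day] += 1
    share = [s / n for s, n in zip(share, size)]
    num = den = 0.0
    for day, z, x, y in rows:
        w = (z - share[day]) * gap[day]   # sums to zero within each day
        num += w * y
        den += w * x
    return num / den


day_rows = simulate_days()
gap_week = smoothed_gap(day_rows, 60, bandwidth=7.0)
gap_flat = smoothed_gap(day_rows, 60, bandwidth=math.inf)  # dummy-style
beta_week = iv_estimate(day_rows, gap_week, 60)
beta_flat = iv_estimate(day_rows, gap_flat, 60)
```

With an infinite bandwidth the gap is constant over time, which mimics a single experiment dummy; with a one-week bandwidth the instrument concentrates weight on the days when the experiment actually shifts position. Both versions target the same effect in this simulation, and the time-focused weights would be expected to give the more precise estimate.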
An example of how the first stage fit of a procedure based on these nonparametric instruments compares to our Dummy TSLS specification is shown in Figure 6. Nonparametric IV improves first stage fit by allowing one to sharpen the focus on natural experiments when they actually occur. Dummy TSLS flattens the experiment across the entire time period under study. This is equivalent to using an infinite bandwidth in the smoothing procedure. For the example shown, this adds noise to the estimates from time periods during early May and July when our experiments did not impact ranking. Formally, the estimating equations for this model are presented in (6).

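Nonparametric IV:
$$X_t = f(Z_t, t) + V_t, \qquad Y_t = \alpha_{j+1} + \beta_j \hat{X}_t + U_t, \tag{6}$$

where $f$ is an unrestricted function and $\hat{X}_t$ is its fitted value for impression $t$.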
The key difference is that the first stage is allowed to be a nonparametric function of both $Z$ and $t$. As mentioned, Dummy TSLS is nested in this approach as it is equivalent to an infinite bandwidth. Bandwidth selection is clearly the key knob for this method. Sending bandwidth to zero results in a "perfect" first stage fit and essentially returns an OLS estimate since the predicted $X$ for the second stage regression is arbitrarily close to the observed $X$. Accordingly, very small bandwidths should be avoided as they generate exactly the kind of endogeneity bias we seek to avoid.17 In choosing bandwidth, we thus took a conservative approach and opted for 1 week for all estimations. This was shown to induce a bias of less than 1% of the treatment effect in our worst case simulations, allowing us to be confident in the inference of causal effects.18

Results: Click Curves

We estimate the four specifications of Section 4 on each of the 20,000 ad-query combinations in our data. Many advertisements did not spend much time in certain positions. For these, our effects of interest are not identified and estimates will not be presented. For many other ads, treatment effects could be estimated in our OLS or Time FE models, but lacked sufficient relevance for identification in the Dummy TSLS or sometimes even the Nonparametric IV model.

Endogeneity of OLS methods
In order to get a baseline sense of the extent to which endogeneity corrupts our more naive estimators, we summarize the estimation results for those queries that could be estimated by all four techniques in Table 3. Treatment effects are summarized for each position by their simple and weighted average. Focusing on the first two rows, we see that, for all positions, the average estimated treatment effect is larger in OLS than in our Time FE method. This is statistically significant and especially true for ads near the top of the page, but is only marginal for ads moving from the sidebar to the last mainline ad. Going from the second to the third and fourth rows, we again see a general decline in average treatment effects, but this time the effect is concentrated on the less popular ads at the bottom of the page. In combination, these results strongly suggest that naive OLS estimation of click curves is significantly biased toward steepness, which is exactly the type of endogeneity one would expect given that GSP mechanics promote ads during periods of high popularity. Furthermore, it is perhaps intuitive that the endogeneity in the more popular ads at the top of the page seems to be driven by time-varying factors (perhaps related to the overall popularity of the brand/site), while endogeneity for ads further down the page is driven by geographic factors.
In order to expand the analysis to more ads, we dropped the Dummy TSLS method, the most restrictive method in terms of coverage, from the comparison. This allows us to average over many additional ad-query pairs that did not meet the first stage relevancy requirement, or had very large standard errors, in the Dummy TSLS analysis. The pattern of results is displayed in Table 4 and is consistent with those found above. The fact that the number of ads estimated triples is a testament to the superior efficiency of the Nonparametric IV approach.

Heterogeneous Click Curves
Given the bias in OLS and noisiness of the Dummy TSLS, we focus attention on the estimates derived from our Nonparametric IV method in order to explore heterogeneity in click curves. First, we look at the causal impact of moving an ad from the second to the first mainline position. Since most of the revenue in this auction comes from the first slot, it forms a natural starting point.
Figure 7a plots the impact on CTR of moving to slot 1 on the y-axis against CTR in the second mainline position on the x-axis. We differentiate between our three types of ads (ads on product queries, on-brand ads, and off-brand ads) and compute inverse variance-weighted GLS lines of best fit for each group. The graphical presentation is motivated by our interest in the hypothesis of separability of the click curve. Separability has the strong implication that all the dots on this graph should be along a ray coming out of the origin, the slope of which defines the (constant) ML1 position effect. Corresponding GLS regressions of these estimated treatment effects onto baseline CTR and our auxiliary regressors are presented in Table 5.
Figure 7a immediately reveals three substantial rejections of the separability assumption. First, our three types of ads have clearly distinct slopes, with off-brand ads having the steepest position effect and on-brand ads having the flattest effect. The overwhelming statistical significance of this difference is confirmed by the different estimates of $\alpha_2$ obtained in specification (1) of Table 5. Second, the GLS line for on-brand ads does not appear to go through the origin, but rather has a positive intercept of around 7%. This means that on-brand ads with low baseline clickability experience a greater multiplicative return to position than those with higher baselines. The statistical significance here can be checked by the non-zero constant estimated in specification (2) of Table 5 for on-brand ads.
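As a rough illustration of the separability check described above, the sketch below fits an inverse-variance weighted line of each ad's estimated ML1 treatment effect on its baseline CTR in ML2. Under multiplicative separability the fitted line should pass through the origin with a single common slope. The function name `weighted_line_fit` and all numbers are hypothetical, and the simple inverse-variance weighting only stands in for the GLS regressions of Table 5 rather than reproducing them.

```python
def weighted_line_fit(x, y, se):
    """Inverse-variance weighted least squares fit of y = a + b * x.

    x  : baseline CTR in ML2 for each ad
    y  : estimated change in CTR from moving ML2 -> ML1
    se : standard error of each estimate in y
    Returns (intercept a, slope b).
    """
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical data, not taken from the paper.
baseline = [0.05, 0.10, 0.20, 0.40]
se = [0.01, 0.01, 0.02, 0.02]
separable = [0.30 * c for c in baseline]        # points on a ray through the origin
shifted = [0.07 + 0.10 * c for c in baseline]   # positive intercept, flatter slope

a_sep, b_sep = weighted_line_fit(baseline, separable, se)
a_shift, b_shift = weighted_line_fit(baseline, shifted, se)
print(f"separable: intercept={a_sep:.3f}, slope={b_sep:.3f}")
print(f"shifted:   intercept={a_shift:.3f}, slope={b_shift:.3f}")
```

With a positive intercept a, the implied multiplicative effect y/x = a/x + b falls as baseline CTR rises, which is the pattern described for on-brand ads.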
A final rejection of separability is that even within ad type, there is substantial, statistically significant heterogeneity in the slope of click curves.This is evident from Figure 7b, which presents a scatterplot of the estimated multiplicative position effect for each ad against the corresponding standard error.The dashed black lines start from the global average and have a slope of ±1.96, so they represent a 95% confidence interval around the group mean.Multiplicative separability implies that no more than 5% of dots should be observed outside these lines.
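The funnel-style check behind Figure 7b can be sketched as follows: count the share of ad-level estimates that lie more than 1.96 standard errors from the group average. This is only a minimal sketch. The function name `share_outside_band`, the toy numbers, and the use of a precision-weighted group mean are assumptions, and the sampling error of the group mean itself is ignored.

```python
def share_outside_band(effects, std_errors, z=1.96):
    """Share of estimates farther than z standard errors from the group mean.

    The group mean is taken to be the inverse-variance weighted average
    (an assumption made for this illustration).
    Returns (share outside the band, group mean).
    """
    w = [1.0 / s ** 2 for s in std_errors]
    mean = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    outside = sum(abs(e - mean) > z * s for e, s in zip(effects, std_errors))
    return outside / len(effects), mean


# Hypothetical multiplicative position effects and their standard errors.
effects = [1.30, 1.32, 1.28, 1.90, 1.31]
std_errors = [0.05, 0.04, 0.05, 0.10, 0.03]

share, mean = share_outside_band(effects, std_errors)
print(f"group mean={mean:.3f}, share outside band={share:.2f}")
```

Under homogeneity, roughly 5% of estimates should fall outside the band; a materially larger share points to heterogeneity beyond sampling noise.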
Additional clues as to what is causing the excess heterogeneity can be gleaned from specification 3 in Table 5. All else equal, the ML1 treatment effect is smaller for product ads and off-brand ads that have a low website rank (meaning they are more popular). ML1 position and the popularity of a website thus work as substitutes for the purposes of attracting more clicks for these types of ads. Additionally, the position effect for ads on product queries was shown to be larger for ads with a low Bounce Rate. A lower Bounce Rate indicates fewer dissatisfied visitors and is generally regarded as an indicator of higher website quality. Interestingly, this means that for product queries the ads that benefit the most from the ML1 position are those that are not very popular but tend to generate high levels of user interest based on engagement metrics from Alexa.com.
We now replicate the above analysis for positions 2-4. As previously noted, on-brand ads were rarely observed in lower positions on the page and thus we could not identify the rest of the click curve for these types of ads. As such, our analysis for the remaining positions will focus only on ads on product queries and off-brand ads on brand queries. Figure 8 collects scatterplots of these treatment effects for each of these 3 positions in analogy to the results of Figure 7; auxiliary GLS regressions are in the Appendix.
The results generally adhere more closely to separability, but still have significant violations. Most obviously, for all three positions, off-brand ads again have significantly steeper position effects than ads on product queries. Also within each category, we still see statistically significant heterogeneity for each treatment effect as represented by the excess dispersion of dots in the right panel of each row. The "most separable" treatment effect is for ads moving from ML4 to ML3. Here the average ad had a 24% multiplicative bonus and 238 out of the 264 estimates with standard errors < 0.7 were statistically indistinct from the group level average. However, this is still only 90.15% and allows us to reject a null of no heterogeneity with p = .0017 (although we note the observed differences are much less economically meaningful than for the other positions).
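The coverage comparison for the ML4 to ML3 effect can be illustrated with an exact binomial tail. The sketch below asks how likely it is to see at least 26 of 264 estimates outside a nominal 95% band if each falls outside independently with probability 5%. The helper `binom_upper_tail` is hypothetical, and because the text does not state which test produced the reported p-value, this one-sided exact tail is an assumption and need not match it.

```python
from math import comb


def binom_upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


n_estimates = 264                     # estimates with standard errors < 0.7
n_inside = 238                        # statistically indistinct from the group mean
n_outside = n_estimates - n_inside    # 26 estimates outside the band

share_inside = n_inside / n_estimates
p_upper = binom_upper_tail(n_estimates, n_outside, 0.05)

print(f"share inside band: {share_inside:.4f}")
print(f"P(at least {n_outside} outside | 5% rate) = {p_upper:.5f}")
```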
Figure 9: Average position effect by ad type. On-brand ads were very rarely observed outside of the first two positions. Reliable estimates of their full click curve were difficult to obtain and are not presented.
Finally, to visualize the differences we aggregate the results to graphically show the average click curve for our three types of ads. These are displayed in Figure 9. This summarizes the data ignoring within-class heterogeneity and could be a useful approximation for a search engine to employ. Placed on the same plot, the large degree of heterogeneity in click curves is stark.

Results: Click Value
We now turn our attention to estimating the impact of ad position on the value of a click. We'll use the measure of mapped dwell time, defined in Section 3.1, as our primary outcome metric. Recall that this metric is defined in order to be representative of proportional changes to conversion rate and thus click value. This gives it an easy interpretation that is directly relevant to an advertiser's decision to change their bid in order to seek (or avoid) a given position, namely the percentage change in the average value of a click. Further, the definition of a conversion varies arbitrarily across advertisers (they are free to define it any way they wish). By contrast, this measure facilitates fair cross-advertiser comparisons.
The results for positions 1 and 2 are summarized for all four methods in the first 4 columns of Table 6. Since we only observe value metrics conditional on a click, we cannot estimate the IV methods for positions 3 and 4 (hence the "n/a's" in the table) since CTR tends to be low in these positions. Focusing on the top position, the non-parametric IV weighted mean indicates that the impact of moving from slot 2 to slot 1 is a 1.7% increase in conversion rate over an advertiser's baseline in slot 2. The standard error is 0.50, meaning we can tightly bound the mean effect to be < 3%. So while we do document a positive impact on click value of moving up the page, the magnitude is very small. For ML2, the results are not statistically significant but again can be bounded near zero.
We examine the impact of moving from the sidebar to the mainline ads in the final 4 columns of Table 6, but have to rely on OLS estimates, which are expected to have positive bias. We find a significantly negative impact of moving to the mainline. Comparing the ML4 bonus to the other three positions, we see that it is the only one to show an economically meaningful departure from zero. Given that sidebar ads are simply those ranked 5th or higher, we have no reason to expect a dramatically different endogeneity bias for these estimates and feel safe concluding this is representative of a causal effect. A simple explanation is that this result reflects the changing population of clickers when an ad moves to the mainline. Combined with the very low CTR on sidebar ads, this may indicate that users who have gone to the trouble of finding a sidebar ad simply have a higher purchasing intent. However, since this result relies on OLS, there is naturally some uncertainty remaining.
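The bound on the ML1 click value effect follows from simple interval arithmetic, sketched below. The point estimate and standard error are the figures quoted in the text; the two-sided 95% normal critical value of 1.96 is an assumption about how the bound was formed.

```python
point_estimate = 1.7   # percent change in click value, ML2 -> ML1 (from the text)
std_error = 0.50       # percent (from the text)
z = 1.96               # assumed two-sided 95% normal critical value

lower = point_estimate - z * std_error
upper = point_estimate + z * std_error
print(f"approximate 95% interval: [{lower:.2f}%, {upper:.2f}%]")
```

The upper end of this interval sits below 3%, consistent with the bound stated above.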

Discussion and Conclusion
There has been extensive research on the revenue properties of the GSP in sponsored search. The seminal papers of Edelman et al. (2007) and Varian (2007) show clearly that under certain assumptions, all envy-free equilibria of the GSP obtain at least the revenue of the VCG mechanism. Chief among these assumptions are the separability of the click curve and the homogeneity of click value. These have been tested empirically in recent papers (Agarwal et al., 2011; Ghose and Yang, 2009; Narayanan and Kalyanam, 2014), each focusing on a single advertiser. These papers reach contrasting conclusions on the issue of click value and, due to their limited sample, could not provide strong evidence on separability. Our paper provides robust causal inference on a wide cross-section of advertisers to test both assumptions at scale. Our results provide valuable reassurance for the homogeneity of click value, but strongly reject separability of the click curve, and with it, many desirable theoretical properties of the current GSP approach to position auctions.
It is immediately clear that this false presumption of separability can induce significant costs. As a brief exercise, consider two otherwise-identical advertisers (B for brand and O for off-brand) competing for the first mainline slot and let us normalize their click curves to be one in the second position, $\mu^O_2 = \mu^B_2 = 1$. Recall that in Specification (2) of Table 5, we estimated that the average off-brand advertiser has a multiplicative bonus of 62% when shifting to the first mainline position ($\mu^O_1 = 1.62$), while the average on-brand advertiser only gets a 30% bonus ($\mu^B_1 = 1.3$) from being in the top slot. Supposing a knife edge case, in which the GSP barely decides to rank the second ad on top, we see that combined surplus from the top two slots could be as much as 1 less than would be achieved in an efficient allocation. The exact loss of revenue would depend on the rank score of the third place bidder, but could range anywhere from the estimated 13% to as much as 19%, if the third ranked bidder has a rank score of 0. So even if we ignore all click curve heterogeneity beyond just brand vs. off-brand averages, as much as 19% of revenue could be lost on some queries by improperly placing on-brand ads in the first mainline position over better suited, off-brand competitors. Expanding the analysis to include all four slots, account for additional elements of heterogeneity, or allow for strategic bidding based on these misallocations could lead to a greater calculated revenue impact.
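The surplus side of this exercise can be worked through numerically. The sketch below uses the normalized click curves from the text and assumes, purely for illustration, that both advertisers have the same value per click (set to 1). It compares only the combined surplus of the two orderings and does not attempt to reproduce the revenue range, which depends on pricing and on the third bidder's rank score.

```python
# Click curves normalized to 1 in the second mainline slot (from the text).
MU = {
    "off_brand": {1: 1.62, 2: 1.0},
    "on_brand": {1: 1.30, 2: 1.0},
}


def surplus(top, second, value_per_click=1.0):
    """Combined surplus from the top two slots for a given ordering.

    value_per_click is a hypothetical common value: the two advertisers
    are assumed identical apart from their click curves.
    """
    return value_per_click * (MU[top][1] + MU[second][2])


efficient = surplus("off_brand", "on_brand")      # off-brand ad in ML1
misallocated = surplus("on_brand", "off_brand")   # on-brand ad in ML1
loss = efficient - misallocated

print(f"efficient={efficient:.2f}, misallocated={misallocated:.2f}")
print(f"loss={loss:.2f} ({loss / efficient:.1%} of efficient surplus)")
```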
Eliminating these losses requires the incorporation of heterogeneous click curves into the mechanism. For this, there are two hurdles. The first is an empirical challenge. This is precisely what the methodology developed in this paper was designed to address. Our analysis has identified ad-query type, site popularity and quality as important dimensions along which the shape of click curves may vary, but also has uncovered substantial unexplained heterogeneity. As such, efficient, real-time estimation of an individual advertiser's click curve should be cast in the framework of a hierarchical Bayesian model. This could efficiently combine information on the population distribution of click curves with estimates for individual advertisers that can be gleaned from ongoing experimentation.
The second challenge is a theoretical one. All current equilibrium analysis of which we are aware presumes separability of the click curve. However, even without further theoretical inquiry, we see two options for improvement. One approach is to modify the GSP by simply incorporating estimated heterogeneity in the click curves into the calculation of an advertiser's quality. Current practice defines each advertiser's quality as their estimated probability of receiving a click in the first mainline position. This is calculated by deflating each advertiser's observed click through rates by a common set of global position effects (like those presented in Figure 2) and results in exactly the type of misallocation discussed above. But, if each advertiser's click curve were known, then quality could be estimated without bias and we could be confident that the first mainline position, at least, was being allocated correctly. This approach is straightforward, would not require a fundamental change in the structure of the auction and could build directly off the estimates presented herein.
However, this approach still cannot eliminate the possibility of misallocation (and a corresponding misalignment of advertiser incentives) of slots further down the page. In fact, since any type of GSP auction must be organized around a scalar-valued index of bidder quality, it can never be certain to efficiently allocate more than one prize. Only a shift to a VCG mechanism could fully incorporate heterogeneously estimated click curves in a way that ensured efficient allocation of (and incentive compatible bidding for) all four mainline slots.
More broadly, variants of the GSP are employed to allocate advertising space by many of the most popular search engines and online content providers. Their adoption, within a contingent bid framework in which advertisers pay for user action (usually clicks), makes the auctioneer the residual claimant on these actions. This could be because the auctioneer presumes a greater familiarity with the impression-to-clicks mapping on their platform and creates a competitive advantage over offline advertising where such action-based payment schemes are not possible. Yet we must conclude by sounding a note of caution. Such a centralized system (as opposed to simply selling impressions and letting advertisers work out their own payoff relevant factors) requires a careful understanding of the empirical realities of the auction. As demonstrated here, heterogeneity in how individual advertisers interact with the platform may need to be factored into the mechanism if efficient allocation and incentive compatibility of bidding are to be maintained. Additionally, even if all parameters are perfectly estimated, a GSP-style auction cannot be sure to achieve efficient allocation, or VCG equivalence in revenue, if multiple goods are sold simultaneously to advertisers with non-scalar heterogeneity.
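One very simple way to see what combining population-level and advertiser-level information could look like is a normal-normal shrinkage rule, sketched below. This is not the hierarchical model proposed above, only its most basic building block: the function name `shrink`, the prior standard deviation, and the individual estimates are all hypothetical, and a full hierarchical treatment would also estimate the population parameters from the data.

```python
def shrink(estimate, std_error, prior_mean, prior_sd):
    """Posterior mean and sd of one advertiser's position effect.

    Assumes a normal population distribution (prior_mean, prior_sd) and a
    normally distributed advertiser-level estimate with known std_error.
    """
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, post_var ** 0.5


# Population mean 1.62 is the off-brand ML1 average quoted in the text;
# the prior sd of 0.20 and both advertiser estimates are hypothetical.
noisy = shrink(estimate=2.40, std_error=0.40, prior_mean=1.62, prior_sd=0.20)
precise = shrink(estimate=2.40, std_error=0.05, prior_mean=1.62, prior_sd=0.20)

print(f"noisy estimate   -> posterior mean {noisy[0]:.3f}, sd {noisy[1]:.3f}")
print(f"precise estimate -> posterior mean {precise[0]:.3f}, sd {precise[1]:.3f}")
```

A noisy advertiser-level estimate is pulled strongly toward the population mean, while a precise one stays close to its own value, which is the behavior one would want from real-time click curve estimation with sparse experimental data.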
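The bias from applying a common position effect can be sketched with a small calculation. Suppose an ad is observed only in ML2 and its ML1 click probability is inferred by scaling with a position multiplier. The common multiplier of 1.45 and the observed CTR of 0.10 are hypothetical; the type-specific multipliers are the on-brand and off-brand averages quoted earlier in this section. The helper `implied_ml1_ctr` is an illustrative name, not a description of any production system.

```python
def implied_ml1_ctr(observed_ctr_ml2, ml1_multiplier):
    """Scale a CTR observed in ML2 to its implied CTR in ML1."""
    return observed_ctr_ml2 * ml1_multiplier


GLOBAL_MULTIPLIER = 1.45   # hypothetical common ML2 -> ML1 position effect
TYPE_MULTIPLIER = {"on_brand": 1.30, "off_brand": 1.62}   # averages from the text

observed_ctr_ml2 = 0.10    # hypothetical CTR observed in the second slot
for ad_type, multiplier in TYPE_MULTIPLIER.items():
    common = implied_ml1_ctr(observed_ctr_ml2, GLOBAL_MULTIPLIER)
    specific = implied_ml1_ctr(observed_ctr_ml2, multiplier)
    print(f"{ad_type}: common={common:.3f}, type-specific={specific:.3f}, "
          f"bias={common - specific:+.3f}")
```

In this toy case the common multiplier overstates the ML1 quality of the on-brand ad and understates that of the off-brand ad, which is the direction of misallocation discussed above.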
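A toy example makes the scalar-index limitation concrete. The sketch below compares the allocation obtained by ranking ads on a single index (value times ML1 click probability) with the surplus-maximizing assignment found by searching over all orderings. All click probabilities and values are hypothetical, values are used in place of bids for simplicity, and only the allocation step is shown: a VCG mechanism would additionally require computing payments.

```python
from itertools import permutations

# Hypothetical click probabilities in (ML1, ML2, ML3) for three ads;
# none of these numbers come from the paper.
CTR = {
    "A": (0.30, 0.10, 0.09),
    "B": (0.20, 0.08, 0.07),
    "C": (0.15, 0.14, 0.02),
}
VALUE = {"A": 1.0, "B": 1.0, "C": 1.0}   # hypothetical value per click


def surplus(order):
    """Total expected value when order[k] is shown in slot k."""
    return sum(VALUE[ad] * CTR[ad][slot] for slot, ad in enumerate(order))


# Scalar-index ranking: sort by value times ML1 click probability.
scalar_order = tuple(
    sorted(CTR, key=lambda ad: VALUE[ad] * CTR[ad][0], reverse=True)
)
# Efficient allocation: brute-force search over every assignment of ads to slots.
efficient_order = max(permutations(CTR), key=surplus)

print(f"scalar index: {scalar_order}, surplus={surplus(scalar_order):.2f}")
print(f"efficient:    {efficient_order}, surplus={surplus(efficient_order):.2f}")
```

Both allocations agree on the top slot, but the scalar ranking misplaces the lower slots because a single index cannot encode how each ad's click probability decays down the page.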

Primary and Secondary Clicks
The first click a user makes on an advertisement on a given search page is referred to as primary. Often users return to the search page after clicking on an advertisement, thus affording them the opportunity to make secondary clicks, each of which is charged to the appropriate advertiser. As demonstrated in Figure 10, however, this is relatively rare. It is also interesting to note that these secondary clicks appear to be considerably less valuable, as shown in Table 8.

Heterogeneous Click Through Rates
The histograms in Figures 11 and 12 demonstrate the heterogeneity of click through rates for ads shown commonly in each mainline position. These distributions are purely observational and have no causal interpretation.

Alexa Stats
Table 9 demonstrates some correlations between our different website level metrics from Alexa.com.

Heterogeneous Click Curves with OLS Estimates
We now replicate the analysis of Section 4.2 with our OLS estimates derived from equation (3).
Results are presented in Tables 12 and 13.

Experimental Frequencies
As discussed in Section 2.2, our data is divided up into 40 distinct valid relevance clusters that differentially impact ranking. Each of these relevance clusters is composed of multiple Bing experiments. A histogram of their frequencies is presented in Figure 15.

Figure 1: An example results page for a popular commercial query. The top 4 listings are mainline advertisements. Below these are the unpaid (algorithmic) results generated by the Bing search engine. Sidebar ads are much smaller and displayed further down the rightmost column of the page.
it allows the φ function to take the interpretation of a percentage change in conversion probability and thus click value.The estimated value of φ is presented by the thick red line in the right panel of Figure 3 alongside a histogram of the distribution of pooled dwell times (left panel) for scale.

Figure 3: The unconditional distribution of dwell time (left panel) and an estimate of the multiplicative mapping from dwell time to click value (right panel). The blue dots in the right panel represent normalized conversion probability for the twenty ads in our data that convert the most often.

Figure 4: Local constant estimation of the average click probability over time for one particular ad in the control (green) and two other experimental conditions.
Figure 5(a) shows the pooled distribution of each type of mainline pair across position. On-brand advertisers occupy the top slot with a very high frequency. Figure 5(b) plots the corresponding pooled CTRs by position for each type. These estimates have no causal interpretation of position; rather, they show that brand queries in general, and on-brand advertisers in particular, tend to get significantly more ad clicks than ads displayed on product queries.

Figure 5: Pooled distribution of ad impressions (left) and observed click through rates (right) for ads on product queries, on-brand ads on brand queries, and off-brand ads on brand queries.

Figure 6: First stage fits generated by our Dummy TSLS and Nonparametric IV specifications for a hypothetical ad.

Figure 7: Heterogeneous position effects for ML1. The left panel plots the treatment effect of moving from the second slot (ML2) to the first slot (ML1) for each advertiser against baseline clickability in ML2. The right panel scatters the implied multiplicative effect for each ad against the corresponding standard error. Dots outside the black lines are those that are statistically different from the group level average at the 5% level.

Figure 8: Heterogeneous position effects for ML2-ML4. Separability was only close to holding in the case of the Mainline 3 position effect. Here 238/264 = .9015 of the dots were within the black lines. This fraction is distinct from .95 with p = .0017.

Figure 10: Distribution of clicks per impression.

Figure 11: Histogram of average CTR at the advertisement level for positions ML1 and ML2. Ads are only included if they have at least 3,00 impressions in that position.

Figure 12: Histogram of average CTR at the advertisement level for positions ML3 and ML4. Ads are only included if they have at least 3,00 impressions in that position.

Figure 13: Position effect of moving from the second slot (ML2) to the first slot (ML1) by advertiser. The x-axis gives the average CTR over ML1 and ML2. In Panel 1 we show OLS estimates, which are biased. Panel 2 gives the lower IV estimates, which uncover a diminishing return to position for dominant ads.

Figure 14: Position effect of moving from the second slot (ML2) to the first slot (ML1) by advertiser. The x-axis gives the average CTR over ML1 and ML2. In Panel 1 we show OLS estimates, which are biased. Panel 2 gives the lower IV estimates, which uncover a diminishing return to position for dominant ads.

Table 1: Average user behavior by ad position. Dwell times are averaged conditional on being reported.

Table 2: Summary statistics

Table 3: Results summary on mainline pairs that could be estimated by all four methods. Aggregates for the sample of ads in which the Dummy TSLS model has standard errors < 5%. ††Weights chosen identically across estimations and to be most friendly to IV.

Table 4: Results summary on mainline pairs that could be estimated by our OLS and Nonparametric IV methods. Aggregates for the sample of ads in which the Nonparametric IV model has standard errors < 5%.

Table 8: User engagement on primary and secondary clicks