1 Introduction

The primary goal of Information Retrieval (IR) systems is to retrieve highly relevant documents for a user's search query. Judges determine document relevance by assessing the topical overlap between a document's content and the user's information need. To facilitate repeated and reliable evaluation of IR systems, trained judges are asked to evaluate several documents (mainly shown on desktop computers) with respect to a query to produce large-scale test collections (such as those produced at TREC and NTCIR). These collections are created with the following assumptions: (1) document rendering is device agnostic, i.e., a document appears the same whether it is viewed on a mobile device or a desktop; for example, the font size of text and headings and the resolution of images remain unchanged as the screen size changes. (2) Document content is device agnostic, i.e., text that is displayed on a desktop is also visible, for instance, on a mobile device. Note that while the former assumes that the document layout remains unchanged, the latter assumes that its content (for example, the number of words, headings or paragraphs) remains the same across devices.

While this evaluation mechanism is robust, it is greatly challenged by the explosion of devices with myriad resolutions, which requires pages to be optimized for different resolutions. A popular website today has at least two views, one optimized for traditional desktops and the other tuned for mobiles or tablets, i.e., devices with smaller screens. The small screen size limits both the layout and the amount of content visible to the user at any given time. The continuous advancement of browsers and markup languages exacerbates this problem, as web developers today can collapse a traditionally multi-page website into a single web page with several sections. The same page can be optimized for both mobiles and desktops with one style sheet and minimal changes to the HTML. Thus, with today's technology, a user may see separate versions of the same website on desktop and mobile, which in turn may greatly impact her judgment of the page with respect to a query.

To illustrate this further, Fig. 1 shows four web pages with their respective queries on mobile as well as desktop. The web pages in Fig. 1a and b are relevant to the query and have been optimized for mobiles; judges in our study also marked these pages relevant on both mobile and desktop. However, the web pages in Fig. 1c and d are not suitable for mobile screens. In the case of Fig. 1c, the whole page loads in the viewport, which in turn makes it extremely hard to read. Figure 1d has more ads than relevant content in the viewport, which prompts judges to assign lower relevance on mobile.

Fig. 1. Sample queries and resulting web pages on desktop and mobile screens

Thus, it needs to be determined whether the device on which a document is rendered influences its evaluation with respect to a query. While some work [7] compares user search behavior on mobiles and desktops, we know little about how users judge pages on these two mediums and whether there is any significant difference in judging time or in the relevance labels obtained. We need to determine whether page rendering has any impact on judgments, i.e., whether different web page layouts (on mobile or desktop) translate into different relevance labels. We also need to verify whether viewport-specific signals can be used to determine page relevance. If these signals are useful, mobile-specific relevance could be determined using a classifier, which in turn would reduce the overhead of obtaining manual judgments.

In this work we investigate the problems outlined above. We collect and compare crowdsourced judgments of query-url pairs on mobile and desktop, and report the differences in agreement, judging time and relevance labels. We also propose novel viewport-oriented features, use them to predict page relevance on both mobile and desktop, and analyze which features are strong signals of relevance. Our study shows that there are certain differences between mobile and desktop judgments. We also observe different judging times, despite similar inter-rater agreement on both devices. On mobiles, we further observe a correlation between viewport features and the relevance grades assigned by judges.

The remainder of the paper is organized as follows. We briefly cover related work in Sect. 2. Section 3 describes the crowdsourced data collection and the comparison of judgments on mobiles and desktops. We describe the features and the results of relevance prediction in Sect. 4. We summarize our findings and discuss future work in Sect. 5.

2 Related Work

While there exists a large body of work that identifies factors affecting relevance on desktop, only a fraction characterizes search behavior or evaluates search engine result pages for user interaction on mobiles. We briefly discuss factors important for judging page relevance on desktop and contrast our work with existing mobile search studies.

Schamber et al. [13] concluded that relevance is a multidimensional concept that is affected by both internal and external factors. Since then, several studies have investigated and evaluated the factors that constitute relevance. For instance, Xu et al. [18] and Zhang et al. [20] explored the factors employed by users to determine page relevance. They studied the impact of topicality, novelty, reliability, understandability and scope on relevance, and found topicality and novelty to be the most important relevance criteria. Borlund et al. [1] have shown that as a search progresses, the structure and understandability of a document become important in determining relevance. Our work is quite different, as we do not ask users for explicit judgments of the above-mentioned factors. We compare relevance judgments obtained for the same query-url pairs on mobile and desktop; our primary focus is the difference in judging patterns on the two mediums.

There is some work on mining large-scale user search behavior in the wild. Several studies [3, 5, 6, 8–10, 17, 19] report differences in search patterns across devices. For instance, Kamwar et al. [8] compare searches across computers and mobiles and conclude that smart phones are treated as extensions of users' computers; they suggest that mobiles would benefit from integration with computer-based search interfaces. These studies found mobile queries to be short (2.3–2.5 terms) and query reformulation rates to be high. One key result of Karen et al. [4] was that the conventional desktop-based approach did not receive any click for almost 90 % of searches, which they suggest may be due to unsatisfactory search results. Song et al. [15] study mobile search patterns on three devices: mobile, desktop and tablet. Their study emphasizes that, due to significant differences between user search patterns on these platforms, using the same web page ranking methodology is not optimal. They propose a framework to transfer knowledge from desktop search so that search relevance on mobile and tablet can be improved. In contrast, we train models for relevance prediction rather than search result ranking.

Other work includes abandonment prediction on mobile and desktop [12], query reformulation on mobile [14] and understanding mobile search intents [4]. Buchanan et al. [2] propose ground rules for designing web interfaces for mobile. All of the above mobile-related studies focus on search behavior, not on what constitutes page relevance on small screens. In this work our focus is not to study search behavior but to compare relevance judgments for the same set of pages on different devices.

3 Mobile and Desktop Judgments

To understand whether a user's device has any effect on relevance, we first collect judgments via crowdsourcing. We begin by describing the query-url pairs, the judging interfaces and the crowdsourcing experiment in detail. We use query-url pairs from Guo et al. [7], where users were asked to perform seven search tasks similar to the regular mobile search tasks from [4]. Their study collected 393 unique page views associated with explicit judgments. We filtered out broken urls and search result pages, and also removed queries and corresponding pages that were temporal. The first author manually found the corresponding desktop urls for the remaining urls. In total, we obtained crowdsourced judgments on desktop and mobile for 236 query-url pairs. We built two simple interfaces, one for desktop-oriented judgments and the other for mobile-specific judgments, and used Amazon Mechanical Turk (MTurk) to host two sets of HITs, one for each interface. Each interface had a concise description of the different relevance grades. We asked judges to rate each query-url pair on a scale of 1 to 4, with 4 being ‘highly-relevant’, 3 ‘relevant’, 2 ‘somewhat-relevant’ and 1 ‘non-relevant’. Each HIT paid $0.03. We collected three judgments per query-url pair and ensured that query-url pairs were shown in random order to avoid biasing the judge. We determined each judge's browser type (and device) using JavaScript, and only judgments performed on Android or iOS phones are used in our analysis. To help filter malicious workers, we restricted our HITs to workers with an acceptance rate of 95 % or greater and, to ensure English language proficiency, to workers in the US. In total we collected 708 judgments from each interface. The desktop judgment HITs were submitted by 41 workers and the mobile judgment HITs were completed by 28 workers on MTurk.

Fig. 2. Relevance grade vs. judging time

The final grade of each pair was obtained by taking the majority of the three labels. We also group the relevance labels to form binary judgments from the 4-point judgments. The label distribution is as follows:

  • Desktop: High-rel=108, rel=37, some-rel=47, non-rel=44

  • Mobile: High-rel=86, rel=55, some-rel=64, non-rel=31
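A minimal sketch of the aggregation described above is given below. The binarization threshold is our own assumption for illustration (the paper's exact grouping is given in its footnote, which is not reproduced here).

```python
# Minimal sketch: aggregate the three crowd labels per query-url pair by
# majority vote and binarize the 1-4 grades. The threshold of 3 (grades 3-4
# count as relevant) is an assumption for illustration.
from collections import Counter

def majority_grade(labels):
    """Return the most common of the three 1-4 grades; ties fall back to the median."""
    grade, freq = Counter(labels).most_common(1)[0]
    if freq == 1:  # all three judges disagree
        return sorted(labels)[1]
    return grade

def binarize(grade, threshold=3):
    """Map a 1-4 grade to binary relevance (1 = relevant, 0 = non-relevant)."""
    return int(grade >= threshold)

# Example: three judgments for one query-url pair
g = majority_grade([4, 3, 4])
print(g, binarize(g))  # -> 4 1
```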

The inter-rater agreement (Fleiss' kappa) for desktop judgments was 0.28 (fair) on the 4-point scale and 0.42 (moderate) for binary grades. Similarly, the inter-rater agreement for mobile judgments was 0.33 (fair) on the 4-point scale and 0.53 (moderate) for binary grades. The agreement rate is comparable to that observed in previous relevance crowdsourcing studies [11]. However, Cohen's kappa between the majority desktop and mobile relevance grades is only 0.127 (slight), indicating that judgments obtained on mobiles may differ greatly from those obtained on desktop. Kendall's tau is also low, only 0.114 (p-value = 0.01), suggesting that the judging device influences judges.
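For illustration, these agreement statistics can be computed with standard libraries as in the sketch below; the paper does not report its tooling, and the toy arrays merely stand in for the actual judgment matrices.

```python
# Sketch of the agreement statistics reported above, using standard libraries.
# The data layout (one row per query-url pair, one column per judge) is assumed.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: three 1-4 grades per query-url pair on one interface
desktop_labels = np.array([[4, 3, 4], [1, 2, 1], [3, 3, 2]])

# Fleiss' kappa over the three judges per query-url pair
table, _ = aggregate_raters(desktop_labels)
print("Fleiss kappa:", fleiss_kappa(table))

# Cohen's kappa and Kendall's tau between majority desktop and mobile grades
desktop_majority = np.array([4, 1, 3])
mobile_majority = np.array([2, 1, 3])
print("Cohen kappa:", cohen_kappa_score(desktop_majority, mobile_majority))
print("Kendall tau:", kendalltau(desktop_majority, mobile_majority))
```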

Boxplots of the relevance grades assigned by crowd judges and their judging time on each interface are shown in Fig. 2. The average time (red squares) crowd judges spent labeling a highly relevant, relevant, somewhat relevant and non-relevant page on desktop was 88 s, 159 s, 142 s and 197 s respectively, while the average mobile judging time for the same grades was 65 s, 51 s, 37 s and 48 s respectively. The plots show two interesting judging trends. First, judges take longer to decide non-relevance on desktop than on mobile. This may be due to several reasons. If a web page is not optimized for mobiles, it may be inherently difficult to find the required information, and judges perhaps do not spend time zooming and pinching if the information is not readily available in the viewport. It could also be a result of interaction fatigue: in the beginning, judges may thoroughly judge each page, but due to the limited interaction on mobiles they grow impatient as time passes and spend increasingly less time evaluating each page, thus giving up more quickly than a desktop judge. For optimized pages, the smaller viewport on mobile allows judges to quickly decide whether the web page is relevant; for example, web pages with irrelevant ads (Fig. 1d) can be quickly marked as non-relevant. Second, judges spend more time analyzing highly relevant pages on mobile and relevant pages on desktop. This is perhaps because consuming a page on mobile takes longer: with limited information in the viewport, the user has to tap, zoom or scroll several times to read the entire document.

Fig. 3. Time and label comparison on mobile and desktop

Figure 3a shows the distribution of average judging time for documents on mobile and desktop. Since three assessors judged each document, we plot the mean judging time per document. While a large fraction of pages are judged in under 100 s on both mobile and desktop, the remaining documents take longer on both mediums. This is perhaps a result of the time it takes to judge non-relevance and relevance on desktop and mobile: it may take longer to judge relevant documents on mobile and irrelevant documents on desktop. Figure 3b depicts the distribution of majority relevance grades on mobile and desktop. We see that several documents marked highly relevant on desktop are actually marked non-relevant on mobile. Again, this is due to the reasons (e.g. a viewport full of ads, small fonts, etc.) mentioned above.

4 Relevance Prediction

Relevance prediction is a standard but crucial problem in Information Retrieval. Several signals are computed today to determine page relevance with respect to a user query; however, our goal is not to test or compare existing features. Our primary focus is to determine whether viewport- and content-specific features show different trends for relevance prediction on mobile and desktop. Given the small number of non-relevant pages on the 1–4 scale, we predict relevance on a binary scale. We use several combinations of features to train an AdaBoost classifier [21]. Given that our dataset is small, we perform 10-fold cross-validation and report the average precision, recall and F1-score across the 10 folds for mobile and desktop. We use a paired t-test to assess statistical significance. We begin by describing the features used to predict relevance in the following subsection.
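As a concrete illustration, the sketch below shows one way this evaluation protocol could be implemented with scikit-learn's AdaBoost; the paper does not specify its toolkit, and X and y are placeholders for a feature matrix and binary labels.

```python
# Minimal sketch of the evaluation protocol: AdaBoost with 10-fold CV, mean
# precision/recall/F1 across folds, and a paired t-test between two models.
# Assumes the scikit-learn implementation; X and y are placeholders.
from scipy.stats import ttest_rel
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_validate

def evaluate(X, y, seed=0):
    """10-fold CV returning the average precision, recall, F1 and accuracy."""
    clf = AdaBoostClassifier(random_state=seed)
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=("precision", "recall", "f1", "accuracy"))
    means = {m: scores[f"test_{m}"].mean()
             for m in ("precision", "recall", "f1", "accuracy")}
    return means, scores

# Paired t-test between per-fold accuracies of two feature combinations,
# e.g. the all-features baseline vs. a reduced feature set:
#   _, s_all = evaluate(X_all, y); _, s_view = evaluate(X_view, y)
#   t, p = ttest_rel(s_all["test_accuracy"], s_view["test_accuracy"])
```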

4.1 Features

We study whether there is a significant difference between the features that are useful for predicting relevance on mobile and those that predict relevance on desktop. Our objective is to capture features that have different distributions on the two devices. Features that capture link authority or content novelty contribute equally to relevance whether a page is rendered on mobile or desktop (i.e., they have the same value), so we exclude them from our analysis. Several features have been reported to be important in characterizing relevance. Zhang et al. [20] investigated five popular factors affecting relevance and found understandability and reliability to be highly correlated with page relevance. We capture both topicality- and understandability-oriented features in this work. Past work has also shown that textual information, page structure and page quality [16] impact a user's notion of relevance.

Table 1. Features calculated for webpages on mobile and desktop

As mentioned before, the screen resolution of a mobile device greatly affects the legibility of on-screen text. If websites are not optimized to render on small screens, users may get frustrated quickly by the repeated tapping, pinching and zooming needed to understand the content. Our hypothesis is that the relative position and size of text on a page are important indicators, since viewport size varies greatly between a mobile and a desktop and thus affects the user's time and interaction with the page. We extract features that rely on the visible content of the web page, the position of its elements and their rendered sizes, and evaluate these viewport (i.e., interface) oriented features to predict page relevance.

Both content- and viewport-specific features are summarized in Table 1. Content-oriented features are calculated at two levels: the entire web page (html) and only the portion of the page visible to the user (viewport) when she first lands on the page. Geometric features capture the mean, minimum and maximum of the absolute on-screen coordinates of query terms and headings. Display-specific features capture the absolute size, in pixels, of query terms, headings and the remaining words on the page. These features are calculated by simulating how the pages are rendered on mobile and desktop with the help of Selenium web browser automation, which provides information about the rendered DOM elements of the HTML at runtime.
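For illustration, the sketch below shows how such rendered-geometry features might be collected with Selenium by loading a page at a mobile-sized versus a desktop-sized window. The window sizes, selectors and feature names are our own assumptions rather than the exact setup used in the study.

```python
# Illustrative sketch of viewport/geometric feature extraction with Selenium.
# Window sizes, selectors and feature names are assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By

MOBILE = (360, 640)    # assumed mobile viewport, in pixels
DESKTOP = (1366, 768)  # assumed desktop viewport

def geometric_features(url, query_terms, size):
    driver = webdriver.Chrome()
    driver.set_window_size(*size)
    driver.get(url)
    feats = {}
    # Headings rendered on the page: on-screen position and rendered size
    headings = driver.find_elements(By.CSS_SELECTOR, "h1, h2, h3")
    tops = [h.location["y"] for h in headings]
    heights = [h.size["height"] for h in headings]
    feats["num_headings"] = len(headings)
    feats["min_heading_y"] = min(tops) if tops else -1
    feats["mean_heading_height"] = sum(heights) / len(heights) if heights else 0
    # Query terms anywhere in the rendered body text; a viewport-restricted
    # count would additionally filter elements whose y-coordinate < viewport_h.
    viewport_h = driver.execute_script("return window.innerHeight;")
    body_tokens = driver.find_element(By.TAG_NAME, "body").text.split()
    feats["query_terms_in_page"] = sum(t.lower() in query_terms for t in body_tokens)
    feats["viewport_height"] = viewport_h
    driver.quit()
    return feats
```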

Pearson's correlation (R) between the top five statistically significant features (p-value \(< 0.05\)) and the mobile/desktop judgments is shown in Table 2. As we can see, content-oriented features are highly correlated with desktop judgments, whereas both view- and size-oriented features are correlated with mobile judgments.
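A small sketch of this correlation analysis is given below, assuming the feature matrix is a NumPy array with one column per feature; the function name and arguments are illustrative only.

```python
# Sketch: Pearson's R between each feature and the (binary) judgments, keeping
# only p < 0.05 and reporting the top five by absolute correlation.
from scipy.stats import pearsonr

def top_correlated_features(feature_matrix, feature_names, labels, k=5, alpha=0.05):
    """feature_matrix: NumPy array of shape (n_documents, n_features)."""
    results = []
    for name, column in zip(feature_names, feature_matrix.T):
        r, p = pearsonr(column, labels)
        if p < alpha:
            results.append((name, r, p))
    return sorted(results, key=lambda t: abs(t[1]), reverse=True)[:k]
```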

Table 2. Pearson’s R between features and judgments on mobile and desktop

4.2 Mobile Relevance Prediction

Several classifiers can be trained with a large combination of features to determine relevance. We focus on training an AdaBoost classifier with multiple feature combinations.

Table 3 shows the classification results for several feature combinations. The row labeled all indicates the model trained with all the features. The rows labeled no.x correspond to models using all features except those of type x, while the rows labeled only.x report metrics for models trained only on features in group x. Finally, four pairs of feature groups, geom.size, geom.view, geom.html and view.html, correspond to models trained on features from two of the geometric, content (viewport (view) or html) and display groups. Treating the model trained with all features as the baseline, statistically significant models (p-value \(< 0.05\), paired t-test) are marked with (*).
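For clarity, the sketch below shows one way the feature combinations named in Table 3 could be assembled; the column indices are placeholders and "size" is read as the display group, which is our interpretation rather than a statement from the paper.

```python
# Sketch of assembling the feature combinations evaluated in Table 3.
# `groups` maps each feature group to the column indices it occupies in the
# full feature matrix; the indices here are placeholders.
groups = {"geom": [0, 1, 2], "display": [3, 4], "view": [5, 6, 7], "html": [8, 9]}

def feature_sets(groups):
    all_cols = sorted(c for cols in groups.values() for c in cols)
    sets = {"all": all_cols}
    for name, cols in groups.items():
        sets[f"only.{name}"] = sorted(cols)
        sets[f"no.{name}"] = [c for c in all_cols if c not in cols]
    # The four pairwise models reported in the paper ("size" read as display)
    for a, b in [("geom", "display"), ("geom", "view"),
                 ("geom", "html"), ("view", "html")]:
        sets[f"{a}.{b}"] = sorted(groups[a] + groups[b])
    return sets

# Each entry selects columns X[:, cols] before training one classifier.
```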

The classification accuracy for mobile relevance is significantly better than random. The best performing feature combination for mobile is the viewport features (only.view); the confusion matrix for this best model is given in Table 4. Surprisingly, the accuracy does not improve when using all features, and there is no improvement in performance when display-specific features are taken into account. A binary classifier trained solely on html features does worse than the classifier trained using viewport features. Among the pairwise combinations, the content-specific features (view.html) perform best, which is perhaps due to the presence of viewport-based features in the model.

Table 3. Classification results for Mobile and Desktop

Our hypothesis that geometric and display-oriented features impact relevance is not supported by the results. Display features, for instance, when used alone for binary classification, have the lowest accuracy among all feature combinations. It is worth noting that when viewport features are dropped (no.view) from training, the accuracy drops by 3 % compared to the model trained on all the features.

The features that had the highest scores in the viewport (only.view) classifier are listed below, in decreasing order of importance: total tokens in the viewport (0.28), number of images in the viewport (0.20), number of tables (0.16), number of sentences (0.10), number of outlinks with query terms (0.07), number of outlinks (0.06) and number of headings with query terms (0.06).
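These scores can be read off the fitted model's feature importances; a minimal sketch, assuming the scikit-learn AdaBoost implementation and placeholder argument names, is shown below.

```python
# Sketch: rank features by importance from a fitted AdaBoost model (assuming
# scikit-learn). X, y and feature_names are placeholders for the viewport
# feature matrix, binary labels and column names.
from sklearn.ensemble import AdaBoostClassifier

def ranked_importances(X, y, feature_names, seed=0):
    """Fit the classifier on one feature group and rank features by importance."""
    clf = AdaBoostClassifier(random_state=seed).fit(X, y)
    return sorted(zip(feature_names, clf.feature_importances_),
                  key=lambda pair: pair[1], reverse=True)
```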

Table 4. Mobile Confusion matrix (only.view model)
Table 5. Desktop Confusion matrix (view.html model)

4.3 Desktop Relevance Prediction

The desktop relevance prediction results are also shown in Table 3. The overall accuracy of relevance prediction on desktop pages is low; it is in fact lower than that observed on mobile. The best performing system is the one trained with content-based features, i.e., the viewport and html features (view.html); its confusion matrix is given in Table 5. The difference in accuracy between the best performing model on mobile (only.view) and the best performing system on desktop (view.html) is 17 %. It is interesting to note that viewport features are useful indicators of relevance regardless of the judging device: the models with viewport features (only.view, geom.view and view.html) perform better than the model built using all features. This suggests that users tend to deduce page relevance from the content immediately visible once the page finishes loading.

It is also surprising that the classifier trained on features extracted from the entire document (only.html) performs worse than the one trained using only viewport features (only.view). This could be due to the limited number of features used in our study; with a more extensive set of page-specific features, the classifier might perform better.

Our hypothesis that geometric features affect relevance is not supported by either experiment. Geometric features are not useful in predicting relevance on desktop: the classifier trained only on geometric features achieves only 54 % accuracy, 10 % lower than the model trained with all features. It is therefore not surprising to observe a jump in accuracy once geometric features are removed from model training. Both experiments thus suggest that the position of query terms and headings, on both mobile and desktop, is not useful in predicting relevance.

Among models trained on a single set of features, the model with display-specific features (only.display) performs best, with 64 % accuracy and an F1-score of 0.74. It seems that the font size of query terms, headings and other words is predictive of relevance. However, the accuracy of the only.display model (64 %) is still lower than that of the view.html model. Among models trained on pairs of feature groups, view.html performs best (67 %), closely followed by the geom.view model (62 %). This is perhaps due to the presence of viewport features in both models.

The most representative features in the content-based (view.html) classifier, in decreasing order of importance, are: number of headings (viewport) (0.23), number of images (viewport) (0.14), query term frequency (html) (0.08), number of unique tokens (html) (0.06), number of tables (viewport) (0.06), number of words (html) (0.05) and number of outlinks (0.04).

Despite the promising results, our study has several limitations. It is worth noting that our study contained only 236 query-url pairs; with more data and a more extensive set of features, prediction accuracy would likely improve. We used query-url pairs from a previous study, which gathered queries for only seven topic descriptions or tasks. We plan to follow up with a study containing a larger number and variety of topics to further analyze the impact of device size on judgments.

Overall, our experiments indicate that viewport-oriented features are useful in predicting relevance. However, the model trained with viewport features on mobile judgments significantly outperforms the model trained on desktop judgments. Our experiments also show that features such as query term or heading positions are not useful in predicting relevance.

5 Conclusion

Traditional relevance judgments have always been collected on desktops. While existing work suggests that page layout and device characteristics have some impact on page relevance, we do not know whether page relevance changes from device to device. With this work, we therefore tried to determine whether device size and page rendering have any impact on judgments, i.e., whether different web page layouts (on mobile or desktop) translate into different relevance labels from judges. To that end, we systematically compared crowdsourced judgments of query-url pairs on mobile and desktop.

We analyzed different aspects of these judgments, mainly observing differences in how users evaluate highly relevant and non-relevant documents. We also observed strikingly different judging times, despite similar inter-rater agreement on both devices. We further used viewport-oriented features to predict relevance; our experiments indicate that they are useful in predicting relevance on both mobiles and desktops, although prediction accuracy on mobiles is significantly higher than on desktop. Overall, our study shows that there are certain differences between mobile and desktop judgments.

There are several directions for future work. The first and foremost would be to scale up this study and analyze judging behavior more extensively to draw stronger conclusions. Second, it would be worthwhile to investigate further the role of viewport features in user interaction and engagement on mobiles and desktops.