1 Introduction

The growing popularity of search from mobile devices equipped with a camera and the advancement of computer vision techniques have given rise to a new form of search: search by image, also commonly referred to as visual search. Visual search enables users to input an image as a query and retrieve a ranked list of results based on their relevance to the input image. Major Web search engines, such as Google and Bing, have introduced visual search functionality, which allows querying for information that is hard to articulate by text (Bitirim et al., 2020; Hu et al., 2018). Neural network techniques for image recognition support effective feature representation, classification, segmentation, and detection, and enable efficient retrieval of relevant results given an image query over huge corpora (Wan et al., 2014; Kim et al., 2016; Shiau et al., 2020). As Web content becomes ever more visual (Shiau et al., 2020), with the explosive growth in the number of online photos in social media and other websites (Zhai et al., 2017; Zhang et al., 2018), allowing users to express their information needs through an image becomes imperative. In a culture already dominated by the visual, it is only natural for an image to serve as the starting point of a search.

In recent years, visual search has been implemented and studied in a variety of domains, such as travel (Parikh et al., 2018), news (Saez-Trumper, 2014; Elkasrawi et al., 2016), healthcare (Hegde et al., 2019; Gandomkar & Mello-Thoms, 2019), education (Shapovalov et al., 2018), and food (Lien et al., 2020; Zhu et al., 2019). Notably, one of the most popular visual search domains is electronic commerce. Sometimes referred to as visual shopping (Togashi & Sakai, 2020), visual search in e-commerce allows customers to search for listed items or catalog products using an image instead of the keywords normally used in e-commerce search (Li et al., 2020a). This type of search naturally reflects offline shopping processes, which are often driven by visual inspection, and brings a sense of visual discovery to the online world (Zhang et al., 2018; Bhardwaj et al., 2013). For instance, a customer may take a photo of a hat, either of another person, or in a store, and would like to instantly look up prices or stock availability from online stores (Bhardwaj et al., 2013). Alternatively, they may run into an image of an item they like on their social media feed and would like to quickly search for it (Goel, 2017).

Search by image has a number of potential advantages over traditional text-based search. First, it can be fast and intuitive, as simple as uploading or taking a picture and triggering a search. Second, it is agnostic to language, which becomes increasingly important as online shopping becomes global. In addition, it does not require customers to be acquainted with the terminology used by the e-commerce site for the type of merchandise they are seeking (Li et al., 2020a). For example, some users might be interested in “jeans with holes”, but the relevant products are described as “distressed jeans” (Laenen et al., 2018). Some e-commerce categories, such as Fashion, Home Decor, or Art, are fundamentally defined by visual characteristics that are sometimes difficult if not impossible to articulate by text (Shiau et al., 2020; Bhardwaj et al., 2013). For instance, on Etsy, an online marketplace for handmade and vintage goods, Style is particularly important as buyers often seek items that match their eclectic tastes (Jiang et al., 2019). Even after filtering by a large number of attributes, users may be confronted with hundreds of items that differ by Style (Jiang et al., 2019). In Fashion, customers often seek a new look, outfit, or theme; visual search technology helps express these aesthetic aspects in a way text has never been able to capture (Bell et al., 2020). On the other hand, it should be noted that other aspects, such as size or brand names, can be more easily expressed by text.

Integrating visual search capabilities can enhance customer experience and increase engagement. In a recent survey by visual content company ViSenze, \(62\%\) of Millennials and Gen Z consumers indicated they wish for visual search over any other new technology.Footnote 1 Photo sharing service Pinterest reported that among its 350M monthly users, many have expressed a desire for visual shopping (Shiau et al., 2020). A study from The Intent Lab found that 85% of the young respondents put more importance on visual information than textual information.Footnote 2

In recent years, many Web and e-commerce sites have introduced visual search functionality into their commercial applications (Jing et al., 2015; Yang et al., 2017; Zhang et al., 2018; Li et al., 2018; Hu et al., 2018; Bitirim et al., 2020). E-commerce platform Alibaba reported that their “search by image” application triggered high attention and wide recognition, and has experienced swift growth with an average of over 17 million daily active users in 2017 (Zhang et al., 2018). Pinterest integrated “shoppable” pins into its visual search, making it easier for users to purchase products they have taken photos of (Shiau et al., 2020). However, despite the growing popularity of visual search, to the best of our knowledge no study has performed an in-depth analysis of visual search usage. The majority of the literature on visual search in recent years has focused on describing the end-to-end system architecture (Yang et al., 2017; Jing et al., 2015; Hu et al., 2018; Li et al., 2018; Zhang et al., 2018; Lin et al., 2019), demonstrating the complexity of real-world visual search systems (Jing et al., 2015), with challenges such as large gaps in image quality between the query and results; indexing of dynamic data; and training of large-scale ranking models (Zhang et al., 2018). Another area of focus has been the evaluation of ranking models, including feature representation, retrieval models, and similarity calculation (Wan et al., 2014; Zhai et al., 2017; Hsiao et al., 2014; Misraa et al., 2020; Li et al., 2020a; Zhai et al., 2019; Zhang et al., 2019).

In this work, we perform a search log analysis of over 1.5 million image queries, issued to the mobile application of eBay, one of the most widespread e-commerce platforms, over a period of four weeks. We compare the image queries with a sample of text queries of similar size, performed on the same mobile application during the same time period. Our comparison encompasses characteristics of context, sessions, retrieved results, attributes (facets) used for query refinement, users, and clicks. We also analyze the image searches according to several unique characteristics of images, comparing searches with images captured from the device’s camera to searching with gallery images. In the final part of our work, we experiment with query performance prediction for visual search, revealing several novel pre- and post-retrieval predictors that demonstrate significant performance.

Our key contributions can be summarized as follows:

  • To the best of our knowledge, we present the first comprehensive in-depth analysis of a visual e-commerce search log.

  • We combine analysis of queries, sessions, retrieved results, refining attributes, users, and clicks to shed more light on the commonalities and differences between image and text queries.

  • We provide empirical evidence that image queries are more specific than text queries.

  • We evaluate a set of query performance predictors for visual search and compare them with classic predictors in textual search. Our evaluation reveals a variety of strong predictors for the performance of visual queries.

This work extends the findings previously reported in Dagan et al. (2021). The main new contributions include the following:

  • Additional analysis of result page characteristics, considering the title length and price (Sect. 6.2).

  • Introduction of the notion of “intent sessions”, which captures a more focused sequence of interactions with the search engine (Sect. 7).

  • Analysis of price, condition, and other “global” filters used for search refinement, alongside category-specific attributes (Sect. 8).

  • Exploration of bias towards results with a similar image, showing an effect relevant to click models in visual e-commerce search (Sect. 9.2).

  • A new user-based analysis, exploring in depth whether the observed differences between visual and textual search are due to the different types of users or persist across the exact same set of users (Sect. 10).

  • Extension of our experiments with visual query performance prediction, introducing new pre-retrieval predictors and an additional evaluation metric (Sect. 11).

Overall, our findings suggest different ways for e-commerce search systems to enhance their support and take advantage of the unique characteristics of image queries. We conclude the paper by summarizing the key findings and discussing their implications and future research directions.

2 Related work

In this section, we cover related work, starting from search log analysis, through broad visual search, to visual search in e-commerce. For the latter, we also elaborate on work in the Fashion domain and a line of studies performed by the Pinterest image sharing service. Finally, we discuss the connection to work on visual e-commerce recommendation and image search.

2.1 Search log analysis

Studies of search log analysis, inspecting the queries submitted by users to search engines, and often the returned results and the user’s interaction with them, have been published since the early days of the Web (Broder, 2002; Jansen, 2006). The emergence of mobile technologies led to a variety of studies inspecting queries submitted from mobile devices. For example, Kamvar and Baluja (2006) performed a large-scale analysis of Google mobile search, differentiating it from desktop search. Baeza-Yates et al. (2007) compared mobile and desktop search queries and found that mobile queries included more queries in the Business category and fewer in Art. Song et al. (2013) recommended, using log analysis of the Bing search engine, that ranking methods for smartphones, tablets, and desktop devices adapt to specific user behavior patterns on each of these platforms. Guy (2016) reported an in-depth analysis comparing spoken queries to typed-in queries on mobile web search, indicating that the language used in voice search is closer to natural language. In this work, we perform a log analysis of mobile e-commerce search on eBay, one of the world’s largest marketplaces, in order to characterize visual search and compare it to textual search.

Log analysis of e-commerce search has been studied both as a prominent vertical of Web search (Jansen & Spink, 2006) and by inspecting the search logs of e-commerce platforms. Su et al. (2018) combined search log analysis and a user survey to define three main intent categories for shoppers: target finding, decision making, and exploration. Sondhi et al. (2018) proposed a refined version with five categories based on analysis of e-commerce search logs, associating each of the categories with a distinct user search behavior. Hirsch et al. (2020) provided a large-scale and in-depth study of users’ query reformulations in e-commerce search and showed that well over \(50\%\) of the queries take part in a reformulation session. In this work, we characterize visual search and compare it to textual search in e-commerce using a log analysis of one of the world’s largest marketplaces.

2.2 Visual search

The task of visual search, or search via an image query, has been extensively studied by the Computer Vision and Multimedia communities. Techniques have evolved from feature-based and bag-of-words approaches (Datta et al., 2005; Bhardwaj et al., 2013) to deep learning and semantic representation methods (Liao et al., 2018; Lin et al., 2019; Li et al., 2020a; Misraa et al., 2020). With the growing popularity of mobile devices, which made camera use ubiquitous, and the advancement of deep learning techniques for computer vision and particularly for visual search, more studies introducing visual search systems have emerged. These studies focus on the end-to-end architecture and, in some cases, evaluation of the retrieval model, rather than on query log analysis and behavioral characteristics, as explored in our work. Hu et al. (2018) provided an overview of the visual search system in Microsoft Bing. They described the methods used to address relevance (using a learning-to-rank approach with visual features), latency, and storage scalability and provided an evaluation of these three dimensions. Bhattacharya et al. (2019) presented a multimodal dialog system to help online customers visually browse through large image catalogs, using both visual and textual queries.

Web search using images has also been referred to as “reverse image search”. While in classic image search, the query is textual and the results are images, in reverse image search, the query is an image and the results are (typically) textual documents. Bitirim et al. (2020) performed an evaluation of Google’s reverse image search performance, in terms of average precision at varying sizes of result sets. Reilly and Thompson (2017) conducted a user study of reverse image search over a digital library.

2.3 Visual e-commerce search

By and large, the most popular domain of visual search research has been electronic commerce. In recent years, a variety of studies have been published describing the architectures of a “search by image” functionality introduced by multiple e-commerce platforms and evaluating different search algorithms to enable effective and efficient visual search. Zhang et al. (2018) introduced the large-scale visual search algorithm and system infrastructure at Alibaba. They discussed challenges such as bridging the gap between real-shot images from user queries and stock images; dealing with large-scale indexing of dynamic data; training deep models for effective feature representation without massive human annotation; and improving user-based metrics by considering image quality for result re-ranking. Their mobile application, named “Pailitao”, which means shopping through the camera, enabled “search by image” using the visual search service. In a follow-up work (Zhang et al., 2019), the authors proposed learning image relationships based on co-click embedding, to guide category prediction and feature learning. They showed that the richness of click data enables better reflection of users’ interests and improves visual search relevance. Li et al. (2018) presented the design and implementation of a visual search system for real-time image retrieval on JD.com, one of China’s largest e-commerce sites. They demonstrated that their system can support real-time visual search with hundreds of millions of product images at sub-second timescales and handle frequent image updates through efficient indexing methods.

Yang et al. (2017) described the end-to-end approach for scalable visual search infrastructure at eBay, along with in-depth discussions of its basic components and optimizations, trading off search relevance and latency. They applied a supervised approach with multiple deep learning models to retrieve and rank listings from eBay’s huge inventory and showed, using an ImageNet benchmark, that their approach is faster and more accurate than several unsupervised baselines. Among the key challenges, they enumerated the dynamic nature and scale of eBay’s inventory; the diverse image quality contributed by a variety of sellers; and the diverse quality of image queries, often captured by the shopper’s mobile device camera. To a large extent, our work takes advantage of the system described in that work to characterize visual search use on eBay and compare it with textual search. In an earlier study (Bhardwaj et al., 2013), eBay researchers presented a simple and fast search algorithm that uses color as a key feature for building visual search. They showed that low-level cues such as color can be used to quantify image similarity and distinguish between products with different visual appearance, to support fast and effective visual search.

2.4 Visual search for fashion

Much of the work on visual search in e-commerce has focused on the Fashion category. Many of the common attributes in this category, such as style (Kim et al., 2016; Kang et al., 2019; McAuley et al., 2015) or texture (Wróblewska & Rączkowski, 2016), are difficult to specify in words and easier to describe using an image. In addition, the notion of relevance in these categories is often mostly visual (Bhardwaj et al., 2013). Laenen et al. (2018) focused on the Dresses category and introduced a search interface and ranking model that supports the submission of an image query and the refinement of the image’s attributes using a text query. On eBay and other e-commerce platforms, such a refinement is enabled by facets (Tunkelang, 2009) presented according to the query’s category. Kim et al. (2016) described their visual Fashion search system on the Korean e-commerce site SK Planet. Liao et al. (2018) incorporated the category tree hierarchy into a deep learning model to better capture user intent and reason about results in visual Fashion search. Togashi and Sakai (2020) examined the relationship between “visual intents”, in terms of color, texture/material, and design, and user feedback in the form of clicks, likes, and purchases. They found that visual relevance positively correlates with and can actually cause user feedback. They also recommended diversifying search results, since different combinations of visual intents may lie behind the same image query. Our work examines visual search on eBay, which spans a variety of shopping categories. Our findings indicate that Fashion is indeed a popular category for visual search, but a few other categories are even more popular, such as Collectibles, Art, and Toys.

2.5 Visual search in Pinterest

Image sharing service Pinterest has been the source of a variety of studies describing its visual search and discovery system and some of the algorithms behind it. All applications were directed at online shopping, giving another indication of the relevance of visual search to e-commerce. The earliest work (Jing et al., 2015) described how Pinterest built a cost-effective large-scale visual search system and showed its positive effect on user engagement. A follow-up work focused on the notion of visual discovery, enabling users to select any object in an image as a visual query (Zhai et al., 2017). They presented an overview of the visual discovery engine and shared the rationales behind technical and product decisions, such as the use of binarized features, object detection, and interactive user interfaces. Another work (Zhai et al., 2019) described the image embedding process behind Pinterest’s visual search, using a multi-task learning architecture capable of jointly optimizing multiple similarity metrics, such as browsing and searching relevance. They detailed how to jointly train for multiple product objectives and how to leverage both engagement data and human-labeled data. The resulting unified embedding outperformed all specialized embeddings trained on individual tasks. Recently, Pinterest introduced “shop the look” (Shiau et al., 2020), with an explicit goal to fulfill the user’s shopping intent. The service detects objects within billions of inspirational scenes, and finds matching products from a huge product corpus that are visually similar to the detected objects. A related study named “complete the look” focused on the task of “style compatibility” in order to recommend complementary products for an outfit (Kang et al., 2019; Li et al., 2020b).

2.6 Visual e-commerce recommendation

Similarly to Pinterest, several other studies focused on a recommendation scenario, where the image “query” is not necessarily input explicitly by the user. Hsiao et al. (2014) used visual similarity to refine personalized product recommendations. McAuley et al. (2015) proposed a deep learning method based on heterogeneous graph embedding to recommend which clothes and accessories will complement each other well based on their appearance. Wróblewska and Rączkowski (2016) described a content-based visual recommender on a Polish online marketplace based on color and texture characteristics. Parikh et al. (2018) used convolutional neural networks (CNNs) to recommend tourist attractions, restaurants, and hotels based on their visual characteristics. In this work, we focus on pure visual search, analyzing a large-scale log of searches performed using explicit image and text queries.

2.7 Image search

Visual search should not be confused with the broad domain of image search, which refers to the results rather than the query: image search, i.e., search whose result set consists of images, is a popular search vertical and has been extensively studied (e.g., (Datta et al., 2008; Goodrum & Spink, 2001)). Image search and visual search naturally integrate when both the query and returned results are images. This type of search is often referred to as content-based image retrieval, or CBIR (Datta et al., 2005; Wan et al., 2014). In our work, however, we explore visual search in another popular search vertical – shopping – with e-commerce listed items as returned results. To the best of our knowledge, no comprehensive log analysis of a visual e-commerce search engine has been reported.

3 Research setting

Fig. 1: Step-by-step demonstration of the search by image experience using eBay’s mobile app

Our analysis is based on a random sample of 1,635,632 image queries from the eBay mobile search application, performed by over 250,000 unique users over a period of exactly four weeks (February 2nd-29th, 2020) in the United States. The eBay mobile search application allows searching with an image by clicking on a camera icon to the right of the textual search box. The user can then either instantly take a photo to be used as the query using the device’s camera or upload an image from the device’s photo gallery. After inputting an image, either by camera or by gallery, the eBay search engine retrieves a list of results relevant to the image query, presented to the user according to their relevance rank. The returned list of results can be traversed from top to bottom (and back) by scrolling.

Figure 1 depicts a step-by-step demonstration of this process. After opening the mobile application, the user can choose to browse various categories or go to search mode by clicking the magnifying glass icon (Step 1). After entering search mode, the user can directly type a text query, use voice search by clicking the microphone icon, or use image search by clicking the camera icon (Step 2). Upon clicking the camera icon, the user enters the image search mode, where the three options include taking a photo instantly using the device’s camera, uploading a photo from the device’s gallery, or scanning a barcode (Step 3). If the user decides to take a new photo, the app opens a photo-taking interface with options to toggle between the front and back camera and to select various aspect ratios (Step 4). Finally, after taking the photo (in this example, a wireless controller for PlayStation 5), the search engine retrieves the most relevant results and presents them to the user (Step 5).

For comparison, we collected an identical number of queries performed using the “regular” textual search box of the same mobile application. We refer to the former set of queries as image queries and to the latter as text queries. The text queries were collected over the same period of four weeks from a similar number of users. Moreover, we sampled an identical number of image and text queries on each day of the experimental period. Basic demographic data, including gender, age, and location in terms of city and state, was approved for analysis in aggregate across the entire image and text datasets. When inspecting day-of-week distribution and session statistics, we compared all queries from all users in our image sample with all queries from all users in our text sample during the four weeks of the experimental period, to allow suitable analysis. The portion of unique queries out of all queries was similar between the image and the text samples: \(69.5\%\) and \(73.5\%\), respectively.

Each query in the log, either image or text, included, in addition to the query itself, a timestamp (adapted to the timezone in which it was performed) and the list of retrieved results presented to the user on the search engine results page (SERP). Each returned result is a listed offer, or listing in short, by a specific seller. In other words, the same product may appear multiple times on the SERP, with different sellers, prices, delivery options, and so forth. Our data included, for each result, its rank on the SERP (the top result is at rank 1) and a unique listing URL. In addition, for each query we had information about its associated clicks and purchases, if any were performed, including their ranks and corresponding listing URLs. After a query (image or text) is submitted and the results are presented, the user can refine the result list using attributes, such as color, brand, or size. Our log included the attributes used for refinement and their values or value ranges (e.g., color ‘blue’ or size over 40 inches).
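To make this log structure concrete, the following minimal sketch shows how a single log record could be represented; the class and field names are illustrative assumptions of ours and not eBay’s actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class SerpResult:
    rank: int            # position on the SERP; the top result is at rank 1
    listing_url: str     # unique listing URL
    leaf_category: str   # LC of the listing
    meta_category: str   # MC of the listing

@dataclass
class Refinement:
    attribute: str       # e.g., "color", "brand", or "size"
    value: str           # e.g., "blue" or "over 40 inches"

@dataclass
class QueryLogRecord:
    query_type: str                      # "image" or "text"
    timestamp: datetime                  # adapted to the timezone of the query
    results: List[SerpResult]            # retrieved results, ordered by rank
    refinements: List[Refinement] = field(default_factory=list)
    clicked_ranks: List[int] = field(default_factory=list)    # ranks of clicked results
    purchased_ranks: List[int] = field(default_factory=list)  # ranks of purchased results
```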

eBay spans a variety of shopping domains. Each listing on eBay is associated with a leaf category (LC), which is the most specific type of node in eBay’s taxonomy. The taxonomy includes tens of thousands of LCs, such as Electric Table Lamps, Developmental Baby Toys, or Golf Clubs. Each listing is also associated with one out of 43 meta-categories (MCs), such as Home & Garden, Toys & Hobbies, or Collectibles. For each result on the SERP, we had information about the LC and MC it belonged to.

Our analysis is organized as follows. Section 4 compares basic characteristics of image and text searches, including searchers’ demographics and context. Section 5 examines the image query characteristics, including source (captured by camera or uploaded from gallery), orientation (vertical or horizontal), brightness, and catalog quality. Section 6 looks into the characteristics of retrieved results, including their category distribution and image quality. Section 7 inspects session characteristics. Section 8 examines the attributes used to refine image queries in comparison with text queries, while Sect. 9 inspects click characteristics, such as click-through rate and mean reciprocal rank. In Sect. 10, we examine user characteristics, focusing on those who use both visual and textual search. Finally, in Sect. 11, we describe our experimentation with a set of new pre- and post-retrieval performance predictors for visual search.

Table 1 summarizes the key variables and measurements considered in this paper.

Table 1 Variables and measurements summary

4 Context and demographics

We found similar demographic characteristics for image and text queries in terms of searchers’ age and location (city and state). For gender, we observed a substantially higher portion of female searchers for visual search: the ratio of queries performed by female versus male searchers was higher by a factor of 2.56 compared to textual search. This trend persisted across all MCs, such as Collectibles (ratio: 3.13), Home & Garden (ratio: 1.97), and Fashion (ratio: 1.61), and further intensified when inspecting only image queries performed using a gallery photo (ratio: 3.67).

The mobile device’s operating system distribution was similar between iOS and Android, with a slightly higher portion of the image queries originating from Android compared to text queries (ratio between the two portions: 1.06).

The distribution across day-of-week was similar for image and text queries: in both, there was a slight peak on weekends compared to weekdays. In contrast, there was a noticeable difference between image and text queries with regard to time-of-day, as depicted in Fig. 2. Image queries were more frequent during day hours (from 6am to 4pm), with a peak at 12pm, while higher portions of the text queries (relative to image) were performed during late afternoon, evening, and night, peaking at 6pm. This trend was consistent across all seven days of the week.

Fig. 2: Query distribution by hour of the day

In our analysis, we inspected the results while controlling for factors that were found to be different between image and text queries, including time-of-day and gender. When relevant, we report the influence of these factors on the results.

5 Queries

As mentioned in Sect. 3, visual search can be used via two flows: using the device’s camera to instantly take a photo and using the camera roll, or photo gallery, to upload one. We refer to these two flows as the camera flow and the gallery flow, respectively. In Fig. 3, examples 1, 5, 6, 7, and 8 demonstrate image queries using the camera flow, while examples 2, 3, 4, 9, and 10 demonstrate image queries using the gallery flow. In our sample, \({\textbf {80.07}}\%\) of the image queries were performed using the camera flow and \({\textbf {19.93}}\%\) using the gallery flow. These portions vary substantially across categories: MCs with a high portion of camera queries (over \(85\%\)) include media (Books, Music, Video Games), Collectibles, Antiques, and Art. On the other hand, MCs with a high portion of gallery queries (over 30%) include Fashion (with nearly half of the queries), Jewelry & Watches, Cellphones & Accessories, and Health & Beauty. In the next section, we explain in more detail how we associate a query with an MC. Zooming in on LCs, those related to cards, especially Sports cards, as well as vintage items, such as VHS tapes, have high portions of camera queries (over 90%), whereas fashion LCs such as dresses, earrings, heels, and women’s tops, boots, and swimwear have especially high portions of gallery queries. As mentioned in Sect. 4, gallery queries in general are especially popular with women. Overall, it can be observed that when searching for vintage and collectible items, as well as media items, users typically take their own picture to perform a visual search, while for “showy” items, such as fashion, jewelry, and beauty, they more often use a gallery image.

Throughout our analysis in the remainder of this paper, we compare key characteristics between the camera and gallery flow.

Fig. 3: Examples of image queries

In the remainder of this section, we examine three image characteristics – orientation, brightness, and catalog quality – and compare them between the two flows. The differences in such query characteristics may influence the models used to retrieve results considering each of the two flows.

Orientation The aspect ratio of an image is the ratio of its width to its height. When it is higher than 1, the image has a horizontal orientation; when it is lower than 1, the image is vertical; and when it is exactly 1, the image is square. Table 2 summarizes the query image orientation. It can be seen that the portion of vertical images was substantially higher for camera queries than for gallery queries. This may stem from the fact that users shoot their camera queries while holding the phone in the more natural and common vertical orientation and do not bother to switch to horizontal for querying. It should be noted that the portion of vertical images is relatively high even in gallery photos. Recent datasets of mobile photos include a much lower percentage of vertical photos, e.g., \(44.5\%\) (Tian et al., 2019) and \(40.3\%\) (Hadwiger & Riess, 2020). It can also be seen that square photos were very rarely used in camera queries, but were much more common in gallery queries.
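As a small illustration, orientation can be derived directly from the image dimensions; the function below is a sketch, with the width and height assumed to come from the query image metadata.

```python
def orientation(width: int, height: int) -> str:
    """Classify an image as horizontal, vertical, or square by its aspect ratio."""
    aspect_ratio = width / height
    if aspect_ratio > 1:
        return "horizontal"
    if aspect_ratio < 1:
        return "vertical"
    return "square"
```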

Table 2 Distribution of image orientation by flow

Brightness Figure 4 depicts the brightness (Bezryadin et al., 2007) histogram (by buckets of brightness values) of camera and gallery queries. For reference, the figure also plots the brightness distribution of two publicly available image datasets: a Flickr dataset (Young et al., 2014) that contains 30K pictures taken by Flickr users and a Fashion dataset (2018) that contains 44K professional stock photos of fashion products. It can be seen that gallery queries are generally brighter than camera queries. The camera query brightness histogram is almost identical to that of the Flickr dataset, with its user-generated photos. The gallery histogram, on the other hand, spans almost the entire range and overlaps with both the Flickr and Fashion datasets. This implies that gallery queries include both user-generated photos and professional studio photos. In Fig. 3, examples 3 and 10 demonstrate uploaded user-generated screenshots, while 2 and 4 are uploaded stock photos. A comparable analysis of the luminance (Bezryadin et al., 2007) histograms showed the same trends and is excluded for clarity of presentation.
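The sketch below illustrates how such a per-image brightness histogram could be produced; it uses the Euclidean norm of the RGB vector, averaged over pixels, as a simple brightness proxy, which is an assumption of this illustration and not necessarily the exact measure of Bezryadin et al. (2007).

```python
import numpy as np
from PIL import Image

def image_brightness(path: str) -> float:
    """Average per-pixel brightness, approximated as the norm of the RGB vector."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    return float(np.linalg.norm(rgb, axis=-1).mean())

def brightness_histogram(paths, bins=20, value_range=(0.0, 442.0)):
    """Histogram of per-image brightness values (e.g., camera vs. gallery queries).
    442 ~ sqrt(3) * 255 is the maximal norm of an 8-bit RGB pixel."""
    values = [image_brightness(p) for p in paths]
    return np.histogram(values, bins=bins, range=value_range)
```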

Fig. 4: Brightness of camera vs. gallery query images, compared with two public datasets: Flickr (Young et al., 2014) and Fashion (2018). Lower values represent darker images

Image quality Online marketplaces often use models for image quality assessment in order to select the best images for their product catalogs (Chaudhuri et al., 2018). We used an in-house tool that assigns a quality score to an image, based on a supervised model trained over a large collection of images uploaded by sellers as part of their listing process. Quality scoring considers factors such as size, cropping, angle view, blur, background, frame, watermarks, inclusion of human body parts, additional elements besides the main product for sale, and package quantity. As could be expected, gallery queries had a substantially higher quality than camera queries, with an average score of 0.90 (std: 0.21, median: 0.97) versus 0.81 (std: 0.28, median: 0.94), respectively. For reference, the average quality score for the Fashion dataset was 0.94 (std: 0.11, median: 0.98) and for Flickr 0.85 (std: 0.19, median: 0.95).

6 SERP

In this section, we inspect various characteristics of the retrieved results for image queries, presented to users on the search engine results page (SERP), in comparison with text queries. Our analysis in this section and those that follow excludes null queries, i.e., queries for which no results were returned (Singh et al., 2012). The portion of null queries in our data was slightly higher for image queries than for text queries: \(1.31\%\) versus \(0.80\%\), respectively. We observed that text null queries were typically longer, at an average of 4.70 terms (std: 3.14, median: 4) compared to 3.183 (std: 1.85, median: 3) for all text queries. For image queries, null queries had lower brightness (Bezryadin et al., 2007) (\(-11\%\)) and a lower aesthetic score (based on a publicly available implementation of an aesthetic quality model (Talebi & Milanfar, 2018), which somewhat resembles the image quality score mentioned above) (\(-17\%\)) compared to all other image queries. Figure 3, example 1 demonstrates a null image query.

Overall, the number of retrieved results was lower for visual search than for textual search, with a ratio of 0.35 between the two averages (std ratio: 0.28, median ratio: 0.57). We refer to the last result viewed (LRV) by the user as the lowest-ranked result the user scrolled down to, as indicated in our logs. The average LRV for image queries was 61.15 (std: 69.43, median: 40), comparable to text queries at 59.15 (std: 68.25, median: 46), indicating that users traverse a similar number of results in both types of search.

For our SERP analysis, unless otherwise stated, we considered the top 40 results, as this was the median number of results traversed by a user in visual search, as noted in the previous paragraph.Footnote 3

6.1 Categories

6.1.1 Number of categories

A prominent characteristic of the SERP is the distribution of results across e-commerce categories. The average number of MCs on the SERP was 1.05 (std: 0.24) for image queries, compared to 1.67 (std: 1.47) for text queries. The average number of LCs on the SERP was 1.14 (std: 0.41) for image queries, compared to 3.46 (std: 3.94) for text queries. Table 3 shows a detailed distribution of the number of MCs and LCs on the SERP. It can be seen that while over \(17\%\) of the text SERPs span six LCs or more, virtually no image SERPs (\(0.02\%\)) do. Overall, we observe that the SERP for image queries is considerably more focused on specific categories: it almost always contains results of one MC and typically only one LC as well. These characteristics were similar for both camera and gallery queries. Figure 5 (left plot) demonstrates that for text queries, the number of MCs and LCs on the SERP decreases as the query length increases; however, even for very long queries (10 terms or more), it is higher than for image queries.

Table 3 Distribution of the number of MCs and LCs among the top 40 retrieved results for image vs. text queries
Fig. 5: Analysis of text queries by length: average number of MCs and LCs on the SERP (left plot) and percentage of queries refined by category-specific attributes (right plot). The dashed lines indicate the respective values for all image queries

6.1.2 Category distribution

For our next analysis, we assign each query to one MC and one LC according to its SERP. We define the dominant category (MC or LC) as the most common category among the top 40 results. In case of a tie, we considered the category with the higher-ranked top result as the dominant category. The use of such a tie breaker was infrequent: only 0.75% (\(2.41\%\)) of the text queries and 0.001% (0.03%) of the image queries for MCs (LCs). During our experimental period, visual search was used across all categories at eBay, spanning all 43 MCs and thousands of LCs. Yet, the distribution across categories was different for image queries than for text queries. To understand the most common distinctive categories in image queries relative to text queries, we used Kullback-Leibler (KL) divergence, which is a non-symmetric distance measure between two given distributions (Berger & Lafferty, 2017). Formally, KL divergence is defined as follows:

$$\begin{aligned} KL(P || Q) = \sum _{c=1}^{M}P_c \log {\frac{P_c}{Q_c}} \end{aligned}$$

where P and Q are the two distributions (in our case, the distributions of image and text queries, respectively, across categories), M is the number of categories, and \(P_c\) and \(Q_c\) are the frequencies of a category c in each of the distributions. Specifically, we calculated the categories (MCs and LCs, respectively) that contribute the most to the KL divergence between the distribution of image queries across MCs (LCs) and the distribution of text queries across MCs (LCs), which was 0.39 (1.11) in total. In other words, we calculated the categories for which the term in the summation above is the highest.
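For illustration, both the dominant-category assignment and the per-category KL contributions can be computed as in the sketch below; the function and variable names are our own, and the category counts are assumed to come from the two query samples described above.

```python
from collections import Counter
import math

def dominant_category(result_categories):
    """result_categories: categories of the top-40 results, ordered by rank.
    Returns the most frequent category; ties are broken in favor of the category
    whose top result is ranked highest (i.e., appears first in the list)."""
    counts = Counter(result_categories)
    top_count = max(counts.values())
    tied = {c for c, n in counts.items() if n == top_count}
    for category in result_categories:   # first occurrence = highest-ranked result
        if category in tied:
            return category

def kl_contributions(image_counts, text_counts):
    """Per-category contribution P_c * log(P_c / Q_c) to KL(P || Q), where P and Q
    are the category distributions of image and text queries, respectively."""
    p_total = sum(image_counts.values())
    q_total = sum(text_counts.values())
    contributions = {}
    for c, n in image_counts.items():
        p_c, q_c = n / p_total, text_counts.get(c, 0) / q_total
        if p_c > 0 and q_c > 0:           # skip categories missing from either sample
            contributions[c] = p_c * math.log(p_c / q_c)
    # most distinctive image-query categories first
    return dict(sorted(contributions.items(), key=lambda kv: kv[1], reverse=True))
```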

Table 4 presents the 10 most distinctive MCs for image queries compared to text queries. For each such MC, the 3 most distinctive LCs that belong to it are presented, to demonstrate a finer granularity of categories that are especially popular for visual search. The list of distinctive MCs is topped by Collectibles, such as mugs, lamps, and plates, with more related MCs further down the list, such as dolls, coins, and stamps. Pottery & Glass is the second most distinctive MC, especially glassware, as can be observed in the list of top-related LCs. Other MCs that relate to art can be observed further down the table and include crafts and jewelry. Antiques, the third most distinctive image MC, combines characteristics of both collectibles and art. The fifth most distinctive MC is Toys & Hobbies, with character and action figures among the top LCs, and some vintage games such as traditional board games and puzzles further down the list (not presented in the table). The Baby category is also high on the MC list, with LCs related to strollers, swings, and monitors. Other distinctive image query LCs, not shown in Table 4, include glassware, drinkware, and candle holders under the Home & Garden MC; antiquarian, collectible, Fiction, and magazines under Books; cups and mugs memorabilia under Sports; and film stock under DVDs & Movies. Overall, we observe that while visual search was used across the board, it was especially popular for collectible and vintage products, art, toys, and baby products. The Fashion category, which has been the subject of many previous studies on visual search (Liao et al., 2018; Bhardwaj et al., 2013; Bell et al., 2020; Shiau et al., 2020; Kang et al., 2019; Kim et al., 2016; Laenen et al., 2018), occurred in nearly \(10\%\) of the visual searches, but was not more popular than in textual searches. Distinctive image query LCs within the Fashion MC included outerwear, such as coats, jackets, and vests; bags, backpacks, and cases; and vintage/formal clothing such as suits, dresses, heels, and umbrellas.

Table 4 Most distinctive MCs and LCs in image queries relative to text queries according to KL divergence

6.2 Other characteristics

6.2.1 Title length

The average title length (in words) of SERP results was similar for image and text searches, at 11.06 (std: 2.01, median: 10.74) versus 11.56 (std: 1.93, median: 11.51), respectively. This was also reflected in a similar number of characters per title, at an average of 70.33 (std: 19.18, median: 65.82) versus 70.30 (std: 15.49, median: 69.64).

6.2.2 Image quality

The average image quality on the SERP showed no significant difference between image and text queries (\(p{>}.05\), two-tailed unpaired t-test) at 0.87 (std: 0.27, median: 0.99) compared to 0.80 (std: 0.33, median: 0.98), respectively. There was also no significant difference from image queries in the gallery flow (avg: 0.83, std: 0.30, median: 0.98), even though as shown in Sect. 5, there was a significant difference in terms of the image query quality. Overall, we see that the query’s modality does not have a significant impact on the image quality of the retrieved listings. It may indicate that the search engine copes well with the relatively low quality of the image queries, and is not biased towards either low or high quality listing images.
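For reference, such a comparison of mean image quality can be run with a standard two-tailed unpaired t-test, for example via SciPy; the score lists below are placeholders and not the actual data.

```python
from scipy.stats import ttest_ind

# Placeholder lists of per-SERP average image-quality scores (one value per query)
image_serp_quality = [0.87, 0.91, 0.99, 0.75, 0.88]
text_serp_quality = [0.80, 0.98, 0.60, 0.95, 0.82]

t_stat, p_value = ttest_ind(image_serp_quality, text_serp_quality)  # two-sided by default
print(f"t={t_stat:.3f}, p={p_value:.3f}, significant={p_value < 0.05}")
```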

6.2.3 Price

The average price for image SERP results was lower by a factor of 0.45 than for text SERP results. This ratio varied across MCs, but the general trend of image results being less expensive than text results on average remained: only 3 out of the 43 MCs had a higher average price for image results than for text results: Health & Beauty (average price ratio: 1.19), Antiques (1.11), and Travel (1.10). The other MCs had a higher average price for text results, e.g., Fashion (0.74), Pottery & Glass (0.73), Cellphones & Accessories (0.42), Collectibles (0.39), and, most sharply, Toys & Hobbies (0.31), Jewelry & Watches (0.25), Books (0.16), and Music (0.12).

We also inspected image searches with an average SERP price in the top quartile compared to the bottom quartile (quartile calculation was performed over all visual searches). We observed that image searches in the top price quartile included a considerably higher portion of gallery queries (26.19%) compared to searches in the bottom price quartile (12.76%). This suggests that gallery photos are used more often when searching for more expensive items, while the camera flow is especially common for low-cost goods. In terms of image quality, we observed no difference in either the query image quality or the SERP image quality between the top and bottom price quartiles.

7 Sessions

The query logs (both image and text) are partitioned into eBay sessions, based on a commonly used definition: a sequence of queries by the same user, without an idle time longer than 30 minutes between any pair of consecutive queries in the sequence (Hirsch et al., 2020; Jones & Klinkner, 2008). In this case, we refer to a query as any new combination of image or text with filters, as described in Sect. 3. The analysis in this section considers all queries, text and image, submitted throughout the experimental period, without any sub-sampling. This allows us to track complete sessions over the experimental period (February 2nd-29th, 2020).

Out of the image eBay sessions, \(62.94\%\) start with an image query, while the rest start with a text query, with the image query occurring later in the session. In addition, since the above definition of sessions often captures long sequences that involve more than one “quest” on the part of the user, we additionally define an intent session as a sub-sequence of an eBay session in which the dominant MC for all queries is identical (see the definition of a dominant category in Sect. 6.1).Footnote 4 Using this definition, we aim to capture shorter sessions that focus on the same intent.Footnote 5 We refer to an image session (either eBay session or intent session) as any session that contains at least one image query. All other sessions are considered text sessions.
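A minimal sketch of how a user’s query stream could be partitioned into eBay sessions (30-minute idle threshold) and further into intent sessions (splitting whenever the dominant MC changes) is shown below. The query objects, with their timestamp and dominant_mc fields, are assumptions for illustration rather than the actual log schema.

```python
from datetime import timedelta

IDLE_THRESHOLD = timedelta(minutes=30)

def ebay_sessions(user_queries):
    """user_queries: one user's queries, sorted by timestamp.
    A new eBay session starts after an idle gap longer than 30 minutes."""
    sessions, current = [], []
    for query in user_queries:
        if current and query.timestamp - current[-1].timestamp > IDLE_THRESHOLD:
            sessions.append(current)
            current = []
        current.append(query)
    if current:
        sessions.append(current)
    return sessions

def intent_sessions(session):
    """Split an eBay session into sub-sequences whose queries share the same dominant MC."""
    subs, current = [], []
    for query in session:
        if current and query.dominant_mc != current[-1].dominant_mc:
            subs.append(current)
            current = []
        current.append(query)
    if current:
        subs.append(current)
    return subs
```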

Table 5 presents session statistics. For eBay sessions, it can be seen that image sessions tend to be longer than text sessions, with a substantially lower portion of 1-query sessions. As a result, their average and median duration is also substantially longer. Yet, even when controlling for the number of queries (2, 3, and 5 are presented in the table), the duration of image sessions is longer than that of text sessions. As for idle time between queries in a session, it is also longer for image sessions, even when considering specific transitions, such as from the first query to the second, or from the second to the third (generally, idle times between queries tend to grow along the session). Inspecting intent sessions, the trends change, and session length, duration, and idle times become similar for image and text queries. It appears that while image queries are more commonly part of longer sequences of activities, perhaps implying higher engagement levels, when moving to intent sessions, which capture a focused intent, image and text sessions demonstrate comparable characteristics.

Table 5 Session characteristics: length (number of queries) properties; duration properties; and idle time between consecutive queries in the session

The eBay session characteristics for image queries are different for the two flows: sessions that include gallery queries are shorter in terms of the number of queries (avg: 4.9, std: 6.98, median: 3) and duration (avg: 30.00 minutes, std: 40.00, median: 12.25). The idle times are nearly identical for camera and gallery queries. The length and duration differences disappear when inspecting intent sessions.

8 Query refinement by attributes

Due to the structured nature of search results in e-commerce, refinement using a variety of attributes is a common feature of e-commerce search (Hirsch et al., 2020; Sondhi et al., 2018; Tunkelang, 2009). Upon submitting a query, the user is presented with different refinement options, often defined based on the category(ies) of the returned results, and can narrow down the list of retrieved results based on specific values, such as a color, size, or condition. Although such refinements are used in a rather small portion of the queries, they allow us to gain an understanding of specific information needs for visual versus textual search based on user interaction. We distinguish between general (global) filters (GFs), used across all categories, such as price and condition, and category-specific attributes (CSAs), such as color, material, or brand, which are defined based on the category(ies) of the returned results.

The analysis in this section is based on the original datasets of over 1.6 million image and text queries (Sect. 3). Generally, the use of GFs was much less frequent in image search compared to text search, with an image-to-text ratio of 0.23. Since most global filters focus on cost perspectives, from price through shipping to format (buy it now, auction, accepts offers), return policy, and savings and deals, this may imply that image searches are less focused on finding a good deal and more on discovery and recollection of a specific item (Togashi & Sakai, 2020; Liao et al., 2018).

The use of CSAs to refine the search results was even rarer in visual compared to textual search, with an image-to-text ratio of 0.17. Indeed, CSAs can often be captured by an image, while global filters often do not characterize the item’s visuals (except for, arguably, the condition filter). Finally, queries that were filtered by both a GF and a CSA were less frequent in visual compared to textual search, by a ratio of 0.27.

The sharp difference in CSA refinement gives another indication that image queries often reflect narrower information needs with richer sets of attributes than text queries, and therefore do not require further refinement. We further explore this by analyzing CSA refinement in text queries according to their number of terms. Figure 5 (right plot) shows a clear trend: the use of refinement by CSA decreases as the length of the text query increases. The portion of refinements in image queries is slightly lower than the portion of refinements for 8-word queries (which account for 0.58% of all text queries), suggesting that according to this signal, an image query is “worth” at least eight terms. This rough extrapolation likely reflects a lower bound since, as shown in previous work, users also refine their text queries by adding terms to the query itself (Hirsch et al., 2020), an option that does not currently exist in visual search. It should also be noted that users rarely input queries of more than eight terms; these account for only \(1.21\%\) of all text queries.

The use of query refinement varies substantially across MCs. Collectibles, Toys & Hobbies, Stamps, Antiques, Pottery & Glass, and Entertainment memorabilia – all categories shown to be popular for visual search (Table 4) – have generally low refinement use on both text and image queries, with a relatively high ratio between image and text. For Fashion and Jewelry & Watches, the use of refinements is generally high, with an image-to-text ratio higher than the overall ratio. For “technological” MCs, such as Cellphones & Accessories, Computers/Tablets & Networking, and Cameras & Photo, the ratio is especially low, with more frequent use on text and highly infrequent use on image.

Table 6 presents the most common GFs and CSAs used for text and image queries. The data for each type of GF and CSA is based on at least a few thousand data points. The upper section of Table 6 shows the most common GFs used in text and image queries. The four most popular GFs, which account for \(75\%\) of the image and \(66.5\%\) of the text total GF use, were condition, format, location, and price. The rightmost column of the table shows the image-to-text ratio of relative use for each filter. It can be seen that the condition filter, one of the few GFs that does not directly relate to cost, is the most commonly used in both image and text queries, with comparable relative use in both. For image queries, the location filter comes a close second, whereas for text queries format is second, while being much less popular for image queries. The filters for text queries include, further down the list (not shown in Table 6), a longer tail of filters related to shipping (e.g., free or expedited), return policy, promotions, and seller authority.

The lower section of Table 6 shows the relative distribution of CSAs used for refining image and text queries (and the image-to-text ratio). For image queries, the list is topped by brand and color, with material and style also having particularly high ratios. While brand and material are indeed challenging to detect based on an image, and style has been previously studied as a particularly popular attribute for visual search (Kim et al., 2016; Kang et al., 2019; McAuley et al., 2015), the relative popularity of the color attribute is rather surprising, and we therefore explored it further, as detailed later in this section. In Figure 3, example 3 was refined by a color (green) and example 4 was refined by both a material (leather) and a size (women’s 8). Further down the list of common image CSAs are the style-related heel height, sleeve length, and dress length. With a particularly high ratio are pattern (also studied in various papers (Bhardwaj et al., 2013; Shiau et al., 2020; Togashi & Sakai, 2020; Yang et al., 2017; Laenen et al., 2018)), team (relevant to sports merchandise), and, most extremely, original/reproduction, which again indicates the prevalence of vintage and collectible items in visual search. The text list is topped by the size attribute, which has double the relative frequency compared to image queries. The list includes many other size-related CSAs, such as shoe size and size type, as well as technology-related CSAs such as network, storage capacity, screen size, and operating system.

Table 6 Most clicked GFs and CSAs for refining text and image queries

Inspecting the values used for the popular GFs, we observed that when a price filter was used, the lower bound (minimum price) was applied relatively more often in image queries (ratio: 1.68), while the upper bound (maximum price) was used more often in text queries. The use of a lower price bound is typically common in vintage categories. For condition, more image queries were refined to various forms of ‘used’ (ratio: 1.18), likely due to the popularity of vintage categories, as shown in Table 4, while text queries were refined more frequently to ‘new’ items. For format, the proportion of filtering between ‘buy it now’ and ‘auction’ was identical between image and text queries.

Table 7 shows the most common values used for refinement in two of the most common CSAs: color and material. We focused on these two attributes since their list of common values is rather small, as opposed to other common CSAs such as brand, size, and type. The analysis is therefore based on at least a few hundred data points (over 700) and in most cases a few thousand. The colors that are more popular in image search are blue, green, white, and clear (7 times more popular than for text). We conjecture that users need to distinguish colors that are hard to detect in an image, most prominently clear items, but also white (from other bright colors) and (dark) green and blue (from black). In Figure 3, example 5, the color of the bottle is dark blue and the user explicitly used ‘blue’ as a color refinement. This also explains why color, despite being a highly visual attribute, is relatively high in image CSA use, as shown in Table 6. We observed similar trends in color value distribution across different MCs, such as Home & Garden and Fashion. For material, the lists of image and text values are more disparate: the text list is dominated by types of fabric, while the image list is more diverse, with a variety of materials, including ‘fabric’ itself.

We finally inspect image query refinement by flow. Generally, the use of refinements was more frequent for gallery image queries than for camera queries: a ratio of 2.83 for GFs and 3.51 for CSAs, but still not as frequent as in text queries. The distribution of the GFs was similar between the flows, with one noticeable difference: the price filter was used substantially more frequently in the gallery than in the camera flow, by a ratio of 2.31. We conjecture that when customers search by a gallery image, they often use an image from another website and seek to compare prices. The distribution of CSAs was also similar between the flows, with one difference: the size attribute was used substantially more frequently in the gallery than in the camera flow, by a ratio of 2.52. We conjecture that size is easier to capture when taking a photo with the device’s camera and using one’s hand or another object of known size for reference (e.g., see Figure 3, examples 5 and 6).

Table 7 Most common color and material values used for refinement of text and image queries, with respective image-to-text relative usage ratios

9 Clicks

9.1 Click-through rate and rank

The analysis in this section is based on the original datasets of over 1.6 million image and text queries (Sect. 3). Table 8 shows the ratio for various click characteristics between visual and textual search.Footnote 6 These include the click-through rate (CTR; the portion of queries for which at least one result was clicked) and, for clicked queries only, the average number of clicks (AVC), mean reciprocal rank (MRR), and average click rank (ACR). At the session level, it can be seen that the CTR is only slightly lower for image sessions than for text sessions, and the AVC is almost identical. Yet, as shown in Table 5, the length of image sessions is substantially greater than that of text sessions. Indeed, inspecting intent sessions, whose length is similar between image and text (Table 5), the CTR and AVC are substantially lower for image. Moving to the query level, the CTR ratio between image and text queries is as low as 0.485. The AVC ratio is also below 1, indicating that even for clicked queries, fewer results are clicked when the query is an image. Lower CTR and AVC were also reported for voice queries (Guy, 2016), which represent another newly introduced beyond-text query modality. Despite the fewer clicks, the MRR was higher for image queries than for text queries, indicating that clicks are more frequently performed on top results, despite the fact that users traverse a similar number of results, as noted in Sect. 6. For example, a higher portion of the clicked image queries included a click on the first result compared to text queries, at a ratio of 1.29. On the other hand, the average click rank (ACR) was also higher for image queries, indicating that some clicks on image queries are also performed relatively lower on the result list. Indeed, the standard deviation of click ranks was considerably higher for image queries than for text queries (ratio: 1.443). For voice queries, the MRR was reported to be similar to that of text queries, at a ratio of 0.97 (Guy, 2016). Overall, the lower CTR and AVC and higher MRR and ACR imply that visual search is often used for target finding (Su et al., 2018). These results also suggest there is more room for improvement in the ranking algorithms and user experience for visual search, as it is still in its infancy.
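The click metrics above follow standard definitions; the sketch below computes them from per-query lists of clicked ranks (an empty list meaning no click). It is a simplified illustration rather than the exact production measurement.

```python
def click_metrics(click_ranks_per_query):
    """click_ranks_per_query: one list of clicked ranks per query (rank 1 = top result)."""
    clicked = [ranks for ranks in click_ranks_per_query if ranks]
    ctr = len(clicked) / len(click_ranks_per_query)               # click-through rate
    avc = sum(len(r) for r in clicked) / len(clicked)             # avg. clicks per clicked query
    mrr = sum(1.0 / min(r) for r in clicked) / len(clicked)       # mean reciprocal rank
    acr = sum(sum(r) / len(r) for r in clicked) / len(clicked)    # average click rank
    return {"CTR": ctr, "AVC": avc, "MRR": mrr, "ACR": acr}

# Example: three queries, the second with no click
print(click_metrics([[1, 3], [], [5]]))
```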

In previous sections, we observed that gallery queries demonstrated characteristics more similar to text queries than the rest of the image queries. This was also reflected in the click characteristics, as shown in Table 8: the CTR and AVC ratios were higher, while the MRR and ACR were almost identical to text queries.

In general, the CTR was quite diverse across different MCs (considering the dominant categories, as defined in Sect. 6): the standard deviation was \(36.5\%\) and \(27.6\%\) of the mean CTR for text and image queries, respectively. The lower section of Table 8 presents the click ratio characteristics for seven of the most common image MCs. Most of the CTR ratios for the specific MCs were higher than the general CTR ratio: this is because visual search is relatively more popular in categories with lower CTR, such as Collectibles, than in categories with higher CTR, such as Fashion. The CTR ratio varied rather substantially across MCs: it was as low as 0.43 for Home & Garden, while reaching as high as 0.81 for Pottery & Glass. The ratio does not reflect the frequency of the MC for image versus text queries as presented in Table 4: for instance, it is higher for Fashion (0.595), which is not on that list, than for Collectibles (0.497), which tops it. The MRR also varied to some extent across MCs, with Fashion and Jewelry & Watches having an image-to-text ratio lower than 1, and Pottery & Glass having the highest ratio at over 1.25 (Figure 3, example 6 shows a query whose dominant category is Pottery & Glass). Typically, MCs with higher MRR also have lower ACR, indicating that clicks for such MCs are performed on higher-ranked results.

Table 8 Click-through rate (CTR), average number of clicks (AVC), mean reciprocal rank (MRR), and average click rank (ACR) ratios between image and text queries

9.2 Image similarity bias

Click models (Chuklin et al., 2015) have been extensively studied in textual search, aiming to model user behavior when interacting with the retrieved results. In this section, we focus on one factor that may play a key role for click models in visual e-commerce search: the similarity between the image query and a listing’s image. We hypothesize that a presentation bias may occur due to such similarity. To examine this, we define the similarity function between a query and its retrieved result (e-commerce listing) as the cosine similarity between the latent vector representations (size 300) of the query and the listing. For textual search, we used Word2Vec (Mikolov et al., 2013) trained over a corpus of 10 million titles sampled uniformly at random from the eBay inventory during the month of February 2020. For the query, we used the TF.IDF-weighted average of the query term vectors, while for the listing, we used the TF.IDF-weighted average of the listing’s title word vectors (Arora et al., 2017). For visual search, we used a ResNet-50 network (He et al., 2016) to learn image embeddings over more than 50 million listing images from the eBay site, across all major categories (Yang et al., 2017). We applied these embeddings to both the image query and the listing’s main image.
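For illustration, the following sketch shows the textual variant of this similarity function: the cosine similarity between TF.IDF-weighted averages of word vectors for the query and the listing title. The word-vector and IDF lookups are placeholders standing in for the internally trained Word2Vec model and corpus statistics; the visual variant would analogously apply cosine similarity to ResNet-50 embeddings of the query image and the listing’s main image.

```python
import numpy as np

def weighted_avg_vector(tokens, word_vectors, idf, dim=300):
    """TF.IDF-weighted average of word vectors (zero vector if no token is known)."""
    acc, weight = np.zeros(dim), 0.0
    for tok in set(tokens):
        if tok in word_vectors:
            w = tokens.count(tok) * idf.get(tok, 1.0)
            acc += w * word_vectors[tok]
            weight += w
    return acc / weight if weight > 0 else acc

def query_listing_similarity(query_tokens, title_tokens, word_vectors, idf):
    """Cosine similarity between the query and listing-title representations."""
    q = weighted_avg_vector(query_tokens, word_vectors, idf)
    d = weighted_avg_vector(title_tokens, word_vectors, idf)
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom > 0 else 0.0

# Toy usage with random 300-d vectors standing in for trained Word2Vec embeddings.
rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=300) for w in ["nike", "red", "shirt", "tee"]}
idf = {"nike": 2.5, "red": 1.2, "shirt": 1.0, "tee": 1.8}
print(query_listing_similarity(["nike", "red", "shirt"], ["nike", "red", "tee"], vecs, idf))
```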

Using this similarity definition, we performed the following analysis over image queries: for each image query q and rank \(r\in \{1,2,\ldots,10\}\), we compared r to the image similarity rank \(r_{img}\), calculated by ranking the images of the top 10 retrieved listings based on their similarity to the image query. In other words, \(r_{img}\) reflects an alternative ranking of the query’s top 10 results according to image similarity only. In contrast, the original ranking computed by the search engine considers additional factors, such as other modalities of the listings (e.g., title or attributes) and user behavior signals (e.g., past clicks). We then compared, for each \(r \in \{2,\ldots,10\}\), the CTR when r is lower than \(r_{img}\) with the CTR when r is equal to or higher than \(r_{img}\). Table 9 presents the CTR ratio between these two cases for \(r \in [2,10]\). It can be seen (in the table’s first row) that when the image similarity rank is higher than the actual rank, the CTR is considerably higher, by a factor ranging between 1.17 and 1.56, than when the image similarity rank is lower than or equal to the actual rank. This trend is consistent across all values of r from 2 to 10 and indicates that searchers who use an image as their query tend to click on listing results with images similar to their query. The results for the gallery flow (second row of Table 9) showed similar trends to all image queries. For comparison, we performed an analogous experiment for text queries, based on the similarity function described above between the textual query and the listing’s title. As can be seen in the third row of Table 9, we could not observe a similar bias for text queries: the CTR ratio ranges between 0.85 and 1.25 across ranks 2 to 10, and is close to 1 at many of the ranks, showing no clear bias towards results whose title is more similar to the query. For example 7 in Figure 3, all top 10 results included similar shoes, and those at ranks 4, 5, and 9 offered the identical product. Yet, only the result at rank 9 was clicked, presumably because its photo was very similar to the image query, while other, higher-ranked results had less similar photos. This demonstrates how image similarity comes into play when users interact with retrieved results in visual e-commerce search.
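A sketch of this analysis, under the assumption that image embeddings for the query and its top-10 results are available, is shown below. It derives the image similarity rank \(r_{img}\) for each result and, for a given actual rank r, computes the CTR ratio between the two cases compared above (r lower than \(r_{img}\) versus r equal to or higher than \(r_{img}\)); the data layout is hypothetical.

```python
import numpy as np

def image_similarity_ranks(query_emb, result_embs):
    """For the top-10 results of an image query, return r_img: the 1-based rank
    each result would receive if the results were ordered by image similarity only."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = np.array([r @ q / np.linalg.norm(r) for r in result_embs])
    order = np.argsort(-sims)                          # most similar first
    r_img = np.empty(len(result_embs), dtype=int)
    r_img[order] = np.arange(1, len(result_embs) + 1)
    return r_img

def ctr_ratio_at_rank(observations, r):
    """observations: (r_img, clicked) pairs collected for actual rank r across queries.
    Returns the CTR ratio between the case r < r_img and the case r >= r_img."""
    below = [clicked for r_img, clicked in observations if r < r_img]
    at_or_above = [clicked for r_img, clicked in observations if r >= r_img]
    if not below or not at_or_above:
        return float("nan")
    return np.mean(below) / np.mean(at_or_above)
```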

Table 9 CTR ratio for ranks 2 to 10 between queries for which the image similarity rank is higher than the actual rank and queries for which it is equal to or lower than the actual rank for image, gallery, and text queries

9.3 Examples

Thus far, we have seen many quantitative characteristics by which image queries differ from text queries. Next, we show a few examples of image and text queries that are likely to reflect the same shopping intent. To this end, we inspected image and text queries that led to the purchase of the same listing during our four-week experimental period. Table 10 shows eleven examples, including the image and text queries and the title and image of the purchased listing. In a few examples (4, 6), the purchased item is markedly different from the image query, suggesting a decision-making and exploration intent rather than target finding (Su et al., 2018).

Table 10 Example image and text queries that led to a purchase of the same item

10 Users

In this section, we focus on user characteristics. We first inspect the unique characteristics of relatively “heavy” (frequent) users of visual search. Then, we examine a variety of the characteristics analyzed in previous sections for text queries submitted by users who also submitted image queries. This allows us to control for user factors when comparing textual and visual search, by inspecting both text and image queries originating from the same set of users. Throughout this section, we refer to visual search users, or image users in short, as users who performed at least one visual search, i.e., submitted at least one image query, throughout our experimental period, as reported in Sect. 3. Image users account for \(1.64\%\) of all users who submitted any query (image or text) throughout the experimental period. Their overall number of queries (both text and image) amounts to \(6.32\%\) of all queries submitted during the experimental period. The analysis in this section considers all queries – text and image – submitted throughout the experimental period, without any sub-sampling of either users or queries. This allows us to track user behavior along our experimental period (February 2nd-29th, 2020), without discarding any of a user’s data due to sub-sampling.

10.1 Frequent visual search users

We define frequent users of visual search as the top decile of the group of visual search users during our experimental period, according to the total number of image queries they submitted throughout the period. The image queries submitted by the frequent users account for \(53.97\%\) of all image queries. Table 11 presents the ratio of occurrence of different characteristics between the frequent users and all users of visual search. In terms of query categories, determined by the dominant category on the SERP as defined in Sect. 6.1, it can be seen that the top 3 most distinctive visual search MCs, as presented in Table 4, are even more common in visual searches from frequent users. For example, the Collectibles category’s portion of visual searches is higher by a factor of 1.21 for frequent users compared to all visual search users. The prevalence of these MCs comes at the expense of other MCs that are in any case less frequent for visual search, such as Cellphones, Healthcare, and Sports, for which, as can be seen in the table, the ratio is substantially lower than 1.

Table 11 also indicates that the use of refinements, both by category-specific attributes and by global filters, is more common for image queries submitted by frequent users, by a factor of around 1.2. Finally, inspecting click behavior, the click-through rate and average number of clicks (for clicked queries) are both lower by a factor of about 0.9 for queries submitted by frequent visual search users. While frequent users submit more image queries, they perform fewer clicks per query. This behavior is generally typical of frequent users, likely because they perform more exploration than target-oriented searches (Su et al., 2018). For instance, the CTR and AVC for frequent users of textual search in our experimental period (the top decile of users by number of text queries submitted) are lower by a factor of 0.88 than for all textual search users.

Table 11 The ratio between frequent image users and all image users w.r.t different characteristics: portion of dominant categories on the SERP (top 3 and others), use of query refinement filters, and clicks

10.2 Text queries by visual search users

Our analysis in previous sections revealed prominent differences between image and text queries across a variety of characteristics. However, all of the analysis was performed over a random sample of text and image queries. Since visual search is still not mainstream, image queries originate from a small subset of users (\(1.64\%\) of all users, as mentioned above), while the vast majority of text queries in the sample originate from users who have never used visual search. While, as mentioned in Sect. 4, we controlled for basic factors, such as gender distribution, it is still possible that the relatively small group of users who used visual search throughout the one-month experimental period carries unique characteristics that also influence the observed differences between visual and textual search. To further examine to what degree the reported characteristics were affected by user differences and to what degree they really stem from the query modality, we set out to explore the characteristics of text queries submitted by visual search users. We refer to these queries as image user text (IUT) queries. Overall, they account for \(5.96\%\) of the text queries in our dataset. In addition, for another level of comparison, we also examine text queries by frequent image users, referred to as FIUT queries. These account for \(55.06\%\) of all IUT queries, indicating that frequent image search users also tend to frequently submit text queries. As mentioned above, this analysis considers all queries throughout the experimental period, without sub-sampling.

Table 12 compares different characteristics of image queries (by all visual search users) in terms of their ratio to the same characteristics of text queries (by all users), IUT queries, and FIUT queries. This enables us to examine to what degree the image-to-text characteristics reported in previous sections can also be observed when considering only text queries originating from the same user group, as reflected in the image-to-IUT column. The image-to-FIUT column is included as a reference, to reflect the trends for text queries by frequent users of visual search (who are also, as mentioned, more frequent users of textual search). The left section of the table focuses on click and refinement characteristics, as discussed in Sects. 9 and 8, respectively, while the right section inspects dominant categories on the SERP, as discussed in Sect. 6.1.

10.2.1 Click characteristics

Inspecting the click-through rate (CTR), it can be seen that the image-to-IUT ratio is 0.75, indicating that even for the same group of users, CTR is lower for image queries than for text queries. The ratio, however, is not as low as the general image-to-text ratio of 0.48, since the CTR for IUT queries is itself lower than the general CTR for text queries. We have already reported above that queries by more frequent users (either image or text) have lower CTR, and visual search users are in general more frequent search users (recall that they account for \(1.64\%\) of all users, but \(5.96\%\) of all text queries). Overall, these results indicate that part of the CTR difference reported for image compared to text queries stems from the nature of the users, while the rest can be attributed to the modality difference between image and text queries. Inspecting FIUT queries, the ratio is even closer to 1, giving another indication that the CTR decreases for more frequent users (recall that frequent image users account for \(10\%\) of the image users, but over \(55\%\) of the text queries originating from image users). Inspecting the average number of clicks (AVC) for clicked queries, we observe trends similar to the CTR: the image-to-IUT ratio is lower than 1, yet not as low as the image-to-text ratio, indicating that some of the observed gap between image and text queries can be attributed to the user group, while the rest is purely due to the modality. For AVC, the image-to-FIUT ratio is only slightly higher than the image-to-IUT ratio.

10.2.2 Query refinement

Inspecting the use of query refinements, it can be seen that the image-to-IUT ratio is very low for CSAs and GFs, at 0.26 for both. While this is not as low as the image-to-text ratios, it still indicates that refinements are substantially less common on image queries than on text queries, even when considering the same set of users in the analysis. In this case, the ratios for FIUT queries are rather similar to those for IUT queries, i.e., unlike for clicks, there is no substantial difference in refinement use stemming from the frequency of image search usage.

The bottom left section of Table 12 presents the ratios for a few specific CSAs and GFs for which there was a noticeable gap between image and text queries (i.e., a ratio far from 1), as reported in Sect. 8. It can be observed that the image-to-IUT (as well as the image-to-FIUT) ratios demonstrate similar trends to the image-to-text ratios, but with milder numbers (i.e., closer to 1), indicating that, similarly to the general use of refinements, the main differences in refinement types can also be attributed both to the different group of users and to the query modality.

10.2.3 Dominant category distribution

The right section of Table 12 presents the ratios for query category portions, calculated by the dominant category on the SERP, as described in Sect. 6.1. The upper section presents the top 5 most distinctive MCs for image compared to text queries, as reported in Table 4. The image-to-text ratios for each of these categories are therefore higher than 1 (this is the ratio between the portion of queries whose dominant category is the MC in question for image and text queries, respectively). For IUT (and FIUT) queries, it can be seen that the ratios are also all substantially higher than 1, indicating that these categories characterize visual search compared to textual search, rather than merely the set of visual search users. The lower ratios compared to the image-to-text ratio indicate, however, that some of the difference in the latter ratio can be attributed to the set of users who use visual search rather than to the modality of the query. The lower section of the table presents 5 categories that are more frequent on text queries (image-to-text ratio lower than 1), with similar trends, i.e., the image-to-IUT ratio is also lower than 1, even if not as low as the image-to-text ratio, suggesting that the categories that are less popular for visual search remain so even when considering the same set of users.

Table 12 Image-to-text, image-to-IUT, and image-to-FIUT ratios across a variety of characteristics: clicks and uses of query refinement (left) and portion of dominant categories on the SERP (right)

10.2.4 Time-of-day

Finally, Figure 6 nicely demonstrates the trends observed thus far for another characteristic: the time-of-day distribution, as reported in Sect. 4. Recall that we observed noticeable differences between image queries, which are more popular during daytime hours, and text queries, which are more popular in evening hours. The figure also includes the time-of-day distribution for IUT and FIUT queries and shows that these “fall” in between the image and text distributions, indicating that some of the difference between text and image can be attributed to the set of users who use visual search, while the rest can be related directly to the modality difference between image and text queries.

Fig. 6 Query distribution by hour of the day, including text queries by image users (IUT) and by frequent image users (FIUT)

Overall, the results in this section confirm our conjecture that some of the differences between visual and textual search stem from the distinct characteristics of the (small) subset of users who use visual search. On the other hand, the differences remain substantial even when controlling for this effect and comparing text and image queries from the same set of users: the image-to-IUT ratios consistently show trends similar to those reported throughout the paper, even if somewhat milder (especially for CTR).

11 Query performance predictors

The task of query performance prediction (QPP) (Cronen-Townsend et al., 2002; Carmel & Yom-Tov, 2010) aims at estimating the query difficulty as reflected by its retrieval effectiveness, in the absence of relevance judgments or user interaction signals. Two main types of QPPs have been studied in the literature: pre-retrieval QPPs, which estimate the query’s quality before the retrieval stage, based on the query itself and the corpus statistics (Hauff et al., 2008); and post-retrieval QPPs, which assess the query performance by considering the retrieved result list (Kurland et al., 2011). While query performance prediction has been studied in depth for traditional textual search, it has not been extensively studied for visual search. The only study we are aware of is a short paper that proposed two pre-retrieval visual QPPs (Li et al., 2012). In this section, we experiment with several pre- and post-retrieval QPPs for visual search. We evaluate their performance based on both traditional relevance judgements and direct query difficulty evaluation and compare them with classic QPPs applied to text queries.

11.1 Definitions

11.1.1 Textual QPPs

For text queries, we follow the list of QPPs described in a recent paper studying e-commerce textual search (Hirsch et al., 2020). For pre-retrieval, these include the query length (in words) (Carmel & Yom-Tov, 2010); the minimum, maximum, and sum of the IDF values of the query terms (Ponte & Croft, 1998); and the minimum, maximum, and sum of the variance of the TF.IDF values of the query terms across documents in the corpus (Zhao et al., 2008). These predictors have been shown to be effective for document search in large-scale studies (Hauff et al., 2009; Shtok et al., 2012).
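A minimal sketch of these textual pre-retrieval predictors, assuming precomputed corpus statistics (document frequencies and per-term TF.IDF value lists), might look as follows; the input structures are placeholders rather than the actual index format.

```python
import math
import numpy as np

def textual_pre_retrieval_qpps(query_tokens, doc_freq, n_docs, tfidf_values):
    """Pre-retrieval predictors for a text query.

    doc_freq: term -> number of corpus documents containing the term.
    tfidf_values: term -> list of the term's TF.IDF values across corpus documents,
                  used for the variance-based predictors (Zhao et al., 2008).
    """
    idfs = [math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in query_tokens]
    variances = [float(np.var(tfidf_values.get(t, [0.0]))) for t in query_tokens]
    return {
        "query_length": len(query_tokens),
        "idf_min": min(idfs), "idf_max": max(idfs), "idf_sum": sum(idfs),
        "var_min": min(variances), "var_max": max(variances), "var_sum": sum(variances),
    }
```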

11.1.2 Visual pre-retrieval QPPs

For visual search, we harness classic visual characteristics, taking advantage of the fact that the query is an image, to define the following pre-retrieval QPPs: the image size (in pixels); the image brightness and luminance, as described in Sect. 5; the portion of pixels covered by the image’s 1, 2, and 3 most dominant colors (where colors are defined by clustering in the RGB space using k-means with \(k{=}8\) (Burney & Tariq, 2014)); and the image quality, measured both using our own model for catalog quality estimation described in Sect. 5 and using a publicly-available implementation of a neural model for general image aesthetic quality assessment (Talebi & Milanfar, 2018). In addition, we considered the number of detected objects and the percentage of the image area covered by the largest detected object, as well as by all detected objects, as pre-retrieval QPPs, using the YOLO model (Redmon et al., 2016) for object detection. We also used the categorization of the image query for performance prediction, relying on an internal model (Yang et al., 2017). Given an image, the model predicts the top 10 LCs, associating each LC with a probability. We used the sum of the top k probabilities (for \(k\in \{{1,3,5}\}\)) and, inspired by Ozdemiray et al. (2014), the ratio between the probabilities of the top 2 categories as pre-retrieval QPPs.
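As an illustration of one of these image-based predictors, the sketch below computes the portion of pixels covered by the 1, 2, and 3 most dominant colors via k-means clustering in RGB space with \(k{=}8\), using scikit-learn; this is a simplified stand-in for the production feature extraction.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_color_portions(image_rgb, k=8):
    """Portion of pixels covered by the 1, 2, and 3 most dominant colors,
    where colors are defined as k-means clusters in RGB space (k=8 as in the text)."""
    pixels = image_rgb.reshape(-1, 3).astype(float)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    cluster_shares = np.sort(np.bincount(labels, minlength=k))[::-1] / len(pixels)
    return {f"top_{m}_colors": float(cluster_shares[:m].sum()) for m in (1, 2, 3)}

# Toy usage on a random 32x32 "image".
print(dominant_color_portions(np.random.randint(0, 256, size=(32, 32, 3))))
```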

For corpus-based predictors, we considered the minimum, maximum, and average similarity between the image query and a large collection of one million images sampled uniformly at random from the entire eBay inventory. In addition, we examined the two pre-retrieval QPPs previously proposed for visual search (Li et al., 2012). Both are based on concept extraction from images using Latent Dirichlet Allocation (LDA) (Blei et al., 2003) over visual words, with the visual words extracted using the SIFT algorithm for feature detection (Lowe, 2004): q-INS measures the query’s information need specificity and c-DSC measures the discriminability of concepts across the corpus (Li et al., 2012). Finally, we experimented with the minimum, maximum, and average IDF scores based on these visual words, analogously to the common text QPPs (these were not examined in the original paper (Li et al., 2012)).
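The corpus-similarity predictors can be sketched as follows, assuming the image query and the sampled inventory images have already been embedded (e.g., with the ResNet-50 embeddings described earlier); the embedding matrices are placeholders.

```python
import numpy as np

def corpus_similarity_qpps(query_emb, corpus_embs):
    """Min, max, and average cosine similarity between an image-query embedding
    and a sample of inventory image embeddings (one per row of corpus_embs)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    return {"sim_min": float(sims.min()),
            "sim_max": float(sims.max()),
            "sim_avg": float(sims.mean())}
```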

11.1.3 Visual post-retrieval QPPs

For post-retrieval QPPs, we first define the similarity between a query and its retrieved result (e-commerce listing) as the cosine similarity between the latent vector representations (size 300) of the query and the listing. For textual search, we used Word2Vec (Mikolov et al., 2013) trained over a corpus of 10M titles sampled uniformly at random from the eBay inventory during the month of February 2020 (the same month as our dataset, as described in Sect. 3). For the query, we used the TF.IDF-weighted average of the query term vectors, while for the listing, we used the TF.IDF-weighted average of the listing’s title word vectors (Arora et al., 2017). For visual search, we used a ResNet-50 network (He et al., 2016) to learn image embeddings over more than 50M listing images from the eBay site, spanning all major categories (Yang et al., 2017). We applied these embeddings to both the image query and the listing’s main image.

We examine the following post-retrieval QPPs (Hirsch et al., 2020): (1) Num results: the total number of retrieved results (Carmel & Yom-Tov, 2010); (2) STD: the thresholded standard deviation (Cummins et al., 2011), with a \(50\%\) threshold; (3) WIG: the weighted information gain (Zhao & Croft, 2007) without corpus-based normalization; and (4) SMV: the score magnitude and variance (Tao & Wu, 2014), which can be viewed as integrating STD and WIG, with the average retrieval score in the corpus as a normalizer. The last three predictors were computed over the embedding-based similarity described above. For textual search, these predictors were shown to be highly effective for document retrieval in various studies (Carmel & Yom-Tov, 2010; Shtok et al., 2012; Roitman et al., 2017). For visual search, to the best of our knowledge, we are the first to experiment with post-retrieval QPPs.
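The following sketch illustrates simplified versions of these post-retrieval predictors, computed over the embedding-based similarity scores of a query’s retrieved results. The exact formulations in the cited works include additional normalizations, so this is an approximation under the assumptions stated in the comments.

```python
import numpy as np

def post_retrieval_qpps(scores, corpus_mean_score, top_frac=0.5):
    """Simplified post-retrieval predictors over a query's retrieval scores
    (here: embedding-based query-listing similarities, assumed to be positive).

    scores: scores of the retrieved results, best first.
    corpus_mean_score: average score over the corpus, used as the SMV normalizer.
    """
    s = np.asarray(scores, dtype=float)
    top = s[: max(1, int(len(s) * top_frac))]

    std = float(top.std())       # thresholded standard deviation (50% threshold)
    wig = float(s.mean())        # weighted information gain, no corpus normalization
    mu = s.mean()
    smv = float(np.mean(s * np.abs(np.log(s / mu))) / corpus_mean_score)
    return {"num_results": len(s), "std": std, "wig": wig, "smv": smv}
```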

11.2 Evaluation

For evaluating the QPPs, we sampled uniformly at random 1000 text queries and 1000 image queries from our logs described in Sect. 3. We asked three in-house annotators who specialize in relevance judgements for e-commerce to judge the top 10 results of each query (a binary label per result: relevant or not relevant). In addition, we asked the annotators to provide a direct judgement of each query’s difficulty. The guidelines explained what query difficulty means (to what extent the intent of the query can be clearly and accurately understood), and annotators were asked to rate each query on a scale of 1 (very difficult) to 5 (very easy). Multiple examples of difficult and easy image and text queries were provided to the annotators: for example, an image with a clear object versus an image with a blurred object, or an image where the object in question is unclear due to the appearance of multiple unrelated objects; or clear wording (e.g., “nike red shirt” or “3ft round area rug”) versus vague or ambiguous text (“2003 complete set” or “preowned case”).

The Fleiss Kappa (Fleiss, 1971) among the three annotators was 0.91 and 0.82 for relevance judgements of text and image queries, respectively, and 0.86 and 0.84 for direct difficulty judgements of image and text queries, respectively. We used the relevance judgement annotations to calculate the average precision (AP) at \(k{=}10\) for each query. We evaluate the image and text QPPs using two metrics: the Pearson correlation coefficient (r) and the Kendall rank correlation (K\(\tau\)). Both metrics are commonly used for measuring the effectiveness of QPPs (Carmel & Yom-Tov, 2010; Kurland et al., 2011; Zhao et al., 2008). We use the two metrics to measure the prediction quality w.r.t both the AP@10 values induced by the relevance judgements and the direct query difficulty judgements.
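For completeness, a small sketch of this evaluation step: computing AP@10 from the binary relevance labels and correlating a predictor’s per-query values with a target (AP@10 or the direct difficulty score) via Pearson’s r and Kendall’s \(\tau\), using SciPy; the inputs are hypothetical arrays of per-query values.

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

def average_precision_at_k(relevance, k=10):
    """AP@k from binary relevance labels of the top-k results (best first)."""
    rel = np.asarray(relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

def qpp_quality(predictor_values, target_values):
    """Pearson's r and Kendall's tau between per-query predictor values and a
    per-query target (AP@10 or the 1-5 direct difficulty judgement)."""
    r, _ = pearsonr(predictor_values, target_values)
    tau, _ = kendalltau(predictor_values, target_values)
    return {"pearson_r": float(r), "kendall_tau": float(tau)}
```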

11.2.1 Pre-retrieval QPPs

Table 13 presents the performance results for pre-retrieval QPPs for text queries, while Table 14 presents the same results for image queries. For text queries, none of the predictors showed statistically significant correlations with either the relevance or the direct judgements (as also reported in Hirsch et al. 2020). For image queries, it can be seen that a variety of predictors demonstrated statistically significant performance. One such query-based predictor was the catalog quality score. While it is calculated using an internal model, this demonstrates that significant performance prediction can be achieved relying only on image characteristics. The external aesthetic quality score (Talebi & Milanfar, 2018), however, did not demonstrate similar performance. In Figure 3, examples 7 and 8 show image queries with high catalog quality but a low aesthetic quality score (a Funko Pop collectible toy and a handbag, respectively), while example 9 shows an image query with high aesthetic quality and a low catalog quality score (only a small part of the product is shown). The brightness QPP showed statistically significant correlations, but only with the direct judgement evaluation, perhaps because it is a straightforwardly visual characteristic.

The highest correlations across image pre-retrieval QPPs were attained by the top k category predictors. Recall that we considered the probability sum of the top k predicted LCs as well as the ratio between the probability of the top LC and that of the second most probable. It can be seen that all category-related predictors yielded statistically significant correlations for both AP@10 and the direct judgments. The highest correlations were achieved by merely considering the probability of the top LC or its ratio to the probability of the second most probable LC, indicating that image queries whose category can be clearly identified yield good retrieval performance. The QPPs based on object detection did not yield any significant results, suggesting no obvious correlation between the number of objects in the image, or their area, and its performance as a query.

The corpus-based predictors that consider similarity to the inventory’s collection of images yielded significant performance. The maximum similarity with the corpus yielded significant performance w.r.t both the relevance and direct judgments, while the minimum and average similarity yielded significant results only w.r.t the relevance and direct judgements, respectively. The two previously-proposed QPPs based on visual words (Li et al., 2012) yielded correlations that were not statistically significant, indicating they are not as effective for visual e-commerce search as reported for general visual search. An explanation may lie in the unique characteristics of e-commerce image queries, as described in Sect. 5 and demonstrated in Figure 3, which are focused on objects for purchase rather than scenes. Predictors based on visual word IDF showed insignificant correlations, similarly to their counterparts on text queries, as reported in Table 13.

Overall, we identified both query-based and corpus-based pre-retrieval QPPs that demonstrate high performance for image queries.

Table 13 Pre-retrieval QPP performance results for text queries w.r.t relevance (AP@10) and direct judgements
Table 14 Pre-retrieval QPP performance results for image queries w.r.t relevance (AP@10) and direct judgements

11.2.2 Post-retrieval QPPs

Table 15 shows the evaluation results for post-retrieval QPPs. For text queries, the STD and especially the SMV QPPs showed statistically significant prediction performance, aligned with past work showing that post-retrieval predictors are more powerful than pre-retrieval predictors (Carmel & Yom-Tov, 2010; Hirsch et al., 2020). For image queries, the STD results were insignificant, while SMV was significant only by the K\(\tau\) metric. The WIG predictor, on the other hand, demonstrated significant results and yielded the best prediction w.r.t the relevance judgements, by both the Pearson’s r and Kendall’s \(\tau\) metrics, out of all pre- and post-retrieval QPPs. Finally, we note that for all visual QPPs, the correlations showed very similar trends when inspecting camera and gallery queries separately. In summary, our experimentation with QPP for image queries revealed various pre- and post-retrieval QPPs with significant prediction performance, leaving room for further development of effective QPPs for visual search. The broad set of effective pre-retrieval QPPs, including catalog quality, category association, brightness, and corpus similarity, indicates that image queries contain rich information that enables better performance prediction than for text queries, even without inspecting the retrieved results.

Table 15 Post-retrieval QPP performance results w.r.t relevance (AP@10) and direct judgements

12 Discussion and implications

Our study disclosed various differences between visual and textual search. In this section, we summarize the key findings, discuss implications, and suggest directions for future work.

Query categories Much of the existing literature on visual e-commerce search focuses on the Fashion category (Liao et al., 2018; Bhardwaj et al., 2013; Bell et al., 2020; Shiau et al., 2020; Kang et al., 2019; Kim et al., 2016; Laenen et al., 2018), which exhibits many visual characteristics. Our analysis, however, shows that visual search is widespread across many e-commerce categories, and is especially popular in comparison with textual search for collectibles, vintage, art, toys, and baby products. These categories often share information need aspects that are harder to express verbally, but can be captured visually, such as style, type, and pattern. On the other hand, categories that require extensive textual specification, such as Electronics, or areas in Fashion with characteristics that cannot be easily captured visually (e.g., size in sneakers), are less popular in visual search. The substantial differences between image and text in query categories and their characteristics, as exhibited throughout our study, suggest that search tools that build on query classification, such as pre-retrieval category identification, sponsored or promoted results, query expansion, and even result ranking, may need to be adapted for visual search due to the different span of categories. For example, more attributes related to vintage products, such as year of manufacture, may be presented for image queries.

Search broadness Previous work (Bhardwaj et al., 2013; Zhang et al., 2018) noted that visual search provides a superior entry point to text for fine-grained item description, but provided no empirical evidence. Our analysis shows that image queries are indeed more specific than text queries. This is reflected in a lower number of retrieved results, a narrower span of categories on the SERP, and a substantially sparser use of refinement by attributes. While the use of refinement by attributes decreases for text queries as they become longer, it only drops to the level of image queries for highly verbose queries (over 8 tokens), which are very rare. Using an image as a query allows users to convey more information about the desired item than a textual query (Laenen et al., 2018; Zhang et al., 2018; Wróblewska & Rączkowski, 2016) and, as our analysis shows, influences the retrieval process and user interaction with the retrieved results. Image queries remove challenges related to named-entity disambiguation (e.g., is ‘orange’ a color or a brand?), but at the same time add new challenges, such as differentiating dark colors from black or distinguishing types of material. With the rapid development of e-commerce and the explosive growth of online shopping markets, efficiently guiding users through a huge inventory has become essential (Hsiao et al., 2014). For visual search, the choice of attributes presented for users to refine their query should be different than for textual search, and focus on aspects that are hard to articulate by image. In addition, search interfaces should evolve to support easy and natural combinations of image and text, such as expanding a visual search with keywords (Laenen et al., 2018) or using an image to refine a textual search. This can help customers articulate their needs more easily in the multi-modal e-commerce domain, since some aspects are easy to express by image, while others are easier to articulate using text.

Sessions Search sessions capture user behavior that is often focused on one search goal, spanning multiple queries, SERP traversals, and interactions with the search results. Our results indicate that image sessions tend to be longer than text sessions, and more rarely consist of a single query. The length of image sessions is reflected, beyond the higher number of queries per session, in the duration of the session and the idle time between queries in the session. Overall, a visual search session is an experience that lasts longer, either because it is more engaging or because it requires more effort on the part of the user. In either case, visual search systems should improve the tools they provide for users to navigate along their search sessions, allowing them to more easily refine, expand, and edit their queries along the session.

Visual similarity bias User interaction with search results introduces a variety of biases, such as position bias and trust bias (Wang et al., 2016, 2018). These biases influence the way user interaction with the results can be interpreted in user behavior models, such as click models (Chuklin et al., 2015). Our experiments suggest a new type of bias introduced in visual search: a bias towards results whose images are particularly similar to the query. Notably, visual e-commerce search is not purely based on image similarity, and includes a variety of other features. We demonstrated that this type of bias is unique to visual search and consistent across both the camera and gallery flows. In contrast, it is not observed in a similar setup for textual search, for results with similar titles. While previous work has identified an “examine bias” for multimedia results when they appear mixed with textual results (Wang et al., 2013), the bias we observe occurs across results of similar nature, retrieved for a visual e-commerce query. Future work focusing on user behavior models for visual search should account for this bias and provide means to model it when comparing different retrieval models.

User intent Image queries are used for two principal intents (Su et al., 2018; Togashi & Sakai, 2020): target finding aims to look up a specific item (Laenen et al., 2018), while decision making aims at the discovery of visually-similar items (Zhai et al., 2017). The two use cases are different in nature and to some extent resemble the navigational versus informational intent classification suggested for Web search in its early days (Broder, 2002). Our analysis and examples demonstrate the use of both types of intent, but suggest no obvious way to distinguish between them at retrieval time. Visual search interfaces may therefore consider providing an explicit means for users to indicate whether they are looking for an “identical item” or a “similar look” when they input an image query, so the intent can be better captured and served.

Query performance The click-through rate and average number of clicks are substantially lower for image queries than for text queries. While this is not uncommon for a new query modality (Guy, 2016), it also implies there is more room for improvement in serving image queries, as visual search is still in its early stages. This is also reflected by the higher MRR and the longer sessions that involve image search. These findings imply that users undergo a more disparate and less coherent experience when they search by an image, leaving room for improvement in ranking methods, retrieval models, and result presentation. Our analysis indicates a bias towards listings with an image similar to the image query, which may call for different click models and a more prominent presentation of images on the results page. Our experimentation with query performance prediction indicates it is applicable to visual search. We identified both pre- and post-retrieval QPPs that demonstrate high prediction performance, even in comparison with traditional QPPs used for textual search. Our work expands the list of visual QPPs we experimented with, to also include predictors that consider the category of the query; the number of objects in the image query and their size; and richer surface characteristics, such as luminance and the portion of pixels covered by the top 1 and 2 most dominant colors. The evaluation of the QPPs is also expanded to include direct judgement of the query difficulty, alongside the traditional correlation of the predictors with the results’ relevance judgements. Among the most effective pre-retrieval visual QPPs are those that consider the image’s catalog quality, its similarity to the entire corpus, and its categories. The most effective post-retrieval visual QPPs include the score magnitude and variance (SMV) and, most notably, the weighted information gain (WIG). The variety of effective pre-retrieval QPPs indicates that image queries can be used to predict retrieval performance effectively before retrieval takes place. Further research is required to enrich and expand the list of visual QPPs, explore their combinations, and apply them to improve the search experience at large.

Camera vs. gallery queries Our analysis revealed a variety of fundamental differences between visual search performed using a photo captured at query time by the device’s camera and an image uploaded from the device’s gallery. Gallery image queries are rarer (20% of all queries), brighter, of higher catalog quality, and more often horizontal. Camera queries are more common for vintage, collectibles, and media, whereas gallery queries are more frequent for fashion, jewelry, and watches. These differences are reflected in user interaction with the SERP: gallery images demonstrate a higher click-through rate and more frequent use of refining attributes. These findings suggest that visual search engines may benefit from serving the two types of image queries differently. For example, the selection of similarity metrics, the categorization model, and the refining attributes can be adapted accordingly. Despite the different quality of camera and gallery queries, the images of retrieved listings were found to be of similar quality in both cases. Camera and gallery queries also share similar characteristics in terms of the number of categories on the SERP (even if the categories themselves are different), the bias towards results with similar images, and performance prediction.

Item price Price plays a central role in e-commerce search, as buyers aim to get the best deal for the products they seek. Our results suggest that results on visual search are generally less expensive than for textual search, perhaps characterizing the early adoption stage, where users are more reluctant to use a new technology for a more significant transaction. Nonetheless, there was a noticeable difference between camera and gallery queries, with the latter yielding more expensive results. This implies that for more expensive goods, users are more likely to make the effort to search for an appropriate gallery image, whereas camera images are more often used for casual searches for low-cost items.

Visual search users As visual search is still in its infancy, its group of users is a subset of e-commerce search users, representing early adopters of the technology. Our analysis delves into the behavioral characteristics of this unique group of users and indicates that some of the differences between visual and textual search stem from the unique characteristics of the (small) subset of users who use visual search. That said, the reported differences between visual and textual search remain considerable even after controlling for this factor and comparing visual and textual search across the same set of users. We refine the findings reported here and in our previous work (Dagan et al., 2021) by distinguishing the contribution stemming from the different user groups from the contribution stemming from the actual search type (textual versus visual). Overall, a substantial portion of the difference persists even when considering the same set of users, and can thus be attributed to the actual difference between visual and textual search.

12.1 Limitations

The findings of this work are based on e-commerce search – visual and textual – at eBay. As one of the world’s largest e-commerce platforms, we believe eBay well represents how search operates at scale over both textual and visual queries. Yet, the many implementation details and algorithmic choices made by eBay’s search engine can influence the results presented in this study. These include click-through rates, category distribution on the SERP, the use of attributes and global filters to refine searches, and session characteristics. We believe that the scale of this study, based on over 1.5 million visual and 1.5 million textual queries, provides a robust foundation for the analysis we presented. Future research should validate and contrast these results with findings from other e-commerce platforms, to see how they generalize beyond eBay. This work provides a significant first step towards a deep understanding of visual e-commerce search and its differences from textual search.

Another caveat we wish to reiterate relates to the relative sparsity of data used for the query refinement analysis. Since refinement is rather rare in itself, and since segmenting by filter types and values renders much smaller datasets, the results for this part of the analysis should be treated with extra care. While they mark clear trends, these are often based on only a few thousand data points and may therefore be less robust than the rest of the findings presented in this work.

This work emphasizes the benefits of using images for visual search, but it should also be noted that some drawbacks exist. First, it requires either the use of a camera to capture a photo or the upload of a gallery image, both of which may often be less accessible than merely typing a query using a keyboard or a touch screen. Second, the processing of image queries requires more computational power from the search system. Finally, many e-commerce information needs are hard to express by image, from size and material aspects, through brand and model names, to technical specifications.

Finally, as visual search use is in its early stages, its usage patterns may further evolve in the years to come. The results presented in this study represent early adoption by users whose characteristics are somewhat different from those of the general population using e-commerce search, as demonstrated in our user analysis section. Future research should further explore the characteristics and motivations of users who adopt visual search. We also hope to track the evolution of visual e-commerce search as it is destined to become more popular.

12.2 Future directions

Additional future directions for visual e-commerce search research are abundant (and necessary). For example, query reformulation has been studied in textual e-commerce search (Hirsch et al., 2020), and can serve to track the evolution of image queries along a session and to identify difficulties and gaps. Additional methods to reformulate an image query using visual means, such as editing the image query or using multiple images as input, can help make reformulation more applicable in visual search. Editing and adding images can also help encourage query refinement, since the use of refining attributes is low with the current interfaces, which are inherited “as is” from textual search. We inspected camera and gallery image queries as the two principal flows that prompt a visual search. Future research should explore the triggering of visual e-commerce search from an external context, e.g., by clicking an image in a news article or a social media feed (Shiau et al., 2020). This type of use case can play a central role as an entry gate to e-commerce; understanding how to make the context transition productive and engaging can yield substantial benefits.

One of the benefits of visual search, as mentioned in the introduction, is allowing users to search without familiarity with the domain’s terminology (e.g., in special categories of Fashion or Collectibles). Once results are presented, users may learn the jargon by viewing the textual facets of the returned items, such as the title, description, and attributes. Future research may further inspect user transitions between visual and textual searches in specific domains. Finally, our analysis revealed some common characteristics between visual and voice search. The connection between the two should be further studied as both become more widespread (Carmel et al., 2020; Tsagkias et al., 2020). The integration of visual and voice search can also be explored as a means to provide a more complete e-commerce search experience that does not require typing.