Open Cross-Domain Visual Search

This paper introduces open cross-domain visual search, where categories in any target domain are retrieved based on queries from any source domain. Current works usually tackle cross-domain visual search as a domain adaptation problem. This limits the search to a closed setting, with one fixed source domain and one fixed target domain. To make the step towards an open setting where multiple visual domains are available, we introduce a simple yet effective approach. We formulate the search as one of mapping examples from every visual domain to a common semantic space, where categories are represented by hyperspherical prototypes. Cross-domain search is then performed by searching in the common space, regardless of which domains are used as source or target. Having separate mappings for every domain allows us to search in an open setting, and to incrementally add new domains over time without retraining existing mapping functions. Experimentally, we show our capability to perform open cross-domain visual search. Our approach is competitive with respect to traditional closed settings, where we obtain state-of-the-art results on six benchmarks for three sketch-based search tasks.


Introduction
This work investigates categorical search in an open cross-domain setting. Cross-domain visual search has made a lot of progress recently in a closed setting, where examples of categories, as natural images [13,42] or 3D shapes [22][23][24], are retrieved from sketches. Such a closed setting only considers a search from one fixed source domain to one fixed target domain. In practice, however, categories come in many forms [25,36,51]. Hence, we may have queries from several source domains, or want to search in any possible combination of source and target domains. As a first contribution of this work, we introduce open cross-domain visual search: we search for categories from any source domain to any target domain, with the ability to search from and within multiple domains simultaneously. Addressing the domain gap between source and target domains has proven to be effective for cross-domain search [4,10,12,13,44,48,52,54]. While intuitive and compelling, focusing on domain adaptation with pair-wise training makes the search unsuited for an open setting where multiple visual domains are available. To move towards an open setting, we should align examples by the very thing that unites them, namely their semantics, rather than aligning the domains they originate from. As a second contribution, we propose a simple approach for open cross-domain visual search, where we start from a common semantic space in which categories are represented by hyperspherical prototypes. For every domain, we learn a function to map visual inputs to their corresponding prototypes in the common semantic space, as illustrated in Figure 1. Query representations for search are further refined with neighbours from other domains through a spherical linear interpolation operation. Once trained, the proposed formulation allows us to search among any pair of domains.
Since all domains are now aligned semantically in the common semantic space, this enables a search from multiple source domains or in multiple target domains. Lastly, new domains can be added on-the-fly, without the need to retrain previous models.
As a third contribution, we perform extensive evaluations to demonstrate our ability to perform open cross-domain visual search, as well as our efficacy in standard closed settings compared to current approaches. For open cross-domain visual search, we perform several novel demonstrations showing: i) a search between any pair of source and target domains without hassle, ii) a search from multiple source domains, and iii) a search in multiple target domains. While designed for the open cross-domain setting, we find that our approach works well in conventional closed settings as well. We compare on sketch-based image and 3D shape retrieval. Across three tasks and six benchmarks, we obtain state-of-the-art results, highlighting the effectiveness of our approach. All code and setups will be released to foster further research in open cross-domain visual search.

Related Work
Cross-domain visual search A wide range of works have focused on cross-domain visual search by setting the source domain as sketches. Natural images [13,42] or 3D shapes [19,22,24] of the same category are then retrieved given the sketch query. When searching for natural images from a sketch, a common approach is to bridge the domain gap between sketches and images [10,12,44,54]. Shen et al. [44] fuse sketch and image representations with a Kronecker product layer [20], while Yelamarthi et al. [54] introduce domain confusion with generative models. Dey et al. [10] combine gradient reversal layers [15] with metric learning losses [16,43] to further enforce a domain agnostic embedding space. Dutta and Akata [12] tie the semantic space with visual features by learning to generate them. Alternatively, Liu et al. [27] preserve the knowledge from a pre-trained model. Hu et al. [19] have also explored few-shot image classification by synthesizing classifiers from sketches. By focusing on domain adaptation, current approaches are limited to a mapping from one source domain (e.g., sketch) to one target domain (e.g., image). In this paper, we move this paradigm towards open cross-domain visual search, where search occurs from any source domain to any other target domain.
Searching for 3D shapes from a sketch has been accelerated by the SHREC challenges [22][23][24]. A recent trend is to perform cross-domain retrieval from 2D image domain to the 3D shape domain [4,8,40,47,48,52]. In this setting, Wang et al. [48] map both sketches and 3D shapes in a similar feature space with a Siamese network [6,17], while Tasse et al. [47] learn to regress to a semantic space with a ranking loss [14]. Dai et al. [8] correlate both sketch and 3D shape representations to bridge the domain gap. Xie et al. [52] employ the Wasserstein distance to create a barycentric representation of 3D shapes. Qi et al. [40] apply loss functions on the label space rather than the feature space. Chen et al. [4] propose an advanced sampling of 2D views for unaligned shapes. Akin to [40,47], we place a central role on semantics for cross-domain search. In this paper, we go beyond searching for only 3D shapes to searching among any number of available target domains. We map multiple domains to semantic prototypes in a common embedding space, which alleviates the need for multistage training and negative sampling schemes.
Using multiple domains has recently been investigated in unsupervised domain adaptation [7,37] or domain generalization [1] works, where the task is to classify unlabeled target samples by learning a classifier on labeled source samples. Learning a classifier from multiple sources has been shown to be beneficial for both tasks (e.g., [2,11,36,53,57]). In this paper, we focus on a different multi-domain task, namely open cross-domain visual search.
Learning with prototypes Learning metric spaces with prototypes for image retrieval [9,28,34,49,50,55] and classification [5,31,32] provides a simpler alternative to common contrastive [6,17] or triplet [43] loss functions. No complex sampling is required, making the training easier in return [34,50]. A first line of work learns prototype representations, such as the center loss [50], the proxy loss [34,55], and derivations that introduce a margin in the distance measure [9,28,49]. A second line of work fixes prototype representations. Mensink et al. [31] set class means as prototypes in the embedding space for zero-shot classification. Chintala et al. [5] show that regressing to one-hot prototypes is close to a softmax classifier. Mettes et al. [32] better position prototypes on a low-dimensional hypersphere for classification and regression. Here, we take inspiration from this metric learning literature with prototypes and leverage them for the problem of open cross-domain visual search. We create a common semantic space where classes are represented by prototypes on a hypersphere. Every domain has its respective model to map visual inputs to the common semantic space where the cross-domain search occurs.

Problem formulation
The problem formulation of open cross-domain visual search is illustrated in Figure 2. While the closed cross-domain setting focuses on one fixed source $s$ and one fixed target $t$, the open cross-domain setting searches for categories from any source domain $s_k$ to any target domain $t_k$. As multiple domains now become available, this opens the door for combining multiple domains at both source and target positions. Thus, the main difference between the closed setting and the open setting lies in the ability to leverage multiple domains for categorical cross-domain search.
Formally, let $\mathcal{D}$ denote the set of all domains to be considered. Rather than making an explicit split of a dataset into source and target, we consider a large combined visual collection $\mathcal{T} = \{(x_n^d, y_n)\}_{n=1}^{N}$, where $x_n^d \in \mathcal{I}_d$ denotes an input example from a visual domain $d \in \mathcal{D}$ of category $y_n \in \mathcal{Y}$.

[Figure 2 panels: (a) one source to one target, (b) any source to any target, (c) many sources to any target, (d) any source to many targets.]

Proposed approach
We pose open cross-domain visual search as projecting any number of heterogeneous domains onto prototypes in a common and shared hyperspherical semantic space. First, we outline how to represent categories in the semantic embedding space. Second, we propose a mapping function from every domain to the common semantic embedding space.
Categorical prototypes. We leverage the concept of prototypes to represent categories in a common semantic space. Every category is represented by a unique real-valued vector, corresponding to a categorical prototype. Hence, the objective is to align examples, coming from different domains but with the same category label, to the same categorical prototype in the common semantic space. For every category $y \in \mathcal{Y}$, we denote its prototype on the semantic space as $\phi(y) \in \mathbb{S}^{D-1}$ for a $D$-dimensional hypersphere. Relying on semantic relations enables searching for unseen classes using models trained on seen categories [14,35]. In this work, we opt for word embeddings (e.g., word2vec [33] or GloVe [38]) to represent categories, as these embeddings adhere to the semantic relation property.
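To make the construction concrete, the prototypes can be obtained by projecting the word embedding of every category name onto the unit hypersphere. A minimal numpy sketch, using hypothetical toy vectors rather than actual word2vec or GloVe embeddings:

```python
import numpy as np

# Hypothetical toy embeddings; in the paper, categories would be
# represented by word2vec [33] or GloVe [38] vectors of their names.
word_vectors = {
    "cat": np.array([0.2, 0.7, 0.1]),
    "dog": np.array([0.3, 0.6, 0.2]),
    "car": np.array([0.9, 0.1, 0.4]),
}

def make_prototypes(word_vectors):
    """Fix the categorical prototype phi(y) of every category by
    projecting its word embedding onto the unit hypersphere."""
    return {y: v / np.linalg.norm(v) for y, v in word_vectors.items()}

prototypes = make_prototypes(word_vectors)
```

Once fixed, these prototypes remain unaltered during training; only the per-domain mapping functions are learned.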
Mapping domains to categories. For every domain $d \in \mathcal{D}$, we learn a separate mapping function $f_d(\cdot) \in \mathbb{S}^{D-1}$ to the common and shared semantic space. Separate mapping functions are not only easy to train, they also enable us to incorporate new domains over time. Indeed, we only have to train the mapping of the new incoming domain, without retraining the previous mapping functions of existing domains. The mapping function itself is formulated as a convolutional network (ConvNet) with $\ell_2$-normalization on the $D$-dimensional network outputs.
We propose the following function to map an example of a certain domain $d$ to its categorical prototype in the shared semantic space:

$$p(\hat{y} = y \mid x) = \frac{\exp\left(-s \, c(f_d(x), \phi(y))\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(-s \, c(f_d(x), \phi(y'))\right)},$$

where $s \in \mathbb{R}_{>0}$ denotes a scaling factor, inversely equivalent to the temperature [18]. Intuitively, the scaling controls how samples are spread around categorical prototypes. The amount of scaling $s$ is a hyperparameter that we study in the supplementary materials. The distance function $c(\cdot, \cdot)$ is defined as the cosine distance:

$$c(u, v) = 1 - \frac{\langle u, v \rangle}{\|u\| \, \|v\|},$$

where $\langle \cdot, \cdot \rangle$ is the dot product. As both $f_d(x)$ and $\phi(y)$ lie on a hypersphere, they have unit norm. Finally, learning every mapping function $f_d$ is done by minimizing the cross-entropy over the training set:

$$\mathcal{L}_d = -\sum_{n=1}^{N} \log p(\hat{y} = y_n \mid x_n^d).$$

In our approach, the representations of the categorical prototypes remain unaltered. Hence, we only take the partial derivative with respect to the mapping function parameters.

Searching across open domains. In the search evaluation phase, similarity between source and target samples is measured with the cosine distance in the shared semantic space.
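The mapping objective, a softmax over negative scaled cosine distances to the fixed prototypes, can be sketched in numpy as follows. We assume the mapped sample $f_d(x)$ and the prototypes $\phi(y)$ are already given as unit-norm vectors; function names are illustrative, not from a released implementation:

```python
import numpy as np

def cosine_distance(a, b):
    # Both vectors are unit-norm, so the cosine distance reduces to 1 - <a, b>.
    return 1.0 - a @ b

def category_probabilities(f_x, prototypes, s=10.0):
    """Softmax over negative scaled cosine distances to every prototype,
    i.e. p(y | x) for a mapped sample f_x = f_d(x)."""
    logits = np.array([-s * cosine_distance(f_x, p) for p in prototypes])
    logits -= logits.max()            # subtract max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

def cross_entropy(f_x, prototypes, target_idx, s=10.0):
    """Per-sample loss; only the mapping network would be updated,
    the prototypes stay fixed."""
    return -np.log(category_probabilities(f_x, prototypes, s)[target_idx])
```

In a full implementation, the gradient of this loss would flow only into the ConvNet parameters of $f_d$, since the prototypes remain frozen.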
Given one or more queries from different source domains, we first project all queries to the shared semantic space and average their positions into a single vector. Then, we compute the distance to all target examples to rank them with respect to the query. We can combine, on the fly, source domains to search from or target domains to search within.
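The search step above can be sketched as follows, assuming all embeddings are unit-norm vectors in the shared space (names are illustrative):

```python
import numpy as np

def combine_queries(query_embeddings):
    """Average query embeddings (possibly from different source domains)
    in the shared space and renormalize onto the hypersphere."""
    mean = np.mean(query_embeddings, axis=0)
    return mean / np.linalg.norm(mean)

def rank_targets(query, target_embeddings):
    """Rank target examples by decreasing cosine similarity to the query.
    With unit-norm vectors, the dot product equals the cosine similarity."""
    sims = target_embeddings @ query
    return np.argsort(-sims)
```

Because ranking only depends on positions in the common space, the target set can freely mix examples from several domains.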

Refining queries across domains
With our approach, a source query is close to target examples from the same category, regardless of the domains of the query and target examples. In practice, inherent variability in the hyperspherical semantic space can cause noise in the similarity measures. We therefore propose to refine the initial query representation using a nearby example from the target domain. Figure 3 illustrates the refinement.
We refine the query representation $p_0$ by performing a spherical linear interpolation with a relevant representation $p_1$. This relevant representation is either the nearest neighbour in the target set (for retrieval) or the word embedding of the category (for classification). The refined representation $\hat{p}$ is expressed as:

$$\hat{p} = \frac{\sin\left((1-\lambda)\,\Omega\right)}{\sin \Omega}\, p_0 + \frac{\sin\left(\lambda\,\Omega\right)}{\sin \Omega}\, p_1,$$

where $\Omega = \arccos\left(\langle p_0, p_1 \rangle\right)$ and $\lambda \in [0, 1]$ controls the amount of mixture in the refinement process. The higher the value of $\lambda$, the further the refined representation moves away from the original representation $p_0$. Intuitively, the refinement performs a weighted signal averaging to reduce the noise present in the initial representation. The amount of interpolation $\lambda$ is a hyperparameter that we study in the supplementary materials.
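A minimal implementation of the spherical linear interpolation, assuming $p_0$ and $p_1$ are unit-norm numpy vectors:

```python
import numpy as np

def slerp(p0, p1, lam):
    """Spherical linear interpolation between unit vectors p0 and p1.
    lam=0 returns p0, lam=1 returns p1; the result stays on the
    hypersphere, unlike a plain weighted average."""
    omega = np.arccos(np.clip(p0 @ p1, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return p0                      # identical points: nothing to refine
    return (np.sin((1.0 - lam) * omega) * p0
            + np.sin(lam * omega) * p1) / np.sin(omega)
```

For retrieval, $p_1$ would be the nearest target neighbour of the query; for classification, the word embedding of the candidate category.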

Open Cross-Domain Visual Search
In the first set of experiments, we demonstrate our newly gained ability to perform open cross-domain visual search in three ways. We note that this is a new setting, making direct comparisons to existing works infeasible.

Setup. We evaluate on the recently introduced DomainNet [36], which contains 596,006 images from 345 classes. Images are gathered from six visual domains: clipart, infograph, painting, pencil, photo and sketch. We consider retrieval in zero- and many-shot settings: i) in the zero-shot setting, $\mathcal{Y}$ is split into $\mathcal{Y}_{train}$ and $\mathcal{Y}_{test}$, with $\mathcal{Y}_{train} \cap \mathcal{Y}_{test} = \emptyset$, i.e., categories to be searched during inference have not been seen during training; ii) the many-shot setting uses the same categories during both training and testing. The zero-shot setting randomly splits samples into 300 training and 45 testing classes with at least 40 samples per class. The many-shot setting follows the original splits [36]. We report the mean average precision (mAP@all). Briefly, we use SE-ResNet50 [21] pre-trained on ImageNet [41] as a backbone, and word2vec trained on a Google News corpus [33] as the common semantic space. We optimize the loss function with Nesterov momentum [46]. We set the learning rate to 1e-4 with cosine annealing without warm restarts [29] and the batch size to 128.

From any source to any target domain
First, we demonstrate how searching from any source to any target domain in an open setting is trivially enabled by our approach. Figure 4 shows the result of 72 cross-domain search evaluations, corresponding to all pairs among the six domains in both the zero-shot and many-shot settings. In our formulation, such an exhaustive evaluation is enabled by training only six models, one for every domain. For comparison, a domain adaptation approach, the standard in current cross-domain search approaches, requires a pair-wise model for every combination of source and target domains. For DomainNet, we find that the photograph domain provides the most effective search, whether used as source or target. One reason is the number of available images, which is up to four times larger than in other domains. On the other hand, infographs and sketches are very diverse in terms of scale and visual representations, which induces a much more difficult search. We conclude from the first demonstration that search from any source to any target domain is not only feasible with our approach, it can also be done easily, since we bypass the need to align different domains.

From multiple sources to any target domain
Second, we demonstrate the potential to search from multiple source domains. Due to the generic nature of our approach, we are not restricted to search from a single source. Here, we show that a multi-source search benefits the search in any target domain. For this experiment, we start from the sketch domain as a source and investigate the effect of including queries from the most effective source (photographs) and the least effective source (infographs). Table 1 highlights the positive effect of searching with an additional domain, rather than a single source domain. When using multiple sources, we simply average the positions in the common semantic space. For fairness, we also evaluate search using two sketches. Across all settings, we find that searching from multiple queries improves relative to using one single sketch query. In the zero-shot setting, including infographs and photographs improves upon sketch-based search only. In the many-shot setting, including infographs improves upon search by one sketch, but not by two sketches, which is not surprising given the low search scores for infographs individually. Including photographs with sketches obtains the highest scores, regardless of the target domain or the evaluation setting.

[Figure 5 caption: For abstract categories such as sun, abstract domains such as clipart or pencil drawings tend to be retrieved first. When sketches are more ambiguous, such as calculator, some retrieved results are incorrect but resemble the shape.]
This demonstration shows the potential of searching from multiple sources. It is better to diversify the search by using multiple domains than include more queries from the same domain. Similar to the first demonstration, this evaluation is a trivial extension to our approach, as we only have to average positions in the shared semantic space, regardless of the domain the examples come from.

From any source to multiple target domains
Third, we demonstrate our ability to search in multiple domains simultaneously. This setting has potential applications, for example in untargeted portfolio browsing, where a user may want to explore all possible visual expressions of a category. Exploring multiple domains also highlights whether certain categories have a preference towards specific domains, which offers insight into how to best depict those categories. Note that this setting can easily be extended to include multiple domains as a source as well. For the sake of clarity, we use sketch as the source domain and search in the other five domains.
                 Sketchy Extended        TU-Berlin Extended
                 mAP@all   prec@100      mAP@all   prec@100
EMS [30]         n/a       n/a           0.259     0.369
CAAE [54]        0.196     0.284         n/a       n/a
ADS [10]         0.369     n/a           0.110     n/a
SEM-PCYC [12]    0.349     0.463         0.297     0.426
SAKE [27]        0.535     0.677         0.471     0.600
This paper       0.649     0.708         0.517     0.557

Table 2: Comparison to zero-shot sketch-based image retrieval on Sketchy Extended and TU-Berlin Extended. Aligning the semantics, rather than the domains, improves cross-domain image retrieval.

Figure 5 provides qualitative results for six sketches from different categories. We first observe that the results come from multiple target domains, without being explicitly told to do so. We do not need to align results from different target domains, since we measure distance in the common semantic space. For categories such as sun, we have a bias towards retrieving abstract depictions, such as pencil drawings and cliparts, as the sun is a category with a clear abstract representation. Castle, on the other hand, has a bias towards both distinct cliparts, as well as photographs and paintings. In both cases, all top results are relevant. For categories with more ambiguous sketches, such as river or calculator, retrieved examples resemble the shape of the provided sketch, but do not match the category. Overall, we conclude that searching in multiple domains is not only trivial in our approach, but also an indicator of the presence of preferential domains for visually depicting categories.

Closed Cross-Domain Visual Search
Our approach is geared towards open cross-domain visual search, as demonstrated above. To get insight into the effectiveness of our approach for cross-domain visual search in general, we also perform an extensive comparative evaluation in standard cross-domain settings, which search from one source domain to one target domain. In total, we compare on three of the most popular cross-domain search tasks, namely zero-shot sketch-based image retrieval [13,42,44], few-shot sketch-based image classification [19], and many-shot sketch-based 3D shape retrieval [22,24]. For our approach, we simply train one mapping function for the source domain, and one for the target domain, using the examples provided during training. Since each approach in closed cross-domain visual search employs different networks and optimizations, an apples-to-apples comparison is not feasible. Hence, we compare our results to the current state-of-the-art results as reported in the respective papers. Implementation details are in the supplementary materials. Below, we handle each comparison separately.

Figure 6: Qualitative analysis of zero-shot sketch-based image retrieval. We show six sketches of Sketchy Extended, with correct retrievals in green, incorrect in red. For typical sketches (e.g., cup), the closest images are from the same category. For ambiguous sketches (e.g., tree) or non-canonical views (e.g., butterfly), our approach struggles.

Zero-shot sketch-based image retrieval
Setup. Zero-shot sketch-based image retrieval focuses on retrieving natural images (target domain) from a sketch query (source domain). We evaluate on two datasets. Sketchy Extended [26,42] contains 75,481 sketches and 73,002 images from 125 classes. Following Shen et al. [44], we select 100 classes for training and 25 classes for testing. TU-Berlin Extended [13,56] contains 20,000 sketches and 204,070 images from 250 classes. Similarly, following Shen et al. [44], we select 220 classes for training and 30 classes for testing. For both datasets, we select the same unseen classes as in Liu et al. [27]. Following recent works [12,27,44], we report the mAP@all and the precision at 100 (prec@100) scores.
Results. Table 2 compares to five state-of-the-art baselines on both datasets. Baselines mostly focus on bridging the domain gap between sketches and natural images with domain adaptation losses [15,16]. On Sketchy Extended, our approach outperforms all other baselines. On TU-Berlin Extended, we obtain the highest mAP@all, while the recently introduced SAKE by Liu et al. [27] obtains a higher prec@100. SAKE is then better at grouping images from the same category together, while our approach is better at retrieving relevant images in the first ranks. We also report on quantized representations in the supplementary materials, with similar improvements over existing baselines. Overall, our formulation based on semantic alignment is competitive with respect to alternatives that focus on domain adaptation or knowledge preservation.

Qualitative analysis. To understand which sketches trigger the performance of natural image retrieval, we provide several qualitative example sketches with their top retrieved images in Figure 6. Our approach works well for typical sketches of categories, while results degrade when sketches are ambiguous or in non-canonical views.

Few-shot sketch-based image classification
Setup. Few-shot sketch-based image classification focuses on classifying natural images from one or a few labeled sketches. The few-shot categories to be evaluated have not been observed during training. Different from the zero-shot retrieval scenario, the few-shot classification setting has access to the labels of the unseen classes in the evaluation phase, for example in the form of sketches or word embeddings. We report results on the Sketchy Extended dataset [26,42]. Following Hu et al. [19], we select the same 115 classes for training and 10 classes for testing. We evaluate the performance with the multi-class accuracy and report results over 500 runs. Classification is done by measuring the distance to the class prototypes. We evaluate on three different modes [19]. First, we set the word vectors (w2v) to be the prototypes of the unseen classes. Second, we set one or five sketch representations to be prototypes. Third, we use one or five images. The latter is considered as an upper bound of the cross-domain task.
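All three evaluation modes reduce to nearest-prototype classification in the shared space. A minimal sketch, assuming unit-norm embeddings and illustrative names:

```python
import numpy as np

def fewshot_prototype(support_embeddings):
    """Condense one or more support embeddings (sketches or images in
    the few-shot modes) into a single prototype by averaging and
    renormalizing onto the hypersphere."""
    mean = np.mean(support_embeddings, axis=0)
    return mean / np.linalg.norm(mean)

def classify(embedding, class_prototypes):
    """Assign the class whose prototype has the highest cosine
    similarity; all vectors are assumed unit-norm."""
    keys = list(class_prototypes)
    sims = [float(embedding @ class_prototypes[k]) for k in keys]
    return keys[int(np.argmax(sims))]
```

In the first mode the prototype is the word vector itself; in the other modes it is built from one or five mapped support examples via `fewshot_prototype`.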
Results. Table 3 compares our formulation to two baselines introduced by Hu et al. [19]. M2M regresses weights for natural image classification from the weights of the sketch classifier, while F2M regresses weights from sketch representations. For the first evaluation mode, we obtain an accuracy of 76.73%, compared to 35.90%, which reiterates the importance of a semantic alignment for categorical cross-domain search. In the few-shot evaluation, we find that the biggest relative improvement is achieved in the one-shot setting. Our approach is thus effective for cross-domain classification, especially with a low number of shots.
Qualitative analysis. To understand how to best employ our approach for few-shot sketch-based image classification, we provide the most and least effective sketches for image classification in Figure 7. Since categories are condensed to a single prototypical sketch, our approach favours detailed sketches in canonical configurations. Results degrade when these assumptions are not met.

Many-shot sketch-based 3D shape retrieval
Setup. Sketch-based 3D shape retrieval focuses on retrieving 3D shape models from a sketch query, where both training and testing samples share the same set of classes. We evaluate on three datasets. SHREC13 [22] is constructed from the TU-Berlin [13] and Princeton Shape Benchmark [45] datasets, resulting in 7,200 sketches and 1,258 3D shapes from 90 classes. The training set contains 50 sketches per class, the testing set 30. SHREC14 [24] contains more 3D shapes and more classes, resulting in 13,680 sketches and 8,987 3D shapes from 171 classes. The training and testing splits of sketches follow the same protocol as SHREC13. We also report on the recently outlined Part-SHREC14 [40], which contains 3,840 sketches and 7,238 3D shapes from 48 classes. The sketch splits also follow the same protocol, while the 3D shapes are now split into 5,812 for training and 1,426 for testing to avoid overlap. We generate 2D projections for all 3D shape models using the Phong reflection model [39] and render 12 different views by placing a virtual camera evenly spaced around the unaligned 3D shape model with an elevation of 30 degrees. We only aggregate the multiple views during testing to reduce complexity. We report six retrieval metrics [23]. The nearest neighbour (NN) denotes the precision@1. The first tier (FT) is the recall@K, where K is the number of 3D shape models in the gallery set of the same class as the query. The second tier (ST) is the recall@2K. The E-measure (E) is the harmonic mean between the precision@32 and the recall@32. The discounted cumulated gain (DCG) and mAP are also reported.

Table 4: Comparison to many-shot sketch-based 3D shape retrieval on SHREC13, SHREC14, and Part-SHREC14. Having a metric space revolving around semantic prototypes benefits five out of six metrics.
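The tier-based metrics can be sketched as follows for a single query, given the gallery labels in ranked order (an illustrative sketch, not the official SHREC evaluation code):

```python
def nearest_neighbour(ranked_labels, query_label):
    """NN: precision@1, whether the top-ranked item matches the query class."""
    return float(ranked_labels[0] == query_label)

def first_tier(ranked_labels, query_label):
    """FT: recall@K, with K the number of gallery items of the query's class."""
    K = sum(1 for l in ranked_labels if l == query_label)
    return sum(1 for l in ranked_labels[:K] if l == query_label) / K

def second_tier(ranked_labels, query_label):
    """ST: recall@2K, counting hits of the query's class in the top 2K ranks."""
    K = sum(1 for l in ranked_labels if l == query_label)
    return sum(1 for l in ranked_labels[:2 * K] if l == query_label) / K
```

The benchmark scores average these per-query values over all test sketches.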
Results. Table 4 shows the results on all three benchmarks and six metrics. We compare to seven state-of-the-art baselines, which mostly focus on learning a joint feature space of sketches and 3D shapes with metric learning [6,17,43]. Across all three benchmarks, we observe the same trend, where we obtain the highest scores on five out of the six metrics. Only for the precision@1 metric (NN) do the recent approaches of Chen et al. [4] and Qi et al. [40] obtain higher scores on all three benchmarks. A reason for this behaviour is that both approaches directly optimize for the nearest neighbour metric. Indeed, Qi et al. [40] search in the label space, while Chen et al. [4] perform a learned hashing. Overall, we conclude that our approach, while simple in nature, provides competitive results compared to the current state-of-the-art in sketch-based 3D shape retrieval.
Qualitative analysis. To gain insight into the pros and cons of our approach for retrieving 3D shapes from sketches, we provide qualitative examples in Figure 8. While rotations of unaligned shapes can be handled, confusion remains between visually similar categories.

Conclusion
In