1 Introduction

Large document collections such as the New York Times Annotated Corpus and the ACM Digital Library cover many diverse topics. It can be a daunting task to decide what to read on a new topic or to find which combinations of topics deserve more attention. As an aid to readers, various tags (usually key words and phrases) are assigned to each document in these collections, reflecting the topics covered by that document. In a large collection, each tag can be assigned to hundreds or thousands of documents.

Following standard practice in information retrieval, we model a document as a bag of terms represented by a document term vector (DTV), a vector in which each entry corresponds to a term together with that term’s (normalized) frequency in the document. A set of DTVs can be aggregated to obtain a set centroid for the corresponding documents, which can be used to summarize the document set. Given the centroid, a system can produce other summary measures, such as a representative set of “bursty” terms [19, 37], the medoid [9, 18], or a diverse set of representative documents [12].
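
To make the representation concrete, the following minimal sketch (in Python, with a hypothetical toy document) builds a normalized DTV from a bag of terms, mapping each term to its relative frequency:

    from collections import Counter

    def document_term_vector(tokens):
        """Build a normalized DTV: term -> relative frequency in the document."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {term: freq / total for term, freq in counts.items()}

    dtv = document_term_vector(["cloud", "database", "cloud", "service"])
    # {'cloud': 0.5, 'database': 0.25, 'service': 0.25}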

In addition to the DTV, each document in our collections is assigned a set of tags that are external to the document. Each metadata tag is a value chosen from a facet (such as location, time, organization, person, cost, or event) that corresponds to a conceptual dimension used to describe the data [49]. For simplicity, we assume that the facets are unstructured (i.e., that the value space within a facet is not hierarchically organized) and that each document is assigned zero or more values from each of the facets. Following standard practice, we assume that the assigned tags have been selected with care: They are typically of high quality and identify topics or important concepts found in the document.

A facet-based browsing environment supplements a traditional search engine by adding facilities that allow users to benefit from the metadata tags. With the help of faceted search a user may start to explore the ACM tagged document collection by issuing a traditional search request, say “databases cloud.” As in other systems, the user is presented with the top k matching documents, but in addition the user is also informed by the system of the tags associated with those documents. In response to “databases cloud,” the user might learn that all the corresponding documents are tagged “database,” 90 % are tagged “cloud computing,” 55 % of the top responses are also tagged “service-oriented architecture,” 35 % are tagged “security and privacy,” and 10 % are tagged “genome informatics.” (Instead of precise percentages, similar information might instead be provided in the form of tag clouds.) The user could then select tags of interest to formulate a refined query by issuing a Boolean query over tags (i.e., “slicing and dicing” the collection).
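
The tag percentages presented to the user can be derived directly from the tag assignments of the matching documents. A minimal sketch, assuming a hypothetical list top_docs of result document IDs and a hypothetical map tags_of from document ID to tag set:

    from collections import Counter

    def tag_distribution(top_docs, tags_of):
        """Fraction of the result set carrying each tag, most frequent first."""
        counts = Counter(tag for doc in top_docs for tag in tags_of[doc])
        k = len(top_docs)
        return sorted(((tag, n / k) for tag, n in counts.items()),
                      key=lambda pair: -pair[1])

    # e.g. [('database', 1.0), ('cloud computing', 0.9), ...] for the example above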

Faceted search helps to narrow a large result set down to a more manageable size for browsing, and a study at the University of North Carolina library showed that it was preferred by users over traditional search interfaces based on text content alone [40]. In a variety of other settings, user studies have found that systems supporting faceted search and browsing are superior to traditional search engines and to systems based on clustering by document content [17]. For example, Yee et al. [49] found that “Despite the fact that the interface was often an order of magnitude slower than a standard baseline, it was strongly preferred by most study participants. These results indicate that a category-based approach is a successful way to provide access to image collections.” Kipp and Campbell [29] found that “Users would find direct access to the thesaurus or list of subject headings showing articles indexed with these terms to be a distinct asset in search.” Hearst [23] concluded that “Usability results show that users do not like disorderly groupings like [those produced by clustering systems], preferring understandable hierarchies in which categories are presented at uniform levels of granularity.” Pratt et al. [38] found that “a tool that dynamically categorizes search results into a hierarchical organization by using knowledge of important kinds of queries and a model of the domain terminology \(\ldots \) helps users find answers to those important types of questions more quickly and easily than when they use a relevance-ranking system or a clustering system.” Zhang and Marchionini [52] found that “[A faceted search and browsing] interface will bring users added values beyond simple searching and browsing by in fact combining these search strategies seamlessly.” Faceted search has thus emerged as a valuable technique for information access in many e-commerce sites, including Wal-Mart, Home Depot, eBay, and Amazon [45].

We envision an enhanced interface that, in addition to a traditional faceted search interface, provides summary information about the resulting document set. For example, a summary may consist of the k most representative articles in the sub-collection that satisfies the query, the most common terms used within articles in that sub-collection, and the distribution of tags that are assigned to articles in the sub-collection. If the summary matches the user’s information need, individual articles in that set can be retrieved; otherwise the user can reformulate the query (often “drilling down” by specifying additional tags or “rolling up” by removing some tags from the query) to arrive at a more appropriate set of articles. We have recently described a prototype of the enhanced interface in further detail [14]. For an analyst (or even a casual reader) armed with the New York Times, this approach might uncover sets of articles that provide a comprehensive summary of news reports on a specific subtopic of interest. For a computer scientist investigating an unfamiliar research area through the ACM Digital Library, providing summaries based on tag-based queries can identify the most relevant articles to read in the area and how those articles relate to topics identified by other tags.

We rely on the previous studies to validate the utility of faceted search and browsing: It is prevalent, effective, and satisfies users’ needs. In this paper, we concentrate on making such systems more efficient. If a document collection is already provided with meaningful metadata tags so that faceted search and browsing is feasible, the main problem that needs to be solved is to find a fast way of calculating centroids, which are required to provide summaries of document sets that match users’ tag-based queries. Because some sets can be very large, aggregating large amounts of data in order to calculate summary measures may be too time-consuming to be performed online. For conventional data warehouses, On-Line Analytical Processing (OLAP) systems have been developed in order to speed up the aggregation of multidimensional data through full or partial materialization of summaries. Similarly, we show that partial materialization is required in order to provide summaries at each step of a faceted search when space is limited.

Unfortunately, current OLAP systems are designed for data collections that have tens of dimensions and will not work for document collections that have hundreds of facets with millions of tags. To handle such a large number of dimensions, we propose to materialize centroids for sub-collections that correspond to all documents sharing small subsets of tags. Thus, centroids are stored for predetermined subsets of the data, and calculating centroids for arbitrary subsets corresponding to users’ queries requires aggregating data from several overlapping subsets (because documents with multiple tags will contribute to multiple materialized centroids). The techniques used in current OLAP systems, however, do not accommodate such overlap.

This paper includes the following contributions:

  • detailed analyses of tagging patterns in two representative multi-tagged document collections: the New York Times Annotated Corpus and the ACM Digital Library;

  • a storage design that performs well for calculating centroids of document sets that result from both short and long conjunctive queries over tags and enables aggregation of cells with overlapping data;

  • the development and evaluation of several partial materialization strategies for high-dimensional, sparse data.

The paper is organized as follows. Related work is described in Sect. 2, and requirements for a browsing system are proposed in Sect. 3. Next, the properties of prototypical multi-tagged document collections are introduced in Sect. 4. A new storage architecture for multi-tagged document collections that supports efficient computation of topic centroids is described in Sect. 5, and the partial materialization techniques that take advantage of it are described in Sect. 6. Then, in Sect. 7, the performance of the storage architecture and the materialization strategies are evaluated on three real and several synthetic collections. Conclusions and further work are summarized in Sect. 8.

2 Background and related work

2.1 Folksonomies and tag recommendation systems

The New York Times and the ACM Digital Library rely on tags being chosen by users with care so as to maximize the reuse of tags where applicable and to distinguish between concepts through the use of disjoint tag sets where possible. To achieve these ends, they employ a controlled vocabulary for some facets and allow only limited use of uncontrolled vocabulary.

In contrast, many social Web sites, such as Delicious and Flickr, allow users to attach arbitrary tags to documents to organize content on the Web. Tags can be chosen by users at will, and different users may assign different tags to the same object. This results in so-called folksonomies [24] that include many tags per document, large tag vocabularies, and significant noise. Faceted browsing has been implemented over folksonomies in systems such as dogear [34], and the complex tagging patterns involved can benefit from more efficient exploration, which is the aim of our work.

User studies that examine users’ perceptions of the role and value of tags [28] show that one common view treats tags as keywords (describing key aspects of the document) and another treats them as categories. This is supported by another study [1], a taxonomy of tagging motivations in ZoneTag/Flickr, which concluded that one of the purposes of tags is to show context and provide content description.

To help reduce noise, there is much research on tag recommender systems, which are designed to help users assign tags to documents. Content-based tag recommender systems assume that tags for a specific resource can be extracted by processing the textual information about the resource to be annotated. These approaches adopt techniques from information retrieval [2] in order to extract relevant terms to label a resource. More specifically, term frequency and inverse document frequency have been shown to yield good keyword identification results [6, 16, 48], and their use has been adopted by tag recommendation systems [5]. Content authors and editors do not explicitly compute inverse document frequency when tagging an article, but their intuition about which words are informative substitutes human judgment for this measure.

In related work, a tag recommender system that relies on topic modeling was developed to provide an annotator with a set of diverse tags that represent the various topics covered in the document [3]. The generative model in the system simulated the users’ tagging process in a social tagging system. It assumed that for any resource there are a multitude of topics, and that when users tag a resource they first identify topics of interest from the resource, after which they express the chosen topics via a set of words (tags). Each topic accordingly corresponds to a probability distribution over tags, which gives the probability of picking out a tag with respect to a certain topic. The user studies performed as part of the evaluation of the system suggested that users preferred the tags suggested by this new system.

From these previous studies, we conclude that a carefully annotated document has tags representing all the topics that have sufficiently high presence in that document.

2.2 Browsing document collections

In place of faceted search, which has been described in the introductory section, browsing systems might rely on document clustering. For example, traditional search engines are designed to provide a user with a ranked list of documents that satisfy a query, and clustering may be performed on top of the result set in order to organize similar documents into groups [50]. Because clustering can be fully automated, it can be applied to text collections that have not been assigned metadata tags. However, if clustering were applied online to result sets consisting of thousands of documents, it would impose unacceptably long delays. To avoid this bottleneck, systems such as Clusty.com perform clustering on the top k results only. On the other hand, if clustering is applied offline, cluster labels can be interpreted as metadata tags and the techniques proposed here can be similarly applied.

Scatter/Gather is a well-known document search interface based on clustering [9]. Users explore a document collection by dynamically clustering a set of documents (scattering), selecting clusters of interest based on their summaries, and then treating all documents in the selected clusters as one set (gathering). These steps are then repeated to further investigate the contents of the sub-collection. The summaries used to characterize clusters take the form of a set of representative terms, chosen on the basis of frequency alone, together with the headlines of the documents closest to the centroids.

Like Scatter/Gather, our proposal allows users to repeatedly select subsets of the document collection, examining summaries for each grouping of documents to determine whether or not to include specific groupings in the refinements. However, unlike Scatter/Gather, the system we envision is based on a multi-valued, faceted labeling for each document rather than on hard clusterings; thus even if cluster labels at each step were to be treated as if they were metadata tags for externally-specified classes, Scatter/Gather would correspond to a single-valued labeling of documents. Furthermore, in Scatter/Gather it is difficult for users to predict what clusters will be generated, since the grouping criterion is hidden, unlike when aggregation is specified through tags that are visible to the user. Finally, we envision a search system in which a user is free to broaden the search at any step, rather than being expected to restrict themselves to drilling down.

To make Scatter/Gather usable in an interactive manner, offline hierarchical clustering can be performed on the document collection [10]. In this approach, meta-documents corresponding to a union of documents are created offline, and during the scatter phase the meta-documents are clustered instead of the actual documents, thus reducing the number of items to be clustered and thereby reducing execution time. Document clustering is therefore only approximated. In addition, inter-document distances are computed based on selected features instead of the full text of the meta-documents, thus again reducing execution time. Interestingly, the third variant of our proposal stores centroids for meta-classes, somewhat akin to Scatter/Gather’s meta-documents, but those centroids are corrected to exact centroids for the associated classes before they are used in browsing.

An even faster implementation of Scatter/Gather (LAIR2) was developed by Ke et al. [27], based on precomputing a complete (binary) hierarchical clustering of the documents. Thus, for a collection with N documents, LAIR2 materializes \(N-1\) clusters. Then, instead of clustering documents during the browsing stage, it retrieves prematerialized nodes from the cluster hierarchy. Like the previous approach, however, the authors are only concerned with improving the execution time and do not consider the storage cost required to store every sub-cluster of a full hierarchical clustering. In contrast, the amount of storage required by a browsing system, as well as execution time, is central to our work.

2.2.1 Tag exploration

One difficulty in browsing via tags is to determine which tags are present in the collection and how tags are related to each other. A query and browsing interface can display the distribution of tags that are assigned to articles in each result set, thereby suggesting tags that can be used for further refinement. For broadening a search, the system could display tags that are associated with carefully chosen supersets of the result set. Alternatively, the system could provide a mechanism to browse the tags themselves (as opposed to the documents associated with those tags) through an interface to a thesaurus or ontology [8, 36, 43]. We make no assumptions about the structure of the tag space for our work, and the incorporation of tag-browsing facilities is orthogonal to our work.

2.2.2 Multi-document summarization

The set of documents that result after each browsing or search step must be presented to the user in some form. Search engines, for example, display the top-k matches after ranking, and browsing systems can similarly present the k most representative documents of a result set, as is done in Scatter/Gather. As a special case, the medoid document, i.e., the one closest to the centroid, can be displayed. Another form of summarization is to display the most representative terms that appear in the result set, for which Scatter/Gather chooses the most frequently occurring terms, but representativeness might be defined using inverse document frequency as well or using other statistical measures, such as information gain.

Alternatively, a more informative summary of a result set may be a précis generated from the documents. There are many different approaches to perform such multi-document summarization, based on abstraction and information fusion, topic-driven summarization, clustering, graphs, and ranking [11, 21]. Of particular relevance here are multi-document summarization methods that rely on using the centroids of document sets [39], which is the measure for which we are designing an efficient infrastructure.

2.3 OLAP for data warehouses

Data cubes serve as the model for describing OLAP operations [20]. A cube’s dimensions reflect the attributes that characterize the facts stored in the cube’s cells; for example, a set of sales records might have dimensions for date of sale, location of sale, type of product sold, customer demographics, etc. Because multidimensional analysis often requires aggregated measures over some of the dimensions (e.g., average sale prices for each product per day, regardless of location and customer), OLAP systems provide the materialization of selected cuboids defined over a subset of dimensions, storing precomputed aggregates in each resulting cell. The dimensionality of a cuboid is equal to the number of unaggregated dimensions, and its space is proportional to the number of cells (the product of the number of possible values in each unaggregated dimension). Thus a d-dimensional cuboid stores aggregated values in cells indexed by the possible values for each of the d unaggregated dimensions and, if each dimension is binary, requires \(O(2^d)\) space.

2.3.1 Full materialization

OLAP systems that materialize all possible cuboids offer the best response time to user queries. However, full materialization requires \(O(2^n)\) space for cubes with n dimensions. Compression can be applied to achieve full materialization while reducing the storage cost; this can save space in situations where there is significant repetition in cell measures, as is the case with sparse cubes. Compression techniques for data cubes include condensed cubes [46], dwarf cubes [44], and quotient cubes [30]. However, these techniques do not scale to a high number of dimensions [32].

2.3.2 Partial materialization

Partial materialization techniques are used to materialize a subset of cuboids (also referred to as views) from the lattice of cuboids [22]. When answering a query, instead of fetching the data from the base cuboid and performing aggregation on it, the cuboid corresponding to the query can be calculated from the closest materialized superset cuboid. Therefore, the subset of cuboids to materialize is picked so as to minimize the time needed for the expected query workload, while requiring no more than a given amount of storage.

Thin cube shell materialization is a partial materialization where only the base cuboid and certain low-dimensional (most highly aggregated) cuboids are stored [32]. More specifically, in addition to the base cuboid, the strategy stores all cuboids having exactly d dimensions, where \(d\ll n\), n is the total number of dimensions, and there are \({n \atopwithdelims ()d}\) d-dimensional cuboids. Alternatively, we could materialize all cuboids having d or fewer dimensions, which would further reduce the execution time of short queries at the expense of additional storage space. However, d-dimensional cuboids can be used to answer only queries that involve at most d dimensions; answering such a query involves choosing a materialized cuboid and aggregating the data for the dimensions omitted in the query. On the other hand, queries involving more than d dimensions are answered by aggregating over the base cuboid. Picking a larger d for materialization results in increased storage cost and increases the time required to answer queries with few dimensions, but picking a small d results in much longer computation time for queries with more than d dimensions. If the expected workload has a wide range of queries, there may not be a fixed d that is appropriate.

As an improvement over a thin cube shell, Li et al. [32] proposed a shell fragment approach for dealing with high-dimensional cubes. The technique relies on the assumption that high-dimensional data have limited interactions among dimensions (tags). It assumes that on average any one tag interacts with at most K other tags, where K is at most five and these tag interactions can usually be well clustered. Under such circumstances when a collection has T unique tags, it can be partitioned into T / K nonoverlapping fragments. Depending on the properties of the data and the query workload, it may be necessary to choose fragments of various sizes. However, larger fragments require more storage space. If the tag interactions cannot be clustered well, it may be necessary to store overlapping fragments to provide satisfactory query response time, in which case more fragments need to be stored. This, in turn, leads to greater storage requirements. For each of these fragments a full cube materialization is stored; thus, all the cuboids of dimensions ranging from 1 to K are materialized. This results in \(2^K-1\) cuboids materialized per fragment, where a cuboid with d dimensions has \(2^d-1\) cells, which therefore implies \(\sum _{i=1}^{K}{K \atopwithdelims ()i}(2^i -1)\) cells for a fragment. For a fragment of size \(K=3\), 19 cells per fragment are needed. For scenarios in which the prematerialized fragments do not enclose the user’s query, again the view needs to be calculated from the base cuboid, which can be time-consuming.
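
The per-fragment cell count quoted above follows directly from the binomial formula; a quick check in Python:

    from math import comb

    def fragment_cells(K):
        """Cells stored per shell fragment of size K: all cuboids of 1..K dimensions."""
        return sum(comb(K, i) * (2**i - 1) for i in range(1, K + 1))

    print(fragment_cells(3))  # 19, as stated above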

2.4 Document warehouses

A document warehouse is like a data warehouse, except that instead of performing analyses over tabular data, it supports analyses over documents. OLAP in document warehouses has been used to provide users with summaries of related documents through the use of centroids and medoids of the clusters found in cells of a cube [26, 51]. Efficient storage strategies for OLAP over nonoverlapping sets of documents have been proposed [51], and a fully materialized approach that deals with overlapping sets has also been proposed [26], but efficient storage strategies that can handle overlapping document sets—the focus of this paper—have not been explored. In tagged document collections, tags are treated as dimension values. Two different forms of schema can be used for determining how tags are assigned to dimensions: multidimensional schemas and single-dimensional schemas.

2.4.1 Multidimensional schema

A multidimensional schema (MDS) stores each tag in a separate binary dimension, where 0 signifies that the corresponding tag is not assigned and 1 signifies that it is. For example, if a document \({{d}}_1\) has tags (Finances, Stocks) and \({{d}}_2\) is tagged with (Stocks) only, then \({{d}}_1\) is stored in cell (1, 1) and \({{d}}_2\) is stored in cell (0, 1) of the 2D cuboid with those two dimensions. This cuboid can answer the query \( Finances \vee Stocks \) by aggregating cells (1, 1), (0, 1), and (1, 0) together, where, for this small example, the cell (1, 0) is empty. By having a separate dimension for each tag, we can ensure that aggregations performed on a cuboid do not double count any documents. Storing a data cube for MDS is a challenge when there are many tags.

2.4.2 Single-dimensional schema

A single-dimensional schema (SDS) stores all tags in one dimension. The dimension can take on a value ranging from 1 to T, where T is the number of unique tags in the collection. This approach works well in situations where each document is assigned only a single tag. Zhang et al. [51] used this approach for organizing a collection of documents into nonoverlapping cells and developed a partial materialization scheme on top of it.

In contrast, Jin et al. [26] used SDS for storing documents with multiple tags. Unfortunately, this can result in the same document being assigned to multiple cells, which is problematic when the cells in a cuboid are aggregated. Continuing with the example above, because there is only one “tags” dimension, cell(Finances) stores \({{d}}_1\) and cell(Stocks) stores both \({{d}}_1\) and \({{d}}_2\). In this situation, simply adding the counts for cell(Finances) and cell(Stocks) to count the number of results for the query \( Finances \vee Stocks \) will result in double counting \({{d}}_1\). We adopt the solution to this problem developed by Jin et al., namely storing document membership information for each cell, so that when multiple cells are aggregated, cell overlaps can be detected and compensations applied. Jin et al. use a full materialization on a small data set and focus on the union operation only; optimizing conjunctive queries involving overlapping cells has not been considered.
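
A minimal sketch of the double-counting problem and the membership-based compensation, with hypothetical SDS cells stored as sets of document IDs:

    # Hypothetical SDS cells: each cell stores the IDs of its member documents.
    cell = {"Finances": {"d1"}, "Stocks": {"d1", "d2"}}

    # Naively adding counts for the query Finances OR Stocks double counts d1.
    naive_count = len(cell["Finances"]) + len(cell["Stocks"])  # 3 (wrong)

    # With membership information, the overlap is detected and compensated.
    exact_count = len(cell["Finances"] | cell["Stocks"])       # 2 (correct)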

3 System requirements

In this section we describe requirements and associated challenges for a system that will support online analytical processing for a large document collection. The requirements are derived in part by examining characteristics of the PubMed interface to biomedical literature [15].

PubMed includes more than 24 million abstracts and corresponding citations to articles, which are annotated with a variety of tags chosen from Medical Subject Headings (MeSH), the National Library of Medicine controlled vocabulary thesaurus used for indexing articles for PubMed; EC/RN Numbers, assigned by the Food and Drug Administration (FDA) Substance Registration System for Unique Ingredient Identifiers; and Supplementary Concept tags, which include chemical, protocol, or disease terms. The PubMed interface supports searching for documents using a standard text search, matching query terms against the abstract, the citation, and all assigned metadata tags, as well as by specifying that some of the query terms should be restricted to matching MeSH terms (or some other facet) only.

A corpus of MeSH terms assigned to PubMed documents includes 244,553,378 tags assigned to 20,997,401 documents, or 11.65 tags per document on average. The corpus identifies 71,690,729 assigned tags as “major,” that is, tags whose topics play a major part in the associated paper, yielding on average 3.41 major tags per document.

PubMed users looking for relevant articles can benefit immensely from searching with the aid of metadata tags [35]. Since PubMed is a very large collection and the sizes of sets of search results are often large, it can certainly benefit from more efficient calculation of aggregate measures that summarize the contents of query results.

3.1 Supported measures

As explained in Sect. 1, a document is considered to be a bag of terms, represented by its document term vector (DTV). All but the top m terms, based on mutual information, can be ignored so as to avoid storing stop words and other uninformative terms, and for every document, the frequency of each remaining term is stored as a normalized DTV.

In order to provide meaningful summaries about a document set (e.g., its medoid, a set of representative documents, or a set of representative terms), we need to compute the set centroid C, which can be represented by a vector of term frequencies equal to the mean of all the DTVs for documents that belong to that set. However, instead of storing the means directly, for a set of documents S, we store its centroid \(C_S\) as a dictionary that maps terms to (sum, count) pairs, which is easily updated when documents are added to or removed from the set:

$$\begin{aligned} C_S[term].sum&= \sum _{d \in S}{d[term]} \end{aligned}$$
(1)
$$\begin{aligned} C_S[term].count&= |\{d \in S ~|~ d[term]>0\}| \end{aligned}$$
(2)

where d is a normalized DTV of length m. Thus, a set centroid vector has length m regardless of how many documents are in the set.
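
A minimal sketch of this representation in Python, maintaining Eqs. 1 and 2 incrementally as documents are added to or removed from a set (the {term: [sum, count]} encoding and the sign convention are illustrative choices):

    def update_centroid(centroid, dtv, sign=+1):
        """Add (sign=+1) or remove (sign=-1) one normalized DTV from a set
        centroid stored as {term: [sum, count]} (Eqs. 1 and 2)."""
        for term, freq in dtv.items():
            if freq > 0:
                entry = centroid.setdefault(term, [0.0, 0])
                entry[0] += sign * freq  # Eq. 1: running sum of term frequencies
                entry[1] += sign         # Eq. 2: documents containing the term

    centroid = {}
    for d in [{"cloud": 0.5, "database": 0.5}, {"cloud": 1.0}]:
        update_centroid(centroid, d)
    # centroid == {'cloud': [1.5, 2], 'database': [0.5, 1]}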

3.2 Supported queries

Associated with each document is a set of “metadata” tags, each of which is assumed to represent some aspects of the document’s content. We allow the user to pose queries as Boolean formulas over tags, such as \({Election}\wedge {President}\wedge ({Stocks}\vee {Stock\_Market})\). Conjunctions of terms narrow down the scope of documents to those that involve all the concepts represented by the conjuncts. The use of negation is allowed, but only in the form “and not” to allow a conjunction with the complement of the documents having a given term, as in the example \(President \wedge \lnot Election\). Disjunction provides a means of query expansion, allowing synonyms and related tags to be included in a query [33].
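
Given a postings list (the set of document IDs) for each tag, these query forms reduce to set operations. A minimal sketch over a hypothetical postings map:

    def conjunctive(postings, tags):
        """Documents carrying all of the given tags."""
        result = postings[tags[0]].copy()
        for tag in tags[1:]:
            result &= postings[tag]
        return result

    def example(postings):
        # Election AND President AND (Stocks OR Stock_Market); "AND NOT" is
        # set difference, e.g. postings["President"] - postings["Election"].
        return (conjunctive(postings, ["Election", "President"])
                & (postings["Stocks"] | postings["Stock_Market"]))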

3.3 Expected workload

Users explore a multi-tagged document collection through a browser front end that enables them to invoke Boolean queries. As part of their exploration, they may pose queries and read summaries (in the form of data derived from set centroids, such as representative documents or representative terms). After users examine summaries of document sets, they may choose to drill down to smaller subsets of documents by issuing more specific queries. The browsing system we are developing is required to provide quick responses to the generated queries.

Fig. 1 Analysis of unique tags per query

Fig. 2 Number of distinct tags co-occurring with tags in a query

We rely on data from a PubMed query log [35] to help characterize a feasible query workload. Among 2,996,301 queries collected over a single day, 16,928 queries include only terms chosen from facets that have a controlled vocabulary (specifically, MeSH terms [MH], MeSH major topics [MAJR], MeSH subheadings [SH], filters [FILTER], EC/RN Numbers [RN], and supplementary concepts [NM]), with the possible addition of one or more pure text terms. Treating each text conjunct or disjunct as if it were a single tag, these queries involve anywhere from 1 to 46 tags, with the majority of the queries using between 1 and 3 tags (Fig. 1). Figure 2 shows the number of distinct tags that co-occur in queries having a given query tag, from which we observe that tags are used repeatedly in queries in a variety of contexts specified by other tags. The usage patterns of the three Boolean operators are summarized in Fig. 3, which shows that the “NOT” operator occurs in 1 % of queries, the “OR” operator occurs in 18 % of the queries, and the “AND” operator occurs in 62 % of the queries; 31 % of queries are of length 1 and so use no operators. These observations suggest that a realistic query workload will likely include primarily short queries that are predominantly conjunctive, as has been observed for other Web search systems [25].

Fig. 3 Distribution of operator counts in PubMed query log: a AND, b OR, c NOT

In summary, our expectation is that (after some preliminary traditional searches of the text) users explore a collection by starting with a small set of tags of interest and then iteratively refining their queries to be more focused by including additional tags. For some queries, query expansion will be applied to incorporate some alternative tags. Thus, most queries will be conjunctions of tags (i.e., no negations and only occasionally disjunctions to accommodate alternative tags), most queries will be short, and most queries will match sets that include a large number of documents.

3.4 Design objective

We wish to provide a fast response to user queries by having an upper bound on the number of documents or centroids of materialized sets that need to be retrieved from secondary storage. At the same time we wish to minimize the number of set centroids that need to be precomputed and materialized to accomplish this. We focus on providing upper bound guarantees on execution costs for (positive) conjunctive queries, since they are expected to be most frequent: For each such query, no more than k DTVs or set centroids need to be accessed, for some fixed k. Queries that involve disjunction and negation will be answered using multiple conjunctive subqueries, and they may therefore require more than k DTVs or set centroids in total.

A document cube provides an excellent mechanism for structuring the collections of documents so as to answer Boolean queries on tags. Each cell in the cube represents the set of documents that have a specific tag assignment, and each cuboid represents document sets that are aggregated (“rolled up”) by grouping on specific tags and ignoring others. As a result, conjunctive tag queries can be answered by selecting specific cells from appropriate cuboids, and centroids of document sets that correspond to other Boolean queries can be computed by combining the centroids from selected cuboid cells. The problem to be addressed is to determine which cells or cuboids to materialize to balance space and time.

4 Document collections

To motivate the design of our proposed index, we evaluate document collections from two different domains: the New York Times Annotated Corpus (NYT) [41] and the ACM Digital Library (ACM) [47].

4.1 New York Times Annotated Corpus

The NYT collection includes 1.8 million articles spanning 20 years. The collection has 1 million tags that cover many different facets, such as people, places, companies, and descriptors, and multiple tags can be assigned to each article. Out of the various types of tags contained in the collection, we consider only the tags found in the general online descriptors, which are the ones that correspond to the text found in the articles. Table 1a shows a tag assignment for a single document found in the NYT collection. In our analysis we consider only tags that have been assigned to at least 200 documents, yielding 1015 such tags that are applied to 1.5 million documents.

Table 1 An article from (a) NYT with corresponding general online descriptors assigned to it, and (b) ACM with corresponding category and keyword tags assigned to it

The tagging patterns exhibited by a document collection affect system design choices and determine whether any of the previously developed OLAP materialization strategies can be applied. We analyzed the tagging patterns of NYT using measures adopted from analyses of tagging patterns in folksonomies such as Delicious [7]. These measures capture the frequency of tags appearing in the collection, the distribution of tag counts per document, and the amount of co-occurrence between the 10 most frequent tags and other tags.

The plot of tag frequencies is shown in Fig. 4a. The frequencies are normalized by the count of the most popular tag, which happens to be Finances, with a count of 142 thousand documents. At the other end of the spectrum, the least frequent tag (with our cutoff) has 201 documents. When the tag frequencies are sorted in descending order, the distribution resembles Zipf’s law.

The number of tags assigned per document is shown in Fig. 5a. This ranges from 34 % of the documents being assigned just one tag to a few documents having 43 tags, with 2.7 tags per document on average. Figure 6a shows the amount of document overlap between each of the 10 most popular tags and all the other tags, with the other tags shown in descending order by their frequency of co-occurrence. The presence of multiple tags per document and the high co-occurrence among the tags produce many nonempty document sets that match the conjunction of multiple tags.

Fig. 4 Tag assignment frequency in a NYT and b ACM

Fig. 5 Distribution of the number of tags per document for a NYT and b ACM

Fig. 6 Tag co-occurrence frequencies for the 10 most frequent tags in a NYT and b ACM

4.2 ACM digital library

The much smaller ACM collection contains 66 thousand articles organized with categories, general terms, and keywords. In our analysis we consider only the category and keyword tags, since there are only 16 general terms available. Table 1b shows an instance of a tag assignment for a single article found in the ACM collection. Since this collection is so much smaller than NYT, we include all tags with at least five occurrences in our analysis, resulting in 9098 tags that satisfy this criterion.

The plot of tag frequencies is shown in Fig. 4b. Again, the frequencies are normalized by the count of the most popular tag, which has been used to tag 2144 documents; the least frequent tag (with our cutoff) has been assigned to five documents. Just as for NYT, the distribution of tag frequencies resembles Zipf’s law.

The number of tags assigned per document ranges between 1 and 41, as shown in Fig. 5b. The mean number of tags per document is 4.1 (when both keywords and categories are combined). As was true for the NYT collection, the distribution has a very wide range, with the majority of documents having fewer than 10 tags. Since there is a higher mean number of tags per document in the ACM, we expect a larger number of tag conjunctions to produce nonempty document sets.

Figure 6b shows the proportion of documents that have one of the 10 most popular tags and some other tags. The shape of the ACM graph is somewhat similar to that for NYT, but its magnitude is significantly higher, showing that the ACM tags are more inter-correlated.

4.3 Deeper analysis of tagging patterns

For additional insight into the tagging patterns exhibited by the NYT and ACM collections, two more properties are analyzed. The first property refers to the order of tag co-occurrences, while the second property refers to the similarity between sets in the collections.

4.3.1 Higher order tag co-occurrence

We define a collection to have a high n-way co-occurrence among its tags if the number of documents having n tags in common is greater than k for many different combinations of tags. Such document sets are of interest because their document count may be too high to have the set centroid calculated online, and this measure indicates the dimensionality of cuboids that need to be materialized in order to answer queries on tags efficiently.

The tag co-occurrence measures we have described for the NYT and ACM collections show that there is significant correlation among various pairs of tags, but they do not tell us whether there is also high n-way co-occurrence for \(n>2\). The threshold used for determining whether an n-way co-occurrence is high should be set to the size of a document set for which it would be efficient to calculate a summary online. For our experiments, we have chosen \(k=50\), which is quite appropriate for NYT. However, ACM is significantly smaller, and we wish to test how well our approaches scale; therefore, we have chosen to use \(k=5\) for ACM in order to expose its tagging structure in more detail.

Table 2 Number of conjunctions of n tags that contribute to high multi-way co-occurrence for NYT and ACM, with threshold limits of 50 for NYT and 5 for ACM

Table 2 shows that there are many surprisingly large tag sets where the corresponding document count is above the threshold. If tags were assigned to each document independently of other tags, then the chance that the intersection of more than four tags would result in a document set of size greater than the threshold would be low. The multi-way correlation that exists between tags has a big impact on the number of cells that need to be materialized.

4.3.2 Overlap between document sets

The similarity between two sets of documents \(S_a\) and \(S_b\) can be quantified by evaluating the size of their symmetric difference \(|S_a \triangle S_b|\). We consider two sets as similar if \(|S_a \triangle S_b|\) is less than or equal to the predefined threshold k, which is again set to 50 for NYT and 5 for ACM.

Let \(\mathcal {D}\) be a document collection and let \(\mathbb {A} \subseteq 2^\mathcal {D}\) and \(\mathbb {B} \subseteq 2^\mathcal {D}\) be sets of document sets. We define the k-overlap \(O_k{(\mathbb {A},\mathbb {B})}\) between \(\mathbb {A}\) and \(\mathbb {B}\) as follows:

$$\begin{aligned} O_k(\mathbb {A},\mathbb {B}) = \frac{|\{S_a \in \mathbb {A} ~|~ \exists S_b \in \mathbb {B}: |S_a \triangle S_b| \le k\}|}{|\mathbb {A}|} \end{aligned}$$
(3)

By applying \(O_k{(\mathbb {A},\mathbb {B})}\) to pairs of set families reflected in Table 2, we can compute the percentage of overlap between document sets corresponding to the conjunction of i tags and document sets corresponding to the conjunction of j tags. This is summarized in Table 3, where the column for conjunctions of i tags shows the percentage of overlap with sets defined by j tags (\(j<i\)), and all entries are 100 for \(i>j\ge 4\). Thus, many documents share many tags, but whenever significantly many documents share tags \(G=\{t_1,t_2, \ldots ,t_i\}\) for \(i>4\), not many additional documents share any subset \(G'\subset G\) of those tags if \(|G'|\ge 4\).
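
Equation 3 translates directly into code; a minimal sketch over hypothetical document sets represented as frozensets of document IDs:

    def k_overlap(A, B, k):
        """Eq. 3: fraction of sets in A within symmetric difference k of some set in B."""
        hits = sum(1 for Sa in A if any(len(Sa ^ Sb) <= k for Sb in B))
        return hits / len(A)

    A = [frozenset({1, 2, 3}), frozenset({7, 8})]
    B = [frozenset({1, 2, 3, 4})]
    print(k_overlap(A, B, k=1))  # 0.5: only the first set in A is within distance 1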

Table 3 Percent overlap of cells for given query lengths

4.4 Significance of tagging patterns

The analyses of tagging patterns found in NYT and ACM reveal important challenges associated with such document collections. First, since there are many popular tags, some of which are assigned to as many as 140,000 documents, it is infeasible to compute set centroids for corresponding document sets online. Instead, it is necessary to precompute the data in order to guarantee satisfactory response times. Second, since many documents are assigned multiple tags and there is high correlation among the tags, it is reasonable to expect users to issue queries using tag conjunction. The more tag combinations that can be meaningfully conjoined, the more possible sets of documents exist about which a user can enquire. Third, surprisingly many tags can appear in a conjunctive query that yields a large document set. This leads to the existence of many sets for which centroids need to be computed; far more than one would expect if tags were assigned randomly. Materializing centroids for all these combinations is infeasible.

Fortunately, there is a lot of overlap between sets of documents of different query lengths. Because of this overlap, we can achieve considerable savings in storage cost by developing a suitable partial materialization strategy that can scale to large document collections with large numbers of tags. We explore this further in Sect. 6.

5 Storage architecture

5.1 Basic infrastructure

We assume that documents are stored as files and that the collection is indexed by a mapping from document IDs to the corresponding files. In order to support queries over tags, we further assume the existence of an inverted index that stores a postings list of document IDs for each of the tags. Finally, we assume that normalized document term vectors have been precomputed and that an index from document IDs to DTVs (as might be produced by a standard search engine) is also available.

With this minimalistic storage structure, tag queries may be answered using the following steps:

  1. Use the inverted index over tags to return the set S of document IDs that satisfy the query.

  2. Initialize the document set centroid \(C_S\) to be empty.

  3. For each \(s \in S\):

     (a) Retrieve the DTV for s.

     (b) Add the DTV to the document set centroid \(C_S\) using Eqs. 1 and 2.

This algorithm requires |S| DTVs to be read from secondary storage, which will be quite slow for large sets. Even for systems with sufficient main memory to store the whole collection, it will be beneficial to avoid online aggregation of |S| documents, especially when supporting many concurrent users. Therefore, it is desirable to bound the number of documents that must be retrieved for each query.

One way to reduce the cost of answering a query is to store precomputed centroids for well-chosen sets of documents \(\mathbb {P}\) and to use these at query time to reduce the number of documents that must be retrieved. Given a document set S, we find a highly overlapping set \(P \in \mathbb {P}\) and retrieve the precomputed set centroid \(C_P\) as well as the DTVs for all documents in \(S \triangle P = (S-P) \cup (P-S)\), the symmetric difference between sets S and P. To calculate \(C_S\), the DTVs of the retrieved documents are added to or subtracted from the centroid of P (Eqs. 1 and 2) in order to compensate for the difference between S and P. As a result, instead of retrieving |S| DTVs, we need to retrieve \(|S \cup P| - |S \cap P|\) DTVs as well as \(C_P\). In the special case where \(S \in \mathbb {P}\), we merely need to retrieve \(C_S\) and avoid accessing any DTVs.
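
A minimal sketch of this compensation step, reusing the update_centroid helper sketched in Sect. 3.1 (the dtv_of accessor is a hypothetical index lookup):

    def centroid_via_precomputed(S, P, C_P, dtv_of):
        """Derive C_S from the materialized centroid C_P by compensating over S and P."""
        C_S = {term: entry[:] for term, entry in C_P.items()}  # start from a copy of C_P
        for doc in S - P:                   # documents in S that P is missing
            update_centroid(C_S, dtv_of(doc), sign=+1)
        for doc in P - S:                   # documents in P that are not in S
            update_centroid(C_S, dtv_of(doc), sign=-1)
        return C_S                          # |S ^ P| DTV retrievals instead of |S|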

When choosing \(\mathbb {P}\), the sets for which we precompute and store centroids, its size is constrained by the amount of available storage space. To benefit from \(\mathbb {P}\), we wish to select sets that closely match the expected query load. To this end, we adopt the practice of using partial materialization of document cubes [22].

5.2 Storing document member sets

To support intersection and union operations on cells with overlapping sets of documents, the IDs of member documents need to be accessible for any given cell. Rather than storing document membership information for cells, even for those that are materialized, we store the list of defining tags with each materialized cell and rely on the postings list of documents for each tag to find the corresponding set of document IDs (Sect. 5.1). By storing each tag’s postings list as a compressed bitmap with Word-Aligned Hybrid (WAH) encoding [31], we require 17.8 MB for the NYT collection, which has 1.5 million articles, and 1.7 MB for the ACM collection, which has 66 thousand articles. With such a low memory footprint, it is feasible to keep the tag postings lists in memory. Thus, this approach conserves storage space and efficiently supports finding the set of documents associated with a cell.
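
The idea is easiest to see with uncompressed bitmaps; the sketch below uses plain Python integers as bitmaps (WAH adds word-aligned run-length compression on top of the same representation):

    def bitmap(doc_ids):
        """Uncompressed bitmap: bit i is set iff document i carries the tag."""
        bits = 0
        for i in doc_ids:
            bits |= 1 << i
        return bits

    tag_bitmap = {"Finances": bitmap([0, 2, 5]), "Stocks": bitmap([2, 3])}
    both = tag_bitmap["Finances"] & tag_bitmap["Stocks"]  # conjunction of postings
    members = [i for i in range(both.bit_length()) if both >> i & 1]  # [2]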

5.3 Granularity of materialization

Materialization decisions can be made at two levels of granularity: whole cuboids or individual cells. We show here that it is preferable to decide whether or not to materialize individual cells.

5.3.1 Full cuboid materialization

A d-D cuboid, which can answer all queries that involve the subset of dimensions found in it, has \(\prod _{i=1}^d{n_i}\) cells, where \(n_i\) is the number of unique values in dimension i. Since in a multidimensional schema each tag dimension has only two values (0 and 1), a d-D cuboid has \(2^d\) cells. However, the cell that has all dimensions set to zero (i.e., none of the tags present) is not required when evaluating queries with at least one positive term (Sect. 3.2), and so only \(2^d-1\) cells need be stored.

Table 4 (a) 3D cuboid for tag set \(\{t_1,t_2,t_3\}\), (b) I(T) for tag set \(\{t_1,t_2,t_3\}\)

An example of a 3D cuboid is shown in the first two columns of Table 4a. The cuboid consists of seven cells, one for each assignment of three tags (except for the cell that has all dimensions set to zero). Although a d-D cuboid can answer all queries involving any or all of the d tags defining the cuboid, it is optimized to answer queries that include all d dimensions; if fewer are specified, several cell measures must be aggregated. For example, \(C_{t_1}\) (the centroid for all documents having tag \(t_1\), regardless of whether or not they also have tags \(t_2\) and \(t_3\)) can be computed as \(C_{t_1\wedge t_2\wedge t_3} + C_{t_1\wedge t_2\wedge \lnot t_3} + C_{t_1\wedge \lnot t_2\wedge t_3} + C_{t_1\wedge \lnot t_2\wedge \lnot t_3}\), which requires accessing four of the 3D cuboid’s cells. In general, to answer a conjunctive query involving t tags by using a d-D cuboid defined over d tags with \(d\ge t\), the set of tags defining the cuboid must include the set of tags used in the query and \(2^{d-t}\) cells must be aggregated.

5.3.2 Individual cell materialization

Given a tag set \(T=\{t_1, \ldots ,t_d\}\), instead of materializing a whole d-D cuboid, this strategy materializes I(T), the set of cells corresponding to all conjunctive queries without negation, which can be defined as follows:

$$\begin{aligned} I(T)= \{X_t. alltags ~|~ t \in 2^T \setminus \{\emptyset \}\} \end{aligned}$$
(4)

where \(X_t\) is a cuboid for tag set t and alltags refers to the cell corresponding to all tags present. Table 4b shows the set of cells that will be materialized for \(T=\{t_1,t_2,t_3\}\). The source column of the table identifies the cuboid from which the cell is taken.

For \(|T| = d\), this approach materializes \(2^{d}-1\) cells, which is equal to the number of cells in the d-D cuboid. As shown in the last column of Table 4a, the set centroids for any cell in the 3D cuboid for tag set \(\{t_1,t_2,t_3\}\) can be derived using the set of cells in \(I(\{t_1,t_2,t_3\})\). In general, similar conversions can be derived by taking advantage of the inclusion–exclusion principle.
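
For instance, because the documents contributing to \(C_{t_1\wedge t_2\wedge t_3}\) are a subset of those contributing to \(C_{t_1\wedge t_2}\), the (sum, count) pairs can be subtracted componentwise, and additional negations alternate signs in the usual inclusion–exclusion fashion:

$$\begin{aligned} C_{t_1\wedge t_2\wedge \lnot t_3}&= C_{t_1\wedge t_2} - C_{t_1\wedge t_2\wedge t_3} \\ C_{t_1\wedge \lnot t_2\wedge \lnot t_3}&= C_{t_1} - C_{t_1\wedge t_2} - C_{t_1\wedge t_3} + C_{t_1\wedge t_2\wedge t_3} \end{aligned}$$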

5.4 Query performance evaluation

For tag set T, full cuboid materialization and I(T) both require the same number of cells to be stored, and both can answer all Boolean queries over T. However, the cost to answer a query depends on which set of cells has been materialized. Table 5 shows the number of cells that need to be aggregated to compute the answers to all queries involving one or two of t tags using a t-D cuboid vs. using individual cells included in the corresponding I(T). In this table, the column labeled “count” shows how many distinct queries have the format shown in the column labeled “pattern”; for example, given four possible tags, there are 12 distinct queries that involve the conjunction of one (positive) tag and one negated tag; to answer any one of these queries, we need to access four cells in the cuboid (one for every combination of presence and absence of the remaining tags), but only the two cells from I(T) that correspond to the sets of documents having each tag.

Table 6 compares the cost of computing set centroids for all possible 3-tag queries when relying on a materialized 3D cuboid against the cost when relying on cells materialized using I(T). For three tags, there are seven nonoverlapping sets of documents (corresponding to the seven cells in the 3D cuboid) and thus \(2^7-1\) equivalence classes of queries. For each class, we found the minimum-length query (one with fewest literals) and, using these, tabulated the number of queries of each possible length against the cost (number of cells) needed to answer them under each materialization strategy. For very short queries, the query cost is lower when using I(T) than when using the cuboid, and, importantly, the difference—as well as the length of query for which I(T) outperforms the cuboid—becomes more pronounced as the number of dimensions in the materialized cuboids increases.

Table 5 Cost of answering simple queries using a materialized t-D cuboid versus the individual cell materialization strategy
Table 6 Number of minimal queries having given costs and query lengths when using a 3D cuboid (cub) or I(T) strategy

To compare query runtime when adopting the full cuboid materialization model against using the I(T) model, we design a neutral query workload that has no bias toward any type of query: each query includes at most three tags (with repetitions allowed), and the frequency of occurrence for each query depends only on its length, independent of how many times it requires negation, union, intersection, or a combination of these operations. Using Table 6, Table 7 shows the average cost when using the I(T) and complete cuboid (C) architectures to answer queries when the query length probability distribution is uniform, zero-truncated Poisson, or geometric. For all but the uniform distribution (where performance differs by only 10 %), the I(T) architecture outperforms the full cuboid materialization approach, even when all queries involve fewer than four distinct tags.

Additionally, we evaluated the performance of the two storage architectures with a query workload model derived from the analysis of the PubMed query log. Since the ‘NOT’ operator was observed in only 1 % of queries, the derived model does not generate queries with that operator. This leaves 18 query patterns that the model can generate, characterized by the number of ‘AND’ and ‘OR’ operators they use. The probability of seeing queries with \(a\) ‘AND’ operators and \(o\) ‘OR’ operators is calculated from the probability distribution observed in Fig. 3 under the assumption that the two distributions are independent. Table 7 shows that the I(T) architecture outperforms the complete cuboid (C) architecture on the resulting generated workload.

Table 7 Average cost of answering query

Since we expect short queries to be more frequent than long ones, it is advantageous to use the individual cell materialization strategy.

5.5 Storage performance evaluation

The choice of materialization strategy affects the storage efficiency of the system, which can be evaluated by looking either at the amount of storage space necessary to support a fixed set of queries or at the number of queries that can be answered when a fixed amount of storage space is used. In this section, the storage efficiency of the I(T) storage architecture is compared to two partial materialization strategies that rely on full cuboid materialization: thin cube shells and shell fragments.

5.5.1 Thin cube shell

The thin cube shell approach, described in Sect. 2.3.2, relies on materializing all cells in all cuboids of a prescribed depth. As a result, the number of cuboids, and in turn the number of cells, that would need to be materialized grows rapidly as the number of dimensions increases. For a collection with T tags, thin cube shell materialization with d-dimensional cuboids requires \({|T| \atopwithdelims ()d}(2^d-1)\) cells to be materialized. In contrast, the I(T) approach that supports all queries of up to d tags requires \(\sum _{i=1}^{d}{|T| \atopwithdelims ()i}\) cells to be materialized.

For example, since NYT uses 1015 tags, there are \(1015 \atopwithdelims ()3\) three-dimensional cuboids, which corresponds to \(1.2 \times 10^9\) cells that would need to be materialized by the thin cube shell strategy. On the other hand, the I(T) approach requires \(1.7 \times 10^8\) cells to be materialized, roughly seven times fewer. However, for most of those tag combinations there are at most 50 corresponding documents, and often there are none at all. Therefore, there is no need to materialize all cells with the I(T) approach or all cuboids with the thin cube shell approach. With the I(T) architecture, unnecessary cells can easily be pruned: only 38,368 cells are required to answer all queries that involve up to three distinct tags and match more than 50 documents (the sum of the first three columns in Table 2), or 60,100 cells to ensure that every conjunctive query producing a result set above the threshold size can be answered efficiently. On the other hand, the thin cube shell approach requires 25,010 3D cuboids to be stored, which corresponds to 175,070 cells (at 7 cells per cuboid). That is, the thin cube shell approach requires almost three times as many cells as the I(T) approach when the cuboid size is chosen to be 3, and the multiplier grows as the prescribed cuboid size increases.
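
These counts can be reproduced from the formulas above; a quick check in Python:

    from math import comb

    T, d = 1015, 3
    thin_shell = comb(T, d) * (2**d - 1)               # ~1.2e9 cells
    i_of_t = sum(comb(T, i) for i in range(1, d + 1))  # ~1.7e8 cells
    print(round(thin_shell / i_of_t, 1))               # ~7.0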

5.5.2 Shell fragments

The shell fragment approach, proposed by Li et al. [32] and described in Sect. 2.3.2, relies on materializing all cuboids for each fragment, where the fragments form a partitioning of the tags. For NYT's 1015 tags, we would need to store 338 fragments covering tag triples and one fragment for the remaining tag, resulting in 6423 stored cells (19 cells per three-tag fragment plus one cell for the singleton). This is 11 % of the size needed by the I(T) storage approach that stores all 60,100 cells (and can answer every conjunctive query producing a result set larger than 50), but the design is efficient only for queries whose tags all fall within the same fragment (at most 338 of the 20,905 tag triples that correspond to document sets above our threshold). When the tags in a query are not all in the same fragment, the cuboids from the fragments containing the involved tags must be intersected to identify which documents to aggregate online, and the response time may be unacceptable when the number of documents to aggregate exceeds the threshold. If we instead partition into fragments of six tags each, we would need 111,496 cells (86 % more than what is needed by I(T)), and still at most 170 of the 2401 important tag sextets would appear within a single fragment.
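
The per-fragment cell count follows from the same counting argument: assuming every cell of every cuboid within a fragment is materialized, an n-tag fragment contributes \(\sum_{j=1}^{n}\binom{n}{j}(2^j-1) = 3^n-2^n\) cells, which reproduces the 19 cells per triple quoted above.

    from math import comb

    def fragment_cells(n):
        """Cells across all cuboids of an n-tag fragment; closed form 3**n - 2**n."""
        return sum(comb(n, j) * (2**j - 1) for j in range(1, n + 1))

    assert fragment_cells(3) == 19       # 19 cells per triple fragment
    print(338 * fragment_cells(3) + 1)   # 6423 cells for NYT, as above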

The effectiveness of answering conjunctive queries of various lengths on the NYT and ACM collections when relying on shell fragment materialization, with fragment sizes of 6 and 3, is analyzed in Table 8. The fragments used are nonoverlapping and chosen using a greedy heuristic that builds fragments that can answer the longest conjunctions. In both the NYT and ACM collections, only a very small percentage of conjunctions can be answered using the nonoverlapping shell fragments. Thus, the partial materialization generated by shell fragments cannot guarantee acceptable performance when tags co-occur with many other tags and tag co-occurrences are of high order, as is true in both the NYT and ACM collections.

Table 8 Analysis of the number of conjunctive queries producing result sets above the threshold size that can be answered when 6D or 3D shell fragment materialization is used on (a) NYT (b) ACM

6 Partial materialization strategies

Because each centroid term vector includes a (sum, count) pair for each of the m most significant terms found in the document collection (Sect. 3.1), and \(m = 500\) for our collections, storing a single uncompressed centroid can take as much as 4 KB (500 pairs of 4-byte values). Even if centroids are compressed, they still require considerable space. Therefore, it is worthwhile to avoid materializing cells whenever feasible.

To this end, three partial materialization strategies are proposed: threshold materialization (TM), threshold materialization with ancestors (TMA), and materialization of cluster centroids (MCC). For each materialization strategy, we give algorithms for choosing the centroids to materialize and for answering queries using those centroids with appropriate compensations when a requested cell centroid is not materialized.

6.1 Threshold materialization

Assuming that we can afford to access and aggregate at most k documents when computing a centroid (Sect. 5.1), we start by precomputing and materializing the centroids for all conjunctive queries whose results contain more than k documents. We therefore need to identify which combinations of tags produce “cells of significant size” after intersection, as enumerated for Table 2. Algorithm 1 returns a list M of intersection cells that have more than k member documents.

The algorithm is based on the simple observation that including additional tags in a conjunctive query cannot increase the number of documents in the resulting intersection. It starts with all possible single tags, which correspond to 1D cells, and the set of documents associated with each tag. Using the method augmentSet(), it then repeatedly includes one more tag in the conjunction. (The method returns a list of sets, each augmenting the base set with one tag not already included in that base. To avoid repeated consideration, only tags with a higher index than the maximum tag index in set u are included in the list of sets returned by \(u.augmentSet(T)\).) The tag sets (together with their corresponding document sets) that have more than k documents, and therefore require further exploration, are kept in a queue L. Each time the number of documents in a cell exceeds the threshold k, the cell is included in the result set, the intersections with each remaining tag are computed, and the resulting augmented tag sets are enqueued on L for further consideration. The algorithm continues to examine cells with more and more intersecting tags until no remaining candidate has more than k documents.

Algorithm 1 Threshold materialization (TM)
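
For illustration, here is a minimal Python sketch of this enumeration, assuming each tag's postings list is available as a set of document IDs; it mirrors the description above rather than the exact implementation.

    from collections import deque

    def threshold_materialization(postings, k):
        """postings: dict mapping tag index -> set of document IDs.
        Returns the cells (tag combinations) with more than k documents."""
        tags = sorted(postings)
        M = []                                         # materialized cells
        L = deque(((t,), postings[t]) for t in tags)   # 1D cells
        while L:
            cell, docs = L.popleft()
            if len(docs) <= k:
                continue    # adding more tags cannot enlarge the intersection
            M.append((cell, docs))
            # augmentSet(): extend only with higher-indexed tags so that no
            # combination is considered twice
            for t in tags:
                if t > cell[-1]:
                    L.append((cell + (t,), docs & postings[t]))
        return M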

Given this partial materialization, the following steps are performed to evaluate a query:

  1. Transform the query into an equivalent representation R using the inclusion–exclusion principle.

  2. For each resulting conjunction, check whether the corresponding cell has been materialized and, if so, retrieve the centroid measure.

  3. For each of the nonmaterialized conjunctions:

     (a) determine the set of documents in the intersection (merge the tags' postings lists);

     (b) retrieve and aggregate the document term vectors to generate the corresponding centroid measure.

  4. Combine the centroid measures in accordance with R (see the illustration below).
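
As an illustration of steps 1 and 4, consider the query \(t_1 \wedge \lnot t_2\): by inclusion–exclusion, its centroid follows from the conjunctive cells for \(t_1\) and \(t_1 \wedge t_2\), and because centroids store (sum, count) pairs they combine componentwise. A sketch, with hypothetical helper names:

    def combine(c1, c2, sign=+1):
        """Componentwise combination of two centroid measures; each measure
        maps term -> (sum, count)."""
        out = dict(c1)
        for term, (s, n) in c2.items():
            s0, n0 = out.get(term, (0.0, 0))
            out[term] = (s0 + sign * s, n0 + sign * n)
        return out

    # centroid(t1 AND NOT t2) = centroid(t1) - centroid(t1 AND t2),
    # since the documents tagged t1 and t2 are a subset of those tagged t1:
    # result = combine(cells[('t1',)], cells[('t1', 't2')], sign=-1)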

6.2 Threshold materialization with ancestors

Table 3 shows that the set of documents found by intersecting a set of tags G is often equal or very similar to the set found by intersecting the tags in \(G' \subset G\). With this insight, we extend Algorithm 1 with the additional materialization constraint that the size of the symmetric difference between each cell and its closest materialized ancestor must be greater than k. This materialization approach, described by Algorithm 2, is designed to take advantage of the similarity between cells that involve similar tags. (The method \(M.getClosestAncestor(u)\) retrieves the closest materialized ancestor of u as measured by symmetric difference.) Notice, however, that even if a cell is not materialized because it is similar to a materialized ancestor, some of its descendant cells might still require materialization; in this respect, the approach uses a greedy algorithm rather than finding the optimal set of cells to materialize. Nevertheless, for collections with a large amount of co-occurrence between tags, this approach provides significant storage savings, and for collections with very little similarity, it performs like the TM algorithm.

An additional lookup table is stored for this strategy, where for each nonmaterialized cell c having more than k documents, we store a pointer to the closest materialized ancestor a. Although this requires a small amount of space, it is far less than what is required to store a centroid.

Algorithm 2 Threshold materialization with ancestors (TMA)
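
A sketch of the additional test that Algorithm 2 applies during the enumeration, under the same assumptions as the earlier sketch (get_closest_ancestor stands in for \(M.getClosestAncestor(u)\)):

    def should_materialize(cell_docs, M, k, get_closest_ancestor):
        """Materialize a cell only if it exceeds the size threshold AND
        differs from its closest materialized ancestor by more than k
        documents (symmetric difference)."""
        if len(cell_docs) <= k:
            return False
        ancestor_docs = get_closest_ancestor(cell_docs, M)
        if ancestor_docs is not None and len(cell_docs ^ ancestor_docs) <= k:
            return False     # close enough to derive from the ancestor
        return True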

The following steps are now required to evaluate a query:

  1. Transform the query into an equivalent representation R using the inclusion–exclusion principle.

  2. For each resulting conjunction, check whether it has been materialized and, if so, retrieve the centroid measure.

  3. For each of the nonmaterialized conjunctions:

     (a) determine \(S_{c}\), the set of documents in the intersection (merge postings lists for the given tags);

     (b) if \(|S_{c}| \le k\), retrieve and aggregate the document term vectors to generate the corresponding centroid measure;

     (c) if \(|S_{c}| > k\):

        i. retrieve the document member set \(S_{*a}\) of the closest materialized ancestor a;

        ii. retrieve and aggregate the document term vectors in the set \(S_{*a} - S_{c}\) and call the result \(\delta C\);

        iii. compute the centroid measure as \(C_{*a}-\delta C\) (sketched after this list).

  4. Combine the centroid measures in accordance with R.
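
Step 3(c) amounts to the following compensation, reusing the combine helper from the earlier sketch; fetch_dtv is a hypothetical accessor returning a document's term vector as term \(\rightarrow\) (value, count) pairs:

    def centroid_via_ancestor(S_c, S_a, C_a, fetch_dtv):
        """Derive the centroid of cell c from its closest materialized
        ancestor a (S_a is a superset of S_c)."""
        delta = {}
        for doc in S_a - S_c:                        # at most k documents
            delta = combine(delta, fetch_dtv(doc), sign=+1)
        return combine(C_a, delta, sign=-1)          # C_c = C_a - delta C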

6.3 Materialization of cluster centroids

A third approach is to compute centroids for carefully selected document sets that do not necessarily correspond to cells in the data cube. Instead of depending on the closest materialized ancestor to provide an approximate centroid, this strategy stores centroid measures for sets that correspond to no specific query but from which query results can be derived. For document collections with little similarity among cells, this algorithm materializes at most as many cells as Algorithm 1.

Algorithm 3 starts by calling the candidateCells() function (which returns the document sets for the cells chosen for materialization by Algorithm 1) to obtain the set M of document sets representing cells whose centroids cannot be computed by merely combining at most k document term vectors. Next, the method closePairs() initializes a priority queue Q that will contain \((c_i,c_j,\delta )\) triples, ordered by ascending \(\delta\), where \(c_i,c_j \in M \wedge c_i\ne c_j\) and \(\delta = |S_{c_i} \triangle S_{c_j}|\). In this method, all child and sibling relationships among pairs of cells are examined to identify those pairs \((c_i,c_j)\) that have a low \(\delta\). This corresponds to initializing Q with the candidate pairs satisfying \(c_i\subset c_j \wedge |c_j| - |c_i|=1\) (parent–child relationship) or \(|c_i|=|c_j| \wedge |c_i \triangle c_j|=2\) (sibling relationship).

The algorithm then applies complete-link clustering [33] to find collections of highly overlapping document sets. Unlike traditional hierarchical clustering, however, we do not care about the order in which clusters are merged, as long as all the members of each cluster satisfy the complete linkage requirement that they are within a symmetric difference distance of 2k of the furthest member in the cluster (which ensures that all the sets' centroids can be computed from the cluster centroid by considering at most k documents). Because the cluster hierarchy need not be preserved, the clustering can be implemented efficiently for large numbers of cells using a standard union-find algorithm [42]. To accomplish its clustering, the algorithm first invokes the method initializeClusters() to generate a disjoint set data structure G, where \(G[M_i]=M_i\), that is used to track which cells (document sets) are assigned to which clusters (partitions). The method getCluster() retrieves the partition to which a specified cell belongs, setCluster() assigns a specified cell to a partition, and \(p.maxDistance()\) returns \({\mathop {\max }\nolimits _{c_i,c_j \in p}}(|S_{c_i} \triangle S_{c_j}|)\).

Algorithm 3 returns a set of partitions \(\{p_1, \ldots, p_n\}\), where each partition \(p_i\) represents a cluster of cells \(S_{(i,1)}, \ldots, S_{(i,n_i)}\) and each \(S_{(i,j)}\) corresponds to a conjunction of tags. For each partition \(p_i\), we determine a set of documents \(S_{p_i}\) that is within distance k of each \(S_{(i,j)}\) (such a set must exist, since no two document sets in the partition are further than 2k apart). The term centroid \(C_{p_i}\) for this “artificial cell” is then calculated by aggregating together all the document term vectors found in \(S_{p_i}\).
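
A sketch of the clustering loop, assuming the \((c_i, c_j, \delta)\) triples arrive in ascending \(\delta\) order and that pairwise symmetric-difference distances are available; a simple dict-of-sets stands in for the disjoint set structure G:

    def cluster_cells(cells, pairs, k, dist):
        """cells: cell IDs; pairs: (ci, cj, delta) sorted by ascending delta;
        dist(ci, cj): symmetric-difference distance between two cells.
        Greedily merge partitions while every pair of members stays within 2k."""
        cluster = {c: {c} for c in cells}     # each cell starts in its own partition
        for ci, cj, delta in pairs:
            p, q = cluster[ci], cluster[cj]
            if p is q or delta > 2 * k:
                continue
            # complete-linkage check: all cross-pair distances must be <= 2k
            if all(dist(x, y) <= 2 * k for x in p for y in q):
                merged = p | q
                for c in merged:
                    cluster[c] = merged
        unique = {id(s): s for s in cluster.values()}
        return list(unique.values())          # the partitions p_1, ..., p_n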

This strategy requires all the \(C_{p_i}\) measures to be materialized, with the required cell centroids computed from the closest materialized artificial cell. To accomplish this, we store a table with four attributes: cell, cluster, docsToAdd, and docsToRemove, where cell identifies a conjunction of tags, cluster identifies the partition to which that cell is assigned, docsToAdd is the set of document IDs in the cell but missing when computing the partition centroid, and docsToRemove is the set of document IDs included when computing the partition's centroid but missing from the cell. The documents in docsToAdd and docsToRemove are retrieved and combined with the partition's centroid measure (added and removed, respectively) to determine the centroid for the cell. By construction, \(| docsToAdd |+| docsToRemove |\le k\).
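
Deriving a cell centroid from the cluster reference table then reduces to the following sketch, where row mirrors the four attributes above and combine and fetch_dtv are the helpers assumed earlier:

    def centroid_from_cluster(row, cluster_centroids, fetch_dtv):
        """row: (cell, cluster, docsToAdd, docsToRemove).
        By construction |docsToAdd| + |docsToRemove| <= k."""
        C = cluster_centroids[row.cluster]
        for doc in row.docsToAdd:
            C = combine(C, fetch_dtv(doc), sign=+1)
        for doc in row.docsToRemove:
            C = combine(C, fetch_dtv(doc), sign=-1)
        return C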

Thus, the following steps are performed at query time:

  1. Transform the query into the equivalent representation R using the inclusion–exclusion principle.

  2. For each resulting conjunction j, look up the cluster centroid measure \(C_{p_j}\) in the cluster reference table.

  3. If there is no match, retrieve and aggregate the document term vectors to generate the corresponding centroid measure.

  4. Otherwise, retrieve the documents listed in the docsToAdd and docsToRemove attributes and aggregate them with \(C_{p_j}\).

  5. Combine the centroid measures in accordance with R.

Algorithm 3 Materialization of cluster centroids (MCC)

6.4 Comparative example of materialization strategies

In this section we demonstrate how the choice of materialization strategy affects which sets of documents have their centroids materialized, and how those materialized centroids are used to answer queries over tags. For simplicity, we assume a collection of six documents \(d_1, \ldots, d_6\); a set of three tags \(\{t_1, t_2, t_3\}\), assigned to documents as specified in Table 9; and a materialization threshold \(k=2\).

Table 9 Tag assignment to documents

Table 10 shows the document sets for which centroids are stored under each of the three materialization strategies. The TM approach materializes centroids for all document sets of size greater than 2 after considering every conjunction of tags, which results in seven materialized document sets. With TMA, only three document set centroids are materialized (corresponding to the document sets for the single tag queries \(t_1\), \(t_2\), and \(t_3\)), since every other conjunction is a descendant of one of these and differs from the materialized set by a symmetric difference of no more than the threshold 2. With MCC, only a single centroid is materialized (for a document set that corresponds to no tag conjunction), from which the centroids of all the conjunctive queries can be derived using no more than 2 DTVs each.

Table 10 Materialized cells for TM, TMA, and MCC materialization strategies

Table 11 demonstrates how the materialized centroids produced by each of the three materialization strategies are used to derive a centroid for various conjunctive queries over tags. Since with the TM strategy the materialized centroids match the document sets requested by the queries, no further work needs to be performed to produce the required results. When relying on the TMA and MCC strategies, the centroid for a query answer is derived by aggregating the centroid of the closest materialized document set with DTVs from individual documents.

Table 11 Answering conjunctive queries with TM, TMA, and MCC materialization strategies

7 Performance of partial materialization

Our approach to partial materialization was designed to perform well for the tagging patterns we observed in NYT and ACM. In this section, we show that it indeed does well for those collections, as well as for PubMed, whose average tag count per document falls between those of NYT and ACM, whose numbers of documents and distinct tags are larger than NYT's (even when considering only the major tags), and whose depth of tag co-occurrence is shallower than for either of the other two collections (compare Table 12 to Table 2).

Table 12 Number of conjunctions of n tags that contribute to high multi-way co-occurrence for PubMed, with threshold limit of 50

In order to demonstrate the approach’s wider applicability, we also developed a generative model for tagged collections [13], which we have used to synthesize a variety of realistic collections. In particular, we adapted earlier work on topic modeling [4] to generate tags instead of document content and then fit the model to NYT and ACM, resulting in NYT model and ACM model, respectively. By substituting other values for some of the parameters of the model (including the topic distribution vector, topic correlations, tag distribution for each topic, distribution of tag counts for each topic, and minimum topic presence per document to warrant a tag), we generated six additional synthetic collections (see Table 13). For example, reducing the minimum number of mentions of a topic to warrant using a tag induces more tags per document; modifying the entries in the topic covariance matrix affects the number of times certain topics co-occur in documents; increasing the mean of the tag counts in the topic distribution induces more tags per topic and thus more tags per document; and lowering the \(\gamma \) parameter that defines the Dirichlet distribution from which tag distribution for topics is drawn increases the dominance of the most popular tags. We display the resulting effects on tag distributions at the bottom of Table 13.

Table 13 Input parameters and resulting tagging patterns for synthetic collections

In the remainder of this section, the performance of the three partial materialization strategies (TM, TMA, and MCC, all based on the proposed I(T) storage architecture defined in Sect. 5.3.2) is compared against six other materialization strategies. Specifically, we compare them against shell fragment materialization (SF) with both 3D and 6D cuboids, thin cube shell materialization (CS) with both 3D and 6D cuboids, materialization of the hierarchical clusters used by Scatter/Gather (SG), and a standard IR approach, which is equivalent to the strategy with no materialization (NM).

Shell fragments are nonoverlapping, and for the evaluation they are chosen using a greedy heuristic that builds fragments that can answer the longest conjunctions. Similarly, thin cube shells were generated by a greedy algorithm tuned to minimize the number of stored cuboids while ensuring that all conjunctions up to a fixed length can be answered from materialized cuboids. For both the shell fragment and thin cube shell approaches, each d-dimensional cuboid records its d dimensions and a centroid measure. In addition, both approaches require a mechanism to map tags to the corresponding dimension column of the cuboid. If no cuboid contains all the tags found in a query, that query is answered by accessing the postings list of documents for each tag (as is done under any of the strategies when no appropriate centroid is materialized).

The LAIR2 [27] implementation of Scatter/Gather was designed to support a browsing interface in which exploration of the collection starts from a root set of document clusters and then follows various paths offered by the stored cluster hierarchy. To support this interface, it is necessary to store the hierarchy, as well as the centroid, representative documents, and most frequent terms for each cluster that corresponds to a node in the hierarchy. This infrastructure, however, does not provide support for efficient computation of a centroid for a set of documents that results from a conjunctive tag query, since it is extremely unlikely that the query result set matches any of the stored document sets exactly. As a result, Scatter/Gather must revert to online aggregation from the base documents like standard IR approaches.

The performance of each of the nine strategies is evaluated in terms of the amount of storage space it consumes and the execution cost of answering queries. Precision is not a consideration, since all matches are exact, as required by our problem definition.

Table 14 Storage cost comparison for materialization strategies (megabytes)

7.1 Storage cost

Table 14 shows the storage space consumed by each of the materialization strategies. Every strategy, including NM (no materialization), requires postings lists to map from tags to sets of documents associated with those tags, in order to answer queries that must rely on the base data (or to compensate for document sets that do not match the stored centroids exactly when using the I(T) strategies). Thus the first column reflects the space for the postings lists alone. The remaining columns show the additional space used by each algorithm to store the materialized cells and other required supporting data when a centroid is represented by 500 terms. Figure 7 depicts the storage requirement as a multiplier with respect to the space used by TM.

Recall that, in addition to the cluster centroids, MCC must store a cluster reference table with as many as k document IDs for every cell that would be materialized by TM. In practice, many fewer IDs need to be stored: the mean number of documents in the difference between an unmaterialized cell and the corresponding cluster centroid is 8.1 for NYT, 0.75 for ACM, and 4.6 for PubMed; it is 0.9 for both NYT model and ACM model; and it is similarly low for the other six synthetic collections. As a result, even with the additional cost of storing the cluster reference table, MCC requires less space than either TM or TMA.

The shell fragment strategy using 3D cuboids consumes less space than other partial materialization strategies for all collections other than ACM, ACM model, and PubMed, but it exhibits very poor query time performance, as shown in the next section.

In order to save clustering time, Scatter/Gather requires space that is proportional to the collection size but independent of the tag patterns, so it performs relatively well on small collections with rich tag structures, such as S4. To provide a fair comparison to our approaches, we applied a similar partial materialization strategy for storing the clusters in the hierarchy: only nodes exceeding the predefined size threshold k for each collection are materialized. As a result, the number of nodes that must be materialized for Scatter/Gather depends on the form of the dendrogram representing the cluster hierarchy: if k individual documents are always clustered before any two clusters are merged, \(\frac{N}{k}\) clusters need to be stored, but in the worst case the dendrogram is essentially linear and \(N-k\) clusters might need to be materialized. The corresponding space ranges that bracket the actual space required by Scatter/Gather are shown in the final column of Table 14. Scatter/Gather does not scale well to very large collections such as NYT and PubMed, where at least 25 % (NYT) and at least 18 % (PubMed) more space is required than when using MCC. For the ACM collection the partially materialized cluster hierarchy for LAIR2 might be competitive in space with MCC, but when we performed hierarchical clustering on the documents in the ACM collection using single linkage and cosine similarity, we found that 208 MB would be required; this is closer to the upper bound of the range than to the lower bound and 2.6 times as much space as required by MCC. We hypothesize that for the other collections the actual storage space used by LAIR2 would similarly be much higher than the theoretical lower bound.

Fig. 7 Space multiplier relative to TM

In summary, the TM, TMA, and MCC approaches, which are based on the individual cell materialization architecture I(T), typically require fewer cells to be stored, and thus less storage space, than alternative approaches, and among these three, the MCC approach is the most storage efficient.

Table 15 Mean number of cells aggregated per query (with standard deviations \(\sigma \))
Fig. 8 Mean number of cells accessed per query

7.2 Query execution cost

Table 15 and Fig. 8 show the number of cell accesses required to answer all possible conjunctive queries whose results contain more than k documents. The TM and Scatter/Gather strategies are not included in the table. TM precalculates and materializes all answers to such queries, and hence its cost is universally 1 (with standard deviation 0). The Scatter/Gather approach retrieves centroids of nodes stored in its cluster hierarchy with the query time performance expected of a fully materialized solution; however, it was not designed to calculate centroids for document sets that do not map to any node in the hierarchy. For almost all conjunctive queries, therefore, Scatter/Gather lacks the infrastructure to compute centroids from its materialized data and must resort to the approach with no materialization. As a result, the mean number of cells accessed per query for SG is just marginally less than the figures shown in the NM column of the table.

Since the query time is proportional to the average number of cells that need to be aggregated, these results represent comparative run times when each centroid is equally likely to be requested. By design, all three strategies based on individual cell materialization (TM, TMA, and MCC) have a computation cost that is below the collection-specific computation threshold k and below that of thin cube shells or shell fragments. On the other hand, when no cells are materialized (NM), that is, when IR is used alone, or when Scatter/Gather is used with or without prematerialization, the computational cost far exceeds what acceptable response times allow.

Thus, summarizing space and time, all three I(T) approaches have excellent query time performance (low mean and small standard deviation) while generally consuming the least storage among the materialization techniques compared. Furthermore, the three options provide a good space–time tradeoff, with TM being the largest and fastest and MCC the smallest but slowest for all 11 data collections (see Fig. 9, where TM resides at (1.00, 1), the lower right-hand corner of the graph, for each collection). Scatter/Gather can be fast, but only if it consumes an unacceptable amount of space for large collections and only if the query workload is severely restricted to the sets of documents that happen to be chosen by the clustering algorithm. Shell fragments with 3D cuboids can be space-efficient, but only at the expense of unacceptably slow execution.

Fig. 9 Space–time tradeoff for I(T) strategies, where all points are relative to TM (for each collection, the upper left point represents MCC and the lower right point represents TMA)

8 Conclusions and future work

The tagging patterns found in two multi-tagged document collections were analyzed and found to have properties that are not common in the data for which standard OLAP systems have been designed. In particular, multi-tagged document collections include many distinct tags and significant co-occurrence among tags. Based on the observed properties of the collections, we proposed a storage architecture that materializes individual cells from aggregated data cubes, taking cell similarities into account. The proposed partial materialization strategies were evaluated on three real collections and eight synthetic collections, comparing them against previously proposed partial materialization strategies. We showed that the new strategies can provide significant savings in storage space and query time over the existing strategies and that a space–time tradeoff remains when choosing among TM, TMA, and MCC.

In this paper we focused on efficient storage and calculation of the set centroid measure, which is an essential input for deriving a medoid, a larger set of representative documents, or a set of representative terms. We will continue to explore how to use this infrastructure to build a prototype browser that supports exploration of multi-tagged document collections through efficient navigation based on tags [14].