
Classifying and querying very large taxonomies with bit-vector encoding

Journal of Intelligent Information Systems

Abstract

In this article, we address the question of how efficiently Semantic Web (SW) reasoners perform in processing (classifying and querying) taxonomies of enormous size and whether it is possible to improve on existing implementations. We use a bit-vector encoding technique to implement taxonomic concept classification and Boolean-query answering. We describe the technique we have used, which achieves high performance, and discuss implementation issues. We compare the performance of our implementation with those of the best existing SW reasoning systems over several very large taxonomies under the exact same conditions for so-called TBox reasoning. The results show that our system is among the best for concept classification and several orders of magnitude more efficient in terms of response time for query answering. We present these results in detail and comment on them. We also discuss pragmatic issues such as cycle detection and decoding.


Notes

  1. In Semantic Web lingo, a Knowledge Base (KB) is defined as a formal ontology consisting of two parts (or “boxes”): (1) a Terminological Box (abbreviated as TBox); and, (2) an Assertional Box (abbreviated as ABox). The TBox contains the formal axioms that define the structure and semantic properties of the actual instance data, and these instance data constitute the ABox. In Database lingo, the TBox corresponds to the schema and the ABox to the actual data.

  2. http://www.cs.ox.ac.uk/isg/tools/ELK/

  3. http://code.google.com/p/cel/

  4. http://code.google.com/p/cb-reasoner/

  5. http://owl.cs.manchester.ac.uk/fact++/

  6. http://www.hermit-reasoner.com/

  7. http://clarkparsia.com/pellet/

  8. http://trowl.eu/

  9. http://www.racer-systems.com/products/racerpro/

  10. http://research.ict.csiro.au/software/snorocket

  11. Description Logics in the \(\mathcal {EL}\)-family are weaker versions that provide existential roles (∃r.C) but no universal roles (∀r.C).

  12. To the best of our knowledge, this is the latest best bound as of 2011. However, these algorithms are not implementable due to the prohibitive size of their constants. For more recent work on parallelizing Strassen’s algorithm, see (Ballard et al. 2014). This, however, requires special hardware (GPGPUs).

  13. op. cit., pages 125–126.

  14. See Section 4.2.

  15. http://www.h-its.org/en/research/nlp/wikitaxonomy/

  16. http://bioportal.bioontology.org/ontologies/3022

  17. http://www.nlm.nih.gov/mesh/meshhome.html

  18. http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/

  19. See Section 3.3 for a discussion concerning this point.

  20. We use “ &” to denote “and,” and “ |” to denote “or.”

  21. We implemented such a facility—see Appendix.

  22. http://en.wikipedia.org/wiki/Quicksort

  23. The ordering on bit-vector codes is simply defined as \(c_{1}\leq c_{2}\) iff \(c_{1}=c_{1}~\&~c_{2}\).

  24. In the same manner as we have noticed that SnoRocket does.

  25. http://cedar.liris.cnrs.fr/data/CEDAR-V1.0.zip

  26. http://cedar.liris.cnrs.fr/demos.html

  27. Recall that a bit vector is written with its lowest bit to the right.

  28. In what follows, we shall use the “dot” notation of object-oriented methods to denote all functions or operations on codes.

  29. Note that \(i\neq j\) necessarily implies that \(i<j\) (by Condition (1) and since \(n<m\)).

  30. Actually, the java.util.TreeSet does maintain a doubly-linked list for its elements in order to ensure its two ordered iterators (ascending and descending). But this structure is not made public and one cannot splice in new elements from a given found element. But it is a simple matter to modify the source code of java.util.TreeSet.java and adapt it to what is needed.

  31. In fact, they see that only as a possible a posteriori optimization, but one that would cause their optimal spanning-tree finding algorithm to be incorrect if applied incrementally while it is executed.
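The ordering of Note 23 and the “&” and “|” operators of Note 20 can be illustrated with plain integers used as bit vectors. This is only a sketch: the function name `code_leq` and the three concept codes below are hypothetical, chosen on the assumption that a concept's code contains its own bit plus the bits of its subconcepts.

```python
def code_leq(c1: int, c2: int) -> bool:
    """Return True iff c1 <= c2 in the code ordering, i.e. c1 == c1 & c2."""
    return c1 == (c1 & c2)


# Hypothetical codes (not taken from the article): bit 0 = dog, bit 1 = cat,
# bit 2 = mammal, with mammal subsuming both dog and cat.
DOG = 0b001
CAT = 0b010
MAMMAL = 0b111

print(code_leq(DOG, MAMMAL))  # dog is subsumed by mammal
print(code_leq(MAMMAL, DOG))  # but not the other way round
print(bin(DOG | CAT))         # code of the Boolean query "dog | cat"
print(bin(DOG & CAT))         # code of "dog & cat": no common bit
```

With Python integers the codes can be arbitrarily long, so the same test applies unchanged to wide bit vectors.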


Acknowledgments

The authors wish to thank Prof. Mohand-Saïd Hacid and the anonymous referees for constructive feedback.

Conflict of interests

The authors declare that they have no conflict of interest.

Author information

Corresponding author

Correspondence to Hassan Aït-Kaci.

Additional information

Funding

This work was carried out as part of the 𝓒𝓔𝓓𝓐𝓡 Project (Constraint Event-Driven Automated Reasoning) under the Agence Nationale de la Recherche (ANR) Chair of Excellence grant N ° ANR-12-CHEX-0003-01.

Appendices

Appendix

In Section A, we give an overview of an implementation specification for representing very large bit vectors, reducing memory-space consumption while retaining efficient operations. In our experiments, this alternative code representation was used only for saving an encoded taxonomy on disk and reloading it as a pre-encoded order. But for taxonomies of even larger size than those used in our experiments, it could be used for lattice operations as well.

A Compact codes

While the foregoing sections present a method for encoding elements of a partially ordered set based on transitive closure, the data structure it relies on is that of a binary word—i.e., a bit vector. With such a structure, all boolean operations—and, or, not—are thus very efficient. This representation also eases computation of the transitive closure since setting a bit on or off is trivially accommodated.

However, while this representation is convenient and time-efficient for relatively small posets of the order of a few hundred elements, it quickly becomes space-inefficient for large posets of hundreds of thousands, or millions of elements.

In what follows, we define an alternative representation of indexed bit sets that offers the advantage of being more compact than bit vectors while retaining time-efficient boolean and bit-setting operations. It is also the format we use to save/load encoded taxonomies on/from disk. In Section A.1, the basic data structure is defined. In Section A.2, bit setting and unsetting operations are defined. In Section A.3, the three boolean operations—conjunction, disjunction, and negation—are defined. In Section A.4, some implementation considerations are discussed.

A.1 Bit code representation

The idea is intuitively simple. It consists of representing a bit vector as a finite array of \(k\) (\(k\in \mathbb {N}\)) pairs of integer indices \(\langle l_{i},u_{i}\rangle\), for \(i=0,\dots ,k-1\), such that, for all indices \(i=0,\dots ,k-2\):

$$ 0 \leq l_{i} < u_{i} < l_{i+1} < u_{k-1}. $$
(1)

We shall refer to such a sequence of \(k\) pairs, \(k\in \mathbb {N}\), \(\{~\langle l_{i},u_{i}\rangle ~|~i=0,\ldots ,k-1~\}\), as a compact code. For \(k=0\), this is written as the empty sequence \(\{\}\).

Given a compact code representation of a bit vector \(V\), each pair \(\langle l,u\rangle\) represents a maximal contiguous sequence of 1’s (hereafter referred to as a “packet”) in \(V\). Thus, the \(i\)-th packet of a bit vector is represented as the pair of indices \(\langle l_{i},u_{i}\rangle\) such that \(l_{i}\) is the index of the lowest bit in the packet, and \(u_{i}\) is the index of the first 0-bit following the packet.

For example, the bit vector 0011111001111000000110000 corresponds to the compact code (i.e., sequence of packet pairs) \(\{~\langle 4,6\rangle ,\langle 12,16\rangle ,\langle 18,23\rangle ~\}\) (see Note 27).

The empty bit vector (containing all 0’s) is represented as the empty sequence \(\{\}\). The length of a bit vector represented by a compact code sequence of \(k\) pairs (or packets) is \(u_{k-1}\). The size of a compact code \(C\) of \(k\) pairs (or packets) is \(k\) (i.e., its number of packets).
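The correspondence between a bit vector and its compact code can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the bit vector is given as a string with its lowest bit to the right (as in Note 27), a compact code is a Python list of `(l, u)` tuples, and the names `to_compact` and `to_bits` are hypothetical.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def to_compact(bits: str) -> Code:
    """Convert a bit string (lowest bit rightmost) into its packets."""
    code: Code = []
    start = None  # lowest index of the packet being scanned, if any
    # Scan from the lowest bit (rightmost character) upwards.
    for index, bit in enumerate(reversed(bits)):
        if bit == "1" and start is None:
            start = index                  # a new packet begins here
        elif bit == "0" and start is not None:
            code.append((start, index))    # first 0-bit after the packet
            start = None
    if start is not None:                  # packet reaching the highest bit
        code.append((start, len(bits)))
    return code


def to_bits(code: Code, width: int) -> str:
    """Expand a compact code back into a bit string of the given width."""
    chars = ["0"] * width
    for l, u in code:
        for i in range(l, u):
            chars[i] = "1"
    return "".join(reversed(chars))


example = "0011111001111000000110000"
print(to_compact(example))  # [(4, 6), (12, 16), (18, 23)]
```

The space saving comes from storing one pair per run of 1's rather than one bit per position, so it depends on how few packets a code has, not on its length.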

A.2 Bit operations

Let \(C=\{~\langle l_{i},u_{i}\rangle ~|~i=0,\ldots ,k-1~\}\) be a compact code of \(k\) packets (\(k>0\)). Given a number \(n\in \mathbb {N}\), we say that (see Note 28):

  • n is within a packet of C iff \(\exists i\in [0,k-1]\) such that \(l_{i}\leq n<u_{i}\), in which case we shall write \(C.\mathbf{packet}(n)=i\);

  • n is between packets of C iff one of the following three statements holds:

    1. \(n<l_{0}\); or,

    2. \(u_{k-1}\leq n\); or,

    3. \(\exists i\in [0,k-2]\) such that \(u_{i}\leq n<l_{i+1}\).

If a number n is between packets of a compact code C of size k, we define two functions \(C.\mathbf{prev}(n)\) and \(C.\mathbf{next}(n)\) for each of the three possible respective cases above as follows (where the symbol ‘?’ means “undefined”):

  1. \(C.\mathbf {prev}(n) \overset {def}{=\!\!=}\ \texttt {?}\) and \(C.\mathbf {next}(n) \overset {def}{=\!\!=}\ l_{0}\);

  2. \(C.\mathbf {prev}(n) \overset {def}{=\!\!=}\ u_{k-1}\) and \(C.\mathbf {next}(n) \overset {def}{=\!\!=}\ \texttt {?}\);

  3. \(C.\mathbf {prev}(n) \overset {def}{=\!\!=}\ u_{i}\) and \(C.\mathbf {next}(n) \overset {def}{=\!\!=}\ l_{i+1}\).

For such a number n, we say that:

  • n is left-adjacent in C if \(n=C.\mathbf{next}(n)-1\);

  • n is right-adjacent in C if \(n=C.\mathbf{prev}(n)\);

  • n is adjacent in C if it is both left-adjacent and right-adjacent in C.

Note that if n is between packets and adjacent, this necessarily means that the two packets on each side are only separated by a single 0-bit (the bit in position n in the denoted bit vector).
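The definitions above translate almost literally into code. The following is a minimal sketch, assuming a compact code is a sorted Python list of `(l, u)` tuples; the function names are hypothetical, `None` plays the role of ‘?’, and the linear scans favour readability over speed.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def packet(code: Code, n: int):
    """Index i of the packet with l_i <= n < u_i, or None if n is between packets."""
    for i, (l, u) in enumerate(code):
        if l <= n < u:
            return i
    return None


def prev(code: Code, n: int):
    """Upper index of the packet preceding n (n between packets), or None."""
    uppers = [u for (_, u) in code if u <= n]
    return uppers[-1] if uppers else None


def next_(code: Code, n: int):
    """Lower index of the packet following n (n between packets), or None."""
    lowers = [l for (l, _) in code if n < l]
    return lowers[0] if lowers else None


def is_left_adjacent(code: Code, n: int) -> bool:
    nxt = next_(code, n)
    return nxt is not None and n == nxt - 1


def is_right_adjacent(code: Code, n: int) -> bool:
    return prev(code, n) == n


code = [(4, 6), (12, 16), (18, 23)]
print(packet(code, 5), prev(code, 6), next_(code, 6))
```

For the code above, position 6 is right-adjacent (it directly follows the first packet) and position 11 is left-adjacent (it directly precedes the second one).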

N.B.: In all the compact code expressions to follow, we use the implicit convention that a packet with undefinable bounds is simply omitted. Thus, we will always use the notation \(\{~\langle l_{0},u_{0}\rangle ,\ldots ,\langle l_{k-1},u_{k-1}\rangle ~\}\) to denote a compact code, where \(k\geq 0\), up to the above conventions regardless of the actual number of packets. For example, for \(k=0\) this will correspond to the empty code \(\{\}\), and for \(k=1\), this will correspond to the single-packet code \(\{~\langle l_{0},u_{0}\rangle ~\}\).

We define the following bit-setting operations on C. These methods operate “in place” by modifying a code C that invokes them.

  • \(C.\mathbf{set}(n)\), for \(n\in \mathbb {N}\), which sets the n-th bit of the bit vector denoted by C to 1.

  • \(C.\mathbf{set}(n,m)\), for \(n,m\in \mathbb {N}, n<m\), which sets to 1 all the bits from position n (inclusive) to position m (exclusive) of the bit vector denoted by C.

  • \(C.\mathbf{unset}(n)\), for \(n\in \mathbb {N}\), which sets the n-th bit of the bit vector denoted by C to 0.

  • \(C.\mathbf{unset}(n,m)\), for \(n,m\in \mathbb {N}, n<m\), which sets to 0 all the bits from position n (inclusive) to position m (exclusive) of the bit vector denoted by C.

For \(m\leq n\), both \(C.\mathbf{set}(n,m)\) and \(C.\mathbf{unset}(n,m)\) are no-ops, i.e., they leave C unchanged. Since \(\mathbf{set}(n)\) is equivalent to \(\mathbf{set}(n,n+1)\), and \(\mathbf{unset}(n)\) to \(\mathbf{unset}(n,n+1)\), we give only the methods for \(\mathbf{set}(n,m)\) and \(\mathbf{unset}(n,m)\).

A.2.1 Bit setting

There are four cases to consider for which performing \(C.\mathbf{set}(n,m)\) modifies C as follows.

  1. If n is within a packet in C (say, \(C.\mathbf{packet}(n)=i\)) and m is within a packet in C (say, \(C.\mathbf{packet}(m)=j\)), then, if \(i=j\), \(C.\mathbf{set}(n,m)\) leaves C unchanged. Else (if \(i<j\); see Note 29), C becomes:

    $$\{~\ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}. $$

  2. If n is within a packet in C (say, \(C.\mathbf{packet}(n)=i\)) and m is between packets in C, then C becomes:

    $$ \left\{ \begin{array}{l} \{~ \ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}\\ ~~~\text{if}~m~\text{is~left-adjacent~in}~C~\text{and}~C.\mathbf{next}(m) = l_{j}; \\\\ \{~\ldots, \langle{l_{i},m}\rangle, \ldots~\}\\ ~~~\text{otherwise}. \end{array} \right. $$
    (2)

  3. If n is between packets in C and m is within a packet in C (say, \(C.\mathbf{packet}(m)=j\)), then C becomes:

    $$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}\\ ~~~\text{if}~n~\text{is~right-adjacent~in}~C~\text{and}~C.\mathbf{prev}(n) = u_{i}; \\\\ \{~\ldots, \langle{n,u_{j}}\rangle, \ldots~\}\\ ~~~\text{otherwise}. \end{array} \right. $$
    (3)

  4. If both n and m are between packets in C, then C becomes:

    $$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}\\ ~~~\text{if}~n~\text{is~right-adjacent~in}~C~\text{and}~C.\mathbf{prev}(n) = u_{i},~\text{and}\\ ~~~\text{if}~m~\text{is~left-adjacent~in}~C~\text{and}~C.\mathbf{next}(m) = l_{j}; \\\\ \{~\ldots, \langle{l_{i},m}\rangle, \ldots~\}\\ ~~~\text{if}~n~\text{is~right-adjacent~in}~C~\text{and}~C.\mathbf{prev}(n) = u_{i},~\text{and} \\ ~~~\text{if}~m~\text{is~not~left-adjacent~in}~C; \\\\ \{~\ldots, \langle{n,u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n~\text{is~not~right-adjacent~in}~C,~\text{and}\\ ~~~\text{if}~m~\text{is~left-adjacent~in}~C~\text{and}~C.\mathbf{next}(m) = l_{j}; \\\\ \{~\ldots, \langle{n,m}\rangle, \ldots~\}\\ ~~~\text{otherwise}. \end{array} \right. $$
    (4)
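The effect of bit setting can be sketched in a few lines. This is an illustration under assumptions rather than a transcription of the four cases: a compact code is a sorted Python list of `(l, u)` tuples, the hypothetical function `set_range` returns a new list instead of modifying the code in place, and it follows the half-open reading of set(n, m) given in the definition (position n included, position m excluded). Boundary behaviour should therefore be checked against the case analysis before relying on it.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def set_range(code: Code, n: int, m: int) -> Code:
    """Return a copy of code with bits n (inclusive) to m (exclusive) set to 1."""
    if m <= n:                      # no-op, as stated in the text
        return list(code)
    result: Code = []
    lo, hi = n, m                   # the packet being built
    placed = False
    for l, u in code:
        if u < lo:                  # strictly below, not even touching
            result.append((l, u))
        elif hi < l:                # strictly above, not even touching
            if not placed:
                result.append((lo, hi))
                placed = True
            result.append((l, u))
        else:                       # overlaps or touches: absorb it
            lo, hi = min(lo, l), max(hi, u)
    if not placed:
        result.append((lo, hi))
    return result


code = [(4, 6), (12, 16), (18, 23)]
print(set_range(code, 8, 10))   # a new packet appears between two others
print(set_range(code, 6, 12))   # filling a gap merges the two neighbours
```

Absorbing packets that merely touch the new range is what keeps every packet maximal, so that Condition (1) continues to hold after the update.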

A.2.2 Bit unsetting

Here again, there are four cases to consider for which performing \(C.\mathbf{unset}(n,m)\) modifies C as follows.

  1. If both n and m are between packets in C: if \(C.\mathbf{prev}(n)=C.\mathbf{prev}(m)\) (or, equivalently, if \(C.\mathbf{next}(n)=C.\mathbf{next}(m)\)), then \(C.\mathbf{unset}(n,m)\) leaves C unchanged; else, C becomes:

    $$\{~\ldots, \langle{l_{i},C.\mathbf{prev}(n)}\rangle,\langle{C.\mathbf{next}(m),u_{j}}\rangle, \ldots~\}. $$

  2. If n is within a packet in C (say, \(C.\mathbf{packet}(n)=i\)) and m is between packets in C, then C becomes:

    $$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i-1},u_{i-1}}\rangle, \langle{C.\mathbf{next}(m),u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n=l_{i},~\text{where}~C.\mathbf{next}(m)=l_{j}; \\ \\ \{~\ldots, \langle{l_{i},n}\rangle, \langle{C.\mathbf{next}(m),u_{j}}\rangle, \ldots~\} \\~~~ \text{else}~(\textit{i.e.},~\text{if}~n>l_{i}),~\text{where}~C.\mathbf{next}(m)=l_{j}. \end{array} \right. $$
    (5)

  3. If n is between packets in C and m is within a packet in C (say, \(C.\mathbf{packet}(m)=j\)), then C becomes:

    $$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i},C.\mathbf{prev}(n)}\rangle, \langle{l_{j+1},u_{j+1}}\rangle, \ldots~\} \\ ~~~\text{if}~m=u_{j}-1,~\text{where}~C.\mathbf{prev}(n)=u_{i}; \\\\ \{~\ldots, \langle{l_{i},C.\mathbf{prev}(n)}\rangle, \langle{m,u_{j}}\rangle, \ldots~\} \\ ~~~\text{else}~(\textit{i.e.},~\text{if}~m<u_{j}-1),~\text{where}~C.\mathbf{prev}(n)=u_{i}. \end{array} \right. $$
    (6)

  4. If n is within a packet (say, \(C.\mathbf{packet}(n)=i\)) and m is within a packet (say, \(C.\mathbf{packet}(m)=j\)) in C, then C becomes:

    $$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i-1},u_{i-1}}\rangle, \langle{l_{j+1},u_{j+1}}\rangle, \ldots~\} \\ ~~~\text{if}~n=l_{i},~\text{and} \\ ~~~\text{if}~m=u_{j}-1; \\\\ \{~\ldots, \langle{l_{i-1},u_{i-1}}\rangle, \langle{m,u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n=l_{i},~\text{and} \\ ~~~\text{if}~m<u_{j}-1; \\\\ \{~\ldots, \langle{l_{i},n}\rangle, \langle{l_{j+1},u_{j+1}}\rangle, \ldots~\} \\ ~~~\text{if}~n>l_{i},~\text{and} \\ ~~~\text{if}~m=u_{j}-1; \\\\ \{~\ldots, \langle{l_{i},n}\rangle, \langle{m,u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n>l_{i},~\text{and} \\ ~~~\text{if}~m<u_{j}-1. \end{array} \right. $$
    (7)
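Bit unsetting can be sketched in the same spirit. Again this is an illustration under assumptions, not a transcription of the four cases: a compact code is a sorted Python list of `(l, u)` tuples, the hypothetical function `unset_range` returns a new list rather than working in place, and it follows the half-open reading of unset(n, m) from the definition (position m itself is left untouched), so its behaviour at packet boundaries should be compared with the case analysis.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def unset_range(code: Code, n: int, m: int) -> Code:
    """Return a copy of code with bits n (inclusive) to m (exclusive) set to 0."""
    if m <= n:                      # no-op, as stated in the text
        return list(code)
    result: Code = []
    for l, u in code:
        if u <= n or m <= l:        # packet lies entirely outside [n, m)
            result.append((l, u))
            continue
        if l < n:                   # the part below n survives
            result.append((l, n))
        if m < u:                   # the part from m upwards survives
            result.append((m, u))
    return result


code = [(4, 6), (12, 16), (18, 23)]
print(unset_range(code, 13, 15))  # splits the middle packet in two
print(unset_range(code, 5, 20))   # trims two packets and removes one
```

Unsetting never creates adjacent packets, so no merging step is needed here, in contrast with bit setting.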

A.3 Boolean operations

Let:

$$C = \{~\langle{l_{i},u_{i}}\rangle~|~i=0,\ldots,k-1~\} $$

and:

$$C' = \{~\langle{l'_{i},u'_{i}}\rangle~|~i=0,\ldots,k'-1~\} $$

be two compact code pair sequences, with \(k\geq 0\) and \(k'\geq 0\).

A.3.1 Conjunction

Invoking \(C.\mathbf{and}(C')\) will modify C according to C′ by unsetting all the bits in C that are between packets in C′, leaving C′ unchanged.

If \(C=\{\}\), then C is left unchanged; else, if \(C'=\{\}\), then C becomes \(\{\}\).

Else (i.e., if \(k>0\) and \(k'>0\)), C is modified by invoking:

  • \(C.\mathbf{unset}(0,l'_{0})\); and,

  • \(C.\mathbf{unset}(u'_{i},l'_{i+1})\), for \(i=0\) up to \(i=k'-2\); and,

  • \(C.\mathbf{unset}(u'_{k'-1},u_{k-1})\).

Note that in practice, when proceeding in the above order, as soon as the first argument of any \(\mathbf{unset}(\ldots)\) is greater than or equal to \(u_{k-1}\), there is no need to perform the unsetting nor to proceed any further.
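The conjunction recipe, including the early exit, can be sketched as follows. The sketch assumes a compact code is a sorted Python list of half-open `(l, u)` tuples; `unset_range` and `and_code` are hypothetical names, and both return new lists instead of modifying C in place.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def unset_range(code: Code, n: int, m: int) -> Code:
    """Clear bits n (inclusive) to m (exclusive); a no-op when m <= n."""
    result: Code = []
    for l, u in code:
        if m <= n or u <= n or m <= l:
            result.append((l, u))       # untouched packet
            continue
        if l < n:
            result.append((l, n))       # surviving lower part
        if m < u:
            result.append((m, u))       # surviving upper part
    return result


def and_code(c: Code, c2: Code) -> Code:
    """Conjunction: clear in c every bit lying between packets of c2."""
    if not c or not c2:
        return []
    top = c[-1][1]                                  # u_{k-1} of c
    result = unset_range(c, 0, c2[0][0])            # below the first packet of c2
    for (_, u), (l_next, _) in zip(c2, c2[1:]):     # gaps between packets of c2
        if u >= top:                                # early exit noted in the text
            return result
        result = unset_range(result, u, l_next)
    return unset_range(result, c2[-1][1], top)      # above the last packet of c2


print(and_code([(4, 6), (12, 16), (18, 23)], [(5, 14), (20, 30)]))
```

Seen as sets of integers, the two arguments above are {4, 5, 12..15, 18..22} and {5..13, 20..29}, and the result denotes their intersection.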

A.3.2 Disjunction

Invoking \(C.\mathbf{or}(C')\) will modify C according to C′ by setting all the bits in C that are within packets in C′, leaving C′ unchanged.

If \(C'=\{\}\), then C is left unchanged; else, if \(C=\{\}\), then C becomes (a copy of) C′.

Else (i.e., if \(k>0\) and \(k'>0\)), C is modified by invoking:

  • \(C.\mathbf{set}(l'_{i},u'_{i})\), for \(i=0\) up to \(i=k'-1\).
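Disjunction then reduces to one range-setting call per packet of the second code. The sketch below assumes a compact code is a sorted Python list of half-open `(l, u)` tuples; `set_range` and `or_code` are hypothetical names, and both return new lists instead of modifying C in place.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def set_range(code: Code, n: int, m: int) -> Code:
    """Set bits n (inclusive) to m (exclusive), merging touching packets."""
    if m <= n:
        return list(code)
    result: Code = []
    lo, hi, placed = n, m, False
    for l, u in code:
        if u < lo:                      # strictly below the new packet
            result.append((l, u))
        elif hi < l:                    # strictly above the new packet
            if not placed:
                result.append((lo, hi))
                placed = True
            result.append((l, u))
        else:                           # overlaps or touches: absorb it
            lo, hi = min(lo, l), max(hi, u)
    if not placed:
        result.append((lo, hi))
    return result


def or_code(c: Code, c2: Code) -> Code:
    """Disjunction: set in c every bit lying within a packet of c2."""
    result = list(c)
    for l, u in c2:
        result = set_range(result, l, u)
    return result


print(or_code([(4, 6), (12, 16), (18, 23)], [(5, 14), (20, 30)]))
```

The two special cases of the text (either code empty) need no separate branch here: the loop does nothing for an empty second code, and setting ranges on an empty first code simply copies the second.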

A.3.3 Negation

Since a bit vector is open-ended, we may define its negation only up to a length at least greater than its highest 1-bit position. This operation is denoted as \(C.\mathbf{not}(n)\). Thus, \(\{\}.\mathbf{not}(n)\) is undefined for any \(n\geq 0\).

Otherwise, for a non-empty code \(C=\{~\langle l_{0},u_{0}\rangle ,\ldots ,\langle l_{k-1},u_{k-1}\rangle ~\}\) and \(n\geq u_{k-1}\), \(C.\mathbf{not}(n)\) modifies C to become:

$$\{~\langle{0,l_{0}}\rangle, \ldots, \langle{u_{i},l_{i+1}}\rangle, \ldots, \langle{u_{k-1},n}\rangle~\}. $$

Again, following our convention, if \(l_{0}=0\), \(\langle 0,l_{0}\rangle\) being undefinable, the first element of \(C.\mathbf{not}(n)\) is \(\langle u_{0},l_{1}\rangle\). Similarly, if \(n=u_{k-1}\), then \(\langle u_{k-1},n\rangle\) is undefinable and the last element of \(C.\mathbf{not}(n)\) is \(\langle u_{k-2},l_{k-1}\rangle\).
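Negation amounts to shifting the list of packet bounds by one position. The following sketch assumes a compact code is a sorted Python list of half-open `(l, u)` tuples; `not_code` is a hypothetical name, it returns a new list rather than modifying C, and raising `ValueError` is merely one way to render the cases the text leaves undefined.

```python
Code = list[tuple[int, int]]  # assumed representation: sorted (l, u) packets


def not_code(code: Code, n: int) -> Code:
    """Negation of a non-empty code up to length n, with n >= u_{k-1}."""
    if not code:
        raise ValueError("negation of the empty code is left undefined")
    if n < code[-1][1]:
        raise ValueError("n must be at least the last upper index")
    # Flatten to 0, l_0, u_0, l_1, u_1, ..., u_{k-1}, n and re-pair the bounds
    # as (0, l_0), (u_0, l_1), ..., (u_{k-1}, n).
    bounds = [0] + [bound for pair in code for bound in pair] + [n]
    pairs = zip(bounds[0::2], bounds[1::2])
    # Drop "undefinable" packets, i.e. empty ones such as (0, 0) or (n, n).
    return [(l, u) for l, u in pairs if l < u]


code = [(4, 6), (12, 16), (18, 23)]
print(not_code(code, 25))  # [(0, 4), (6, 12), (16, 18), (23, 25)]
```

Filtering out empty pairs implements the convention that a packet with undefinable bounds is simply omitted.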

A.4 Implementation considerations

We need to come up with a data structure for representing a compact code that would enable retaining maximal efficiency in the bit setting and unsetting operations, and hence in the boolean operations that rely on them.

The most frequently used operations on such a data structure C for an integer n are:

  • \(C.\mathbf{packet}(n)\), for n inside a packet in C, returning that packet number;

  • \(C.\mathbf{prev}(n)\), for n between packets in C, returning the upper index of the packet preceding n;

  • \(C.\mathbf{next}(n)\), for n between packets in C, returning the lower index of the packet following n;

  • adding/removing a packet.

Because the elements of a code sequence are ranges rather than integers, one cannot expect hashed \(\mathcal {O}(1)\) time access to find out whether a given integer lies within or between packets. So structures such as defined by the Java classes java.util.HashSet or java.util.LinkedHashSet cannot be used.

In order to make these operations at most \(\mathcal {O}(\log (k))\) time for a compact code of size k, one way is to represent a compact code as a balanced binary tree of pairs of bit position spans 〈l i ,u i 〉, taking advantage of the ordering imposed by condition (1).

Thus, the java.util.TreeSet class looks like a convenient choice, since it offers the required data structure properties in addition to defining methods such as first, last, higher, lower, add, remove, etc., as well as an order-respecting iterator.

On the other hand, the java.util.TreeSet is missing a replace method—which is critically needed for setting and unsetting bits. It is also missing an insert method that splices a new sequence of packet pairs into an existing compact code, which may also be often used. One must resort to several add/remove method invocations to replace or insert elements, which incur new searches (and possible intermediate rebalancing of the tree) each time. This is a waste since replacing and inserting can be done in \(\mathcal {O}(1)\) time when having already found the required elements, and only one (final) \(\mathcal {O}(\log (k))\) tree rebalancing.

Hence, rather than relying on the ready-to-use java.util.TreeSet class, it may be more beneficial to implement a new specific class for a compact code as a doubly linked list and a balanced binary tree adapted for the specific nature of its pair elements. This would make transparent the double links of each pair element and ease replacement and insertion.Footnote 30
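The logarithmic lookup that the ordering of Condition (1) makes possible can be sketched with a binary search over the packets' lower bounds. This is only an illustration under assumptions: it keeps the bounds in two parallel sorted Python lists instead of the balanced tree discussed above, and `locate` is a hypothetical name; `bisect_right` is the standard-library binary search.

```python
from bisect import bisect_right


def locate(lowers: list[int], uppers: list[int], n: int):
    """Classify position n in O(log k) time.

    lowers[i] and uppers[i] hold l_i and u_i of the i-th packet, in order.
    Returns ("within", i) if l_i <= n < u_i, and otherwise
    ("between", prev, next), where None stands for an undefined bound.
    """
    i = bisect_right(lowers, n) - 1     # last packet whose lower bound is <= n
    if i >= 0 and n < uppers[i]:
        return ("within", i)
    prev_upper = uppers[i] if i >= 0 else None
    next_lower = lowers[i + 1] if i + 1 < len(lowers) else None
    return ("between", prev_upper, next_lower)


lowers, uppers = [4, 12, 18], [6, 16, 23]
print(locate(lowers, uppers, 5))    # ('within', 0)
print(locate(lowers, uppers, 6))    # ('between', 6, 12)
```

With plain lists only the lookups are logarithmic, since inserting or deleting a packet shifts elements in linear time; that is the cost the tree-plus-linked-list structure suggested above is meant to avoid.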

A.5 Discussion

A similar data structure was proposed by researchers in data and knowledge bases in (Agrawal et al. 1989). However, the authors did not use that representation for lattice operations as we do here. Instead, they focused on using it for obtaining more compact range-sequence codes for the transitive closure of the “is-a” relation of a taxonomy. That representation is equivalent to the one we specify here and to the one in (Aït-Kaci et al. 1989). Contrary to (Aït-Kaci et al. 1989), they define an element’s code as the union of index ranges from the post-order arrangement of a spanning tree of the “is-parent-of” relation of a taxonomy. Each concept in the taxonomy (i.e., each element in the poset) of post-order index j is then encoded as the interval \(\langle i,j\rangle\) where i is the smallest post-order index of all its descendants. Although they did not do it, it is easy to show that their representation is equivalent to bit vectors. But they did not specify lattice operations on their data structures as we do for ours in this document. What they focused on was minimizing the total number of packets in range codes. In order to do so, they suggest generating codes based on the “optimal” spanning tree for generating the most compact set of codes. The data structure and algorithm for what they call “compressed transitive closure” do not maintain dynamic interval consistency caused by potential adjacency as we do here (see Note 31). Although they do cite (Aït-Kaci et al. 1989), Agrawal et al. do so only in the conclusion as they had just noticed its publication. They suggest that their approach and that presented in Aït-Kaci et al. (1989) might be combined for processing large taxonomies. As far as we know, no follow-up on this suggestion was carried out.

It is clear how the work of (Agrawal et al. 1989), although orthogonal to ours, could be adapted to our needs as well in order to improve its space consumption. However, it is to be noted that their code-compaction method requires a topologically ordered poset. For very large taxonomies (over 1 million elements) the price of sorting the taxonomy might be worth spending only for once-for-all preprocessing prior to query time (Aït-Kaci and Amir 2013; Amir and Aït-Kaci 2013). Also, the question of incrementality is not addressed.

Finally, although this work was motivated by the need for a compact representation of binary codes encoding a partial order, it is evident that the data structure, and operations on it, specified in this appendix can represent any set of integers (or integer-indexed set) seen as a sequence of intervals. Set intersection is realized as the conjunction described in Section A.3.1; set union as the disjunction described in Section A.3.2; and, set complementation as the negation described in Section A.3.3. Therefore, it can readily be used for this purpose as well.


Cite this article

Aït-Kaci, H., Amir, S. Classifying and querying very large taxonomies with bit-vector encoding. J Intell Inf Syst 48, 1–25 (2017). https://doi.org/10.1007/s10844-015-0383-2


  • DOI: https://doi.org/10.1007/s10844-015-0383-2
