Abstract
In this article, we address the question of how efficiently Semantic Web (SW) reasoners perform in processing (classifying and querying) taxonomies of enormous size, and whether it is possible to improve on existing implementations. We use a bit-vector encoding technique to implement taxonomic concept classification and Boolean-query answering. We describe the technique we have used, which achieves high performance, and discuss implementation issues. We compare the performance of our implementation with those of the best existing SW reasoning systems over several very large taxonomies, under exactly the same conditions, for so-called TBox reasoning. The results show that our system is among the best for concept classification and several orders of magnitude more efficient in terms of response time for query answering. We present these results in detail and comment on them. We also discuss pragmatic issues such as cycle detection and decoding.
Notes
In Semantic Web lingo, a Knowledge Base (KB) is defined as a formal ontology consisting of two parts (or “boxes”): (1) a Terminological Box (abbreviated as TBox); and, (2) an Assertional Box (abbreviated as ABox). The TBox contains the formal axioms that define the structure and semantic properties of the actual instance data; the instance data themselves constitute the ABox. In Database lingo, the TBox corresponds to the schema and the ABox to the actual data.
Description Logics in the \(\mathcal {EL}\) family are weaker logics that provide existential role restrictions (∃r.C) but no universal role restrictions (∀r.C).
To the best of our knowledge, this is the best known bound as of 2011. However, these algorithms are not implementable in practice due to the prohibitive size of the constants involved. For more recent work on parallelizing Strassen’s algorithm, see (Ballard et al. 2014). This, however, requires special hardware (GPGPUs).
op. cit., pages 125–126.
See Section 4.2.
See Section 3.3 for a discussion concerning this point.
We use “ &” to denote “and,” and “ |” to denote “or.”
We implemented such a facility—see Appendix.
The ordering on bit-vector codes is simply defined as \(c_1 \le c_2\) iff \(c_1 = c_1~\&~c_2\).
In the same manner as we noticed SnoRocket does.
Recall that a bit vector is written with its lowest bit to the right.
In what follows, we shall use the “dot” notation of object-oriented methods to denote all functions or operations on codes.
Note that i≠j necessarily implies that i<j (by Condition (1), and since n<m).
Actually, java.util.TreeSet does maintain a doubly-linked list of its elements in order to support its two ordered iterators (ascending and descending). But this structure is not made public, and one cannot splice in new elements from a given found element. It is, however, a simple matter to modify the source code of java.util.TreeSet.java and adapt it to what is needed.
In fact, they consider it only as a possible a posteriori optimization, one that would make their optimal spanning-tree finding algorithm incorrect if applied incrementally while it executes.
References
Agrawal, R., Borgida, A., & Jagadish, H. V. (1989). Efficient management of transitive relationships in large data and knowledge bases. In J. Clifford, B. G. Lindsay, & D. Maier (Eds.), Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD Record, Vol. 18(2), pp. 253–262). Portland: ACM. [Available online http://dbs.informatik.uni-halle.de/Lehre/DBS_IIa_SS02/p253-agrawal.pdf].
Aït-Kaci, H. (1991). Warren’s Abstract Machine: A Tutorial Reconstruction. Cambridge: The MIT Press. [Available online http://wambook.sourceforge.net/].
Aït-Kaci, H. (2007). Data models as constraint systems—A key to the Semantic Web. Constraint Processsing Letters, 1(1), 33–88. [Available online http://www.cs.brown.edu/people/pvh/CPL/Papers/v1/hak.pdf].
Aït-Kaci, H., & Amir, S. (2013). Classifying and querying very large taxonomies—a comparative study to the best of our knowledge. \(\mathcal {CEDAR}\) Technical Report Number 2, \(\mathcal {CEDAR}\) Project, LIRIS, Département d’Informatique, Université Claude Bernard Lyon 1, Villeurbanne, France. [Available online http://cedar.liris.cnrs.fr/papers/ctr2.pdf].
Aït-Kaci, H., Boyer, R., Lincoln, P., & Nasr, R. (1989). Efficient implementation of lattice operations. ACM Transactions on Programming Languages and Systems, 11 (1), 115–146. [Available online http://hassan-ait-kaci.net/pdf/encoding-toplas-89.pdf].
Aït-Kaci, H., & Di Cosmo, R. (1993). Compiling order-sorted feature term unification. PRL Technical Note 7, Digital Paris Research Lab, Rueil-Malmaison, France. [Available online http://hassan-ait-kaci.net/pdf/PRL-TN-7.pdf].
Amir, S., & Aït-Kaci, H. (2013). CEDAR: a fast taxonomic reasoner based on lattice operations. In Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia (pp. 9–12).
Amir, S., & Aït-Kaci, H. (2014a). CEDAR: efficient reasoning for the semantic web. In Tenth International Conference on Signal-Image Technology and Internet-Based Systems, SITIS 2014 (pp. 157–163), Marrakech, Morocco.
Amir, S., & Aït-Kaci, H. (2014b). Design and implementation of an efficient semantic web reasoner. \(\mathcal {CEDAR}\) Technical Report Number 12, \(\mathcal {CEDAR}\) Project, LIRIS, Département d’Informatique, Université Claude Bernard Lyon 1, Villeurbanne, France. [Available online http://cedar.liris.cnrs.fr/papers/ctr12.pdf].
Baader, F., Brandt, S., & Lutz, C. (2005). Pushing the \(\mathcal {EL}\) envelope. In L. P. Kaelbling, & A. Saffiotti (Eds.), Proceedings of the 19th International Joint Conference on Artificial Intelligence (pp. 364–369). Edinburgh: IJCAI’05, Morgan Kaufmann Publishers. [Available online http://www.ijcai.org/papers/0372.pdf]
Baader, F., Lutz, C., & Suntisrivaraporn, B. (2006). CEL—a polynomial-time reasoner for life science ontologies. In U. Furbach, & N. Shankar (Eds.), Proceedings of the 3rd international joint conference on automated reasoning, (Vol. 4130 pp. 287–291). Seattle: IJCAR’06, Springer-Verlag LNAI. [Available online http://lat.inf.tu-dresden.de/research/papers/2006/BaaLutSun-IJCAR-06.pdf].
Ballard, G., Demmel, J., Holtz, O., & Schwartz, O. (2014). Communication costs of Strassen’s matrix multiplication. Communications of the ACM, 57(2), 107–114. [Available online https://aspire.eecs.berkeley.edu/wp/wp-content/uploads/2014/02/Communication-Costs-of-Strassen%E2%80%99s-Matrix-Multiplication.pdf].
Coppersmith, D., & Winograd, S. (1990). Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3), 251–280. [Available online http://www.sciencedirect.com/science/article/pii/S0747717108800132].
Cousot, P. (1996). Abstract interpretation. ACM Computing Surveys—Symposium on Models of Programming Languages and Computation, 28(2), 324–328. Tutorial summary: [Available online http://www.di.ens.fr/~cousot/AI/IntroAbsInt.html].
Fikes, R., Hayes, P., & Horrocks. I. (2004). OWL-QL—a language for deductive query answering on the Semantic Web. Journal of Web Semantics, 2(1), 19–29. [Available online http://www.sciencedirect.com/science/article/pii/S1570826804000137].
Fischer, M. J., & Meyer, A. R. (1971). Boolean matrix multiplication and transitive closure. In Proceedings of the 12th annual symposium on switching and automata theory, SWAT’71, (pp. 129–131), Washington: IEEE Computer Society. [Available online http://rjlipton.files.wordpress.com/2009/10/matrix1971.pdf].
Haarslev, V., Hidde, K., Möller, R., & Wessel, M. (2011). The RacerPro knowledge representation and reasoning system. Semantic Web Journal, 1, 1–11. [Available online http://www.semantic-web-journal.net/sites/default/files/swj109_3.pdf].
Haarslev, V., & Möller, R. (2001). RACER system description. In R. Gore, A. Leitsch, & T. Nipkow (Eds.), Proceedings of the 1st international joint conference on automated reasoning (pp. 701–706). Siena: IJCAR’01, Springer-Verlag. [Available online https://www.ifis.uni-luebeck.de/~moeller/papers/2001/HaMo01a.pdf].
Hoare, C. A. R. (1961). Algorithm 63: Partition; Algorithm 64: Quicksort. Communications of the ACM, 4(7), 321. [Available online http://comjnl.oxfordjournals.org/content/5/1/10.full.pdf].
Horrocks, I., & Sattler. U. (2007). A tableau decision procedure for \(\mathcal {SHOIQ}\). Journal of Automated Reasoning, 39 (3), 249–276. [Available online http://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2007/HoSa07a.pdf].
Kazakov, Y. (2009). Consequence-driven reasoning for horn \(\mathcal {SHIQ}\) ontologies. In C. Boutilier (Ed.), Proceedings of the 21st international conference on artificial intelligence (pp. 2040–2045). Pasadena: IJCAI’09, Association for the Advancement of Artificial Intelligence. [Available online http://ijcai.org/papers09/Papers/IJCAI09-336.pdf].
Kazakov, Y., Krötzsch, M., & Simančík, F. (2011). Unchain my \(\mathcal {EL}\) reasoner. In R. Rosati, S. Rudolph, & M. Zakharyaschev (Eds.), Proceedings of the 24th international workshop on description logics. Barcelona: DL’11, CEUR Workshop Proceedings. [Available online http://ceur-ws.org/Vol-745/paper_54.pdf].
Lawley, M. J., & Bousquet, C. (2010). Fast classification in Protégé: Snorocket as an OWL 2 EL reasoner. In T. Meyer, M. A. Orgun, & K. Taylor (Eds.), Proceedings of the 2nd Australasian ontology workshop: Advances in ontologies (pp. 45–50). Adelaide: AOW’10, ACS. [Available online http://krr.meraka.org.za/~aow2010/Lawley-etal.pdf].
Manna, Z., & Waldinger, R. (1991). Fundamentals of deductive program synthesis. In A. Apostolico, & Z. Galil (Eds.), Combinatorial algorithms on words, NATO ISI Series: Springer. [Available online http://lara.epfl.ch/~kuncak/t/MannaWaldingerTSE.pdf].
Motik, B., Shearer, R., & Horrocks, I. (2009). Hypertableau reasoning for description logics. Journal of Artificial Intelligence Research, 36(1), 165–228. [Available online https://www.jair.org/media/2811/live-2811-4689-jair.pdf].
Shearer, R., Motik, B., & Horrocks, I. (2008). HermiT: A highly-efficient OWL reasoner. In U. Sattler, & C. Dolbear (Eds.), Proceedings of the 5th international workshop on OWL experiences and directions. Karlsruhe: OWLED’08, CEUR Workshop Proceedings. [Available online http://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2008/ShMH08b.pdf].
Sirin, E., Parsia, B., Grau, B. C., Kalyanpur, A., & Katz, Y. (2007). Pellet: A practical OWL-DL reasoner. Journal of Web Semantics, 5(2), 51–53. This is a summary; full paper: [Available online http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.5433&rep=rep1&type=pdf].
Stothers, A. (2010). On the complexity of matrix multiplication. PhD thesis, University of Edinburgh, Edinburgh. [Available online http://www.maths.ed.ac.uk/assets/files/pgrexternalfiles/theses/probability/stothers.pdf].
Strassen, V. (1969). Gaussian elimination is not optimal. Numerische Mathematik, 13, 354–356.
Thomas, E., Pan, J. Z., & Ren, Y. (2010). TrOWL: Tractable OWL 2 reasoning infrastructure. In L. Aroyo, G. Antoniou, E. Hyvnen, A. ten Teije, H. Stuckenschmidt, L. Cabral, & T. Tudorache (Eds.), Proceedings of the 7th extended semantic web conference (pp. 431–435). Heraklion: ESWC’10, Springer. [Available online http://homepages.abdn.ac.uk/jeff.z.pan/pages/pub/TPR2010.pdf].
Tsarkov, D., & Horrocks, I. (2006). FaCT++ description logic reasoner: System description. In U. Furbach, & N. Shankar (Eds.), Proceedings of the 3rd international joint conference on automated reasoning (pp. 292–297). Seattle: IJCAR’06, Springer. [Available online http://www.cs.ox.ac.uk/Ian.Horrocks/Publications/download/2006/TsHo06a.pdf].
Warren Jr, H. S. (1975). A modification of Warshall’s algorithm for the transitive closure of binary relations. Communications of the ACM, 18(4), 218–220. [Available online http://dl.acm.org/citation.cfm?id=360746].
Warshall, S. (1962). A theorem on Boolean matrices. Journal of the ACM, 9(1), 11–12.
Williams, V. V. (2011). Breaking the Coppersmith-Winograd barrier. Manuscript, University of California at Berkeley and Stanford University. [Available online http://www.cs.rit.edu/~rlc/Courses/Algorithms/Papers/matrixMult.pdf].
Acknowledgments
The authors wish to thank Prof. Mohand-Saïd Hacid and the anonymous referees for constructive feedback.
Conflict of interests
The authors declare that they have no conflict of interest.
Funding
This work was carried out as part of the 𝓒𝓔𝓓𝓐𝓡 Project (Constraint Event-Driven Automated Reasoning) under the Agence Nationale de la Recherche (ANR) Chair of Excellence grant No. ANR-12-CHEX-0003-01.
Appendix
In Section A, we give an overview of an implementation specification for representing very large bit vectors, reducing memory-space consumption while retaining efficient operations. In our experiments, this alternative code representation was used only for saving an encoded taxonomy on disk and reloading it as a pre-encoded order. But for taxonomies of even larger size than those used in our experiments, it could be used for lattice operations as well.
A Compact codes
While the foregoing sections present a method for encoding elements of a partially ordered set based on transitive closure, the data structure it relies on is that of a binary word—i.e., a bit vector. With such a structure, all boolean operations—and, or, not—are thus very efficient. This representation also eases computation of the transitive closure since setting a bit on or off is trivially accommodated.
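For instance (a minimal sketch; the class and helper names are ours, not those of the actual implementation), arbitrary-precision bit-vector codes can be manipulated with java.math.BigInteger, and the order test c1 ≤ c2 iff c1 = c1 & c2 then costs a single AND plus one comparison:

```java
import java.math.BigInteger;

public class BitCodeDemo {
    // Build a code with the given bit positions set (positions are 0-based,
    // lowest bit "to the right" as in the text).
    static BigInteger code(int... bits) {
        BigInteger c = BigInteger.ZERO;
        for (int b : bits) c = c.setBit(b);
        return c;
    }

    // c1 <= c2 in the encoded order iff c1 = c1 & c2,
    // i.e., every bit of c1 is also set in c2.
    static boolean leq(BigInteger c1, BigInteger c2) {
        return c1.and(c2).equals(c1);
    }
}
```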
However, while this representation is convenient and time-efficient for relatively small posets of the order of a few hundred elements, it quickly becomes space-inefficient for large posets of hundreds of thousands, or millions of elements.
In what follows, we define an alternative representation of indexed bit sets that offers the advantage of being more compact than bit vectors while retaining time-efficient boolean and bit-setting operations. It is also the format we use to save/load encoded taxonomies on/from disk. In Section A.1, the basic data structure is defined. In Section A.2, bit setting and unsetting operations are defined. In Section A.3, the three boolean operations—conjunction, disjunction, and negation—are defined. In Section A.4, some implementation considerations are discussed.
A.1 Bit code representation
The idea is intuitively simple. It consists of representing a bit vector as a finite array of k (\(k\in \mathbb {N}\)) pairs of integer indices 〈l_i,u_i〉, for \(i=0,\dots ,k-1\), such that, for all indices \(i=0,\dots ,k-2\):
$$0 \le l_{i} < u_{i} < l_{i+1}. $$(1)
We shall refer to such a sequence of k pairs, \(k\in \mathbb {N}\), { 〈l_i,u_i〉 | i=0,…,k−1 }, as a compact code. For k=0, this is written as the empty sequence {}.

Given a compact code representation of a bit vector V, each pair 〈l,u〉 represents a maximal contiguous sequence of 1’s (hereafter referred to as a “packet”) in V. Thus, the i-th packet of a bit vector is represented as the pair of indices 〈l_i,u_i〉 such that l_i is the index of the lowest bit in the packet, and u_i is the index of the first 0-bit following the packet.

For example, the bit vector 0011111001111000000110000 corresponds to the compact code (i.e., sequence of packet pairs):Footnote 27 { 〈4,6〉, 〈12,16〉, 〈18,23〉 }.

The empty bit vector (containing all 0’s) is represented as the empty sequence {}. The length of a bit vector represented by a compact code of k packets is u_{k−1} (the upper bound of its last packet). The size of a compact code C of k packets is k (i.e., its number of packets).
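The packet decomposition above can be checked mechanically. The following sketch (class and method names are ours, not the paper's implementation) converts between an explicit java.util.BitSet and its compact code:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class CompactCodec {
    // Encode a bit set as its list of packets <l, u>, lowest packet first:
    // l is the first bit of a maximal run of 1's, u the first 0-bit after it.
    static List<int[]> encode(BitSet bits) {
        List<int[]> packets = new ArrayList<>();
        int l = bits.nextSetBit(0);
        while (l >= 0) {
            int u = bits.nextClearBit(l);   // first 0-bit following the packet
            packets.add(new int[] { l, u });
            l = bits.nextSetBit(u);         // start of the next packet, if any
        }
        return packets;
    }

    // Decode a packet list back to an explicit bit set.
    static BitSet decode(List<int[]> packets) {
        BitSet bits = new BitSet();
        for (int[] p : packets) bits.set(p[0], p[1]);   // [l, u) set to 1
        return bits;
    }
}
```

Encoding the example bit vector above indeed yields the three packets 〈4,6〉, 〈12,16〉, 〈18,23〉.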
A.2 Bit operations
Let C = { 〈l_i,u_i〉 | i=0,…,k−1 } be a compact code of k packets (k>0). Given a number \(n\in \mathbb {N}\), we say that:Footnote 28
- n is within a packet of C iff ∃i∈[0,k−1] such that l_i ≤ n < u_i, in which case we shall write C.packet(n) = i;

- n is between packets of C iff one of the following three statements holds:

  1. n < l_0; or,

  2. u_{k−1} ≤ n; or,

  3. ∃i∈[0,k−2] such that u_i ≤ n < l_{i+1}.
If a number n is between packets of a compact code C of size k, we define two functions C.prev(n) and C.next(n) for each of the three possible respective cases above as follows (where the symbol ‘?’ means “undefined”):

1. \(C.\mathbf {prev}(n) \overset {def}{=\!\!=}\ \texttt {?}\) and \(C.\mathbf {next}(n) \overset {def}{=\!\!=}\ l_{0}\);

2. \(C.\mathbf {prev}(n) \overset {def}{=\!\!=}\ u_{k-1}\) and \(C.\mathbf {next}(n) \overset {def}{=\!\!=}\ \texttt {?}\);

3. \(C.\mathbf {prev}(n) \overset {def}{=\!\!=}\ u_{i}\) and \(C.\mathbf {next}(n) \overset {def}{=\!\!=}\ l_{i+1}\).
For such a number n, we say that:

- n is left-adjacent in C if n = C.next(n) − 1;

- n is right-adjacent in C if n = C.prev(n);

- n is adjacent in C if it is both left-adjacent and right-adjacent in C.
Note that if n is between packets and adjacent, this necessarily means that the two packets on each side are only separated by a single 0-bit (the bit in position n in the denoted bit vector).
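The positional predicates just defined can be sketched directly (a naive linear scan for clarity; an actual implementation would use a logarithmic search, as discussed later in the appendix). Class and method names are ours:

```java
public class PacketOps {
    final int[][] p;   // packets <l_i, u_i>, ordered and non-adjacent
    PacketOps(int[][] packets) { this.p = packets; }

    // Index i with l_i <= n < u_i, or -1 if n is between packets.
    int packet(int n) {
        for (int i = 0; i < p.length; i++)
            if (p[i][0] <= n && n < p[i][1]) return i;
        return -1;
    }

    // For n between packets: u_i of the packet preceding n (-1 if undefined).
    int prev(int n) {
        int r = -1;
        for (int[] q : p) if (q[1] <= n) r = q[1];
        return r;
    }

    // For n between packets: l_j of the packet following n (-1 if undefined).
    int next(int n) {
        for (int[] q : p) if (n < q[0]) return q[0];
        return -1;
    }

    boolean leftAdjacent(int n)  { return n == next(n) - 1; }
    boolean rightAdjacent(int n) { return n == prev(n); }
    boolean adjacent(int n)      { return leftAdjacent(n) && rightAdjacent(n); }
}
```

On the code { 〈0,3〉,〈4,6〉 }, position 3 is adjacent: it is the single 0-bit separating the two packets.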
N.B.: In all the compact code expressions to follow, we use the implicit convention that a packet with undefinable bounds is simply omitted. Thus, we will always use the notation { 〈l_0,u_0〉, …, 〈l_{k−1},u_{k−1}〉 } to denote a compact code, where k ≥ 0, up to the above convention, regardless of the actual number of packets. For example, for k=0 this corresponds to the empty code {}, and for k=1 to the single-packet code { 〈l_0,u_0〉 }.
We define the following bit-setting operations on C. These methods operate “in place,” modifying the code C that invokes them.

- C.set(n), for \(n\in \mathbb {N}\): sets the n-th bit of the bit vector denoted by C to 1.

- C.set(n,m), for \(n,m\in \mathbb {N}\), n<m: sets to 1 all the bits from position n (inclusive) to position m (exclusive) of the bit vector denoted by C.

- C.unset(n), for \(n\in \mathbb {N}\): sets the n-th bit of the bit vector denoted by C to 0.

- C.unset(n,m), for \(n,m\in \mathbb {N}\), n<m: sets to 0 all the bits from position n (inclusive) to position m (exclusive) of the bit vector denoted by C.

For m ≤ n, both C.set(n,m) and C.unset(n,m) are no-ops; i.e., they leave C unchanged. Since set(n) is equivalent to set(n,n+1) (and similarly for unset), we only give the methods for set(n,m) and unset(n,m).
A.2.1 Bit setting
There are four cases to consider; performing C.set(n,m) modifies C as follows.

1. If n is within a packet in C (say, C.packet(n) = i) and m is within a packet in C (say, C.packet(m) = j), then, if i=j, C.set(n,m) leaves C unchanged. Else (if i<j),Footnote 29 C becomes:
$$\{~\ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}. $$

2. If n is within a packet in C (say, C.packet(n) = i) and m is between packets in C, then C becomes:
$$ \left\{ \begin{array}{l} \{~ \ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}\\ ~~~\text{if}~m~\text{is~left-adjacent~in}~C~\text{and}~C.\mathbf{next}(m) = l_{j}; \\\\ \{~\ldots, \langle{l_{i},m}\rangle, \ldots~\}\\ ~~~\text{otherwise}. \end{array} \right. $$(2)

3. If n is between packets in C and m is within a packet in C (say, C.packet(m) = j), then C becomes:
$$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}\\ ~~~\text{if}~n~\text{is~right-adjacent~in}~C~\text{and}~C.\mathbf{prev}(n) = u_{i}; \\\\ \{~\ldots, \langle{n,u_{j}}\rangle, \ldots~\}\\ ~~~\text{otherwise}. \end{array} \right. $$(3)

4. If both n and m are between packets in C, then C becomes:
$$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i},u_{j}}\rangle, \ldots~\}\\ ~~~\text{if}~n~\text{is~right-adjacent~in}~C~\text{and}~C.\mathbf{prev}(n) = u_{i},~\text{and}\\ ~~~\text{if}~m~\text{is~left-adjacent~in}~C~\text{and}~C.\mathbf{next}(m) = l_{j}; \\\\ \{~\ldots, \langle{l_{i},m}\rangle, \ldots~\}\\ ~~~\text{if}~n~\text{is~right-adjacent~in}~C~\text{and}~C.\mathbf{prev}(n) = u_{i},~\text{and} \\ ~~~\text{if}~m~\text{is~not~left-adjacent~in}~C; \\\\ \{~\ldots, \langle{n,u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n~\text{is~not~right-adjacent~in}~C,~\text{and}\\ ~~~\text{if}~m~\text{is~left-adjacent~in}~C~\text{and}~C.\mathbf{next}(m) = l_{j}; \\\\ \{~\ldots, \langle{n,m}\rangle, \ldots~\}\\ ~~~\text{otherwise}. \end{array} \right. $$(4)
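As a sanity check for this four-case analysis, the net effect of C.set(n,m) must coincide with decoding to an explicit bit set, setting bits n through m−1, and re-encoding. The following reference oracle (our sketch, not the in-place algorithm itself) makes each case executable:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SetOracle {
    // Ground truth for C.set(n, m): decode, set [n, m), re-encode.
    // Returns a fresh packet list; the input code is left untouched.
    static List<int[]> set(List<int[]> code, int n, int m) {
        BitSet bits = new BitSet();
        for (int[] p : code) bits.set(p[0], p[1]);
        if (n < m) bits.set(n, m);          // no-op when m <= n, as specified
        List<int[]> out = new ArrayList<>();
        int l = bits.nextSetBit(0);
        while (l >= 0) {                    // re-extract maximal runs of 1's
            int u = bits.nextClearBit(l);
            out.add(new int[] { l, u });
            l = bits.nextSetBit(u);
        }
        return out;
    }
}
```

For example, on { 〈4,6〉,〈12,16〉 }, set(6,12) exercises case 4 with both adjacency conditions holding, merging everything into the single packet 〈4,16〉, while set(8,10) exercises the “otherwise” branch and inserts a fresh middle packet.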
A.2.2 Bit unsetting
Here again, there are four cases to consider; performing C.unset(n,m) modifies C as follows.

1. If both n and m are between packets in C: if C.prev(n) = C.prev(m) (or, equivalently, if C.next(n) = C.next(m)), then C.unset(n,m) leaves C unchanged; else, C becomes:
$$\{~\ldots, \langle{l_{i},C.\mathbf{prev}(n)}\rangle,\langle{C.\mathbf{next}(m),u_{j}}\rangle, \ldots~\}. $$

2. If n is within a packet in C (say, C.packet(n) = i) and m is between packets in C, then C becomes:
$$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i-1},u_{i-1}}\rangle, \langle{C.\mathbf{next}(m),u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n=l_{i},~\text{where}~C.\mathbf{next}(m)=l_{j}; \\ \\ \{~\ldots, \langle{l_{i},n}\rangle, \langle{C.\mathbf{next}(m),u_{j}}\rangle, \ldots~\} \\~~~ \text{else}~(\textit{i.e.},~\text{if}~n>l_{i}),~\text{where}~C.\mathbf{next}(m)=l_{j}. \end{array} \right. $$(5)

3. If n is between packets in C and m is within a packet in C (say, C.packet(m) = j), then C becomes:
$$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i},C.\mathbf{prev}(n)}\rangle, \langle{l_{j+1},u_{j+1}}\rangle, \ldots~\} \\ ~~~\text{if}~m=u_{j}-1,~\text{where}~C.\mathbf{prev}(n)=u_{i}; \\\\ \{~\ldots, \langle{l_{i},C.\mathbf{prev}(n)}\rangle, \langle{m,u_{j}}\rangle, \ldots~\} \\ ~~~\text{else}~(\textit{i.e.},~\text{if}~m<u_{j}-1),~\text{where}~C.\mathbf{prev}(n)=u_{i}. \end{array} \right. $$(6)

4. If n is within a packet (say, C.packet(n) = i) and m is within a packet (say, C.packet(m) = j) in C, then C becomes:
$$ \left\{ \begin{array}{l} \{~\ldots, \langle{l_{i-1},u_{i-1}}\rangle, \langle{l_{j+1},u_{j+1}}\rangle, \ldots~\} \\ ~~~\text{if}~n=l_{i},~\text{and} \\ ~~~\text{if}~m=u_{j}-1; \\\\ \{~\ldots, \langle{l_{i-1},u_{i-1}}\rangle, \langle{m,u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n=l_{i},~\text{and} \\ ~~~\text{if}~m<u_{j}-1; \\\\ \{~\ldots, \langle{l_{i},n}\rangle, \langle{l_{j+1},u_{j+1}}\rangle, \ldots~\} \\ ~~~\text{if}~n>l_{i},~\text{and} \\ ~~~\text{if}~m=u_{j}-1; \\\\ \{~\ldots, \langle{l_{i},n}\rangle, \langle{m,u_{j}}\rangle, \ldots~\} \\ ~~~\text{if}~n>l_{i},~\text{and} \\ ~~~\text{if}~m<u_{j}-1. \end{array} \right. $$(7)
A.3 Boolean operations
Let:
$$C = \{~\langle{l_{0},u_{0}}\rangle, \ldots, \langle{l_{k-1},u_{k-1}}\rangle~\} $$
and:
$$C' = \{~\langle{l'_{0},u'_{0}}\rangle, \ldots, \langle{l'_{k'-1},u'_{k'-1}}\rangle~\} $$
be two compact code pair sequences, with k ≥ 0 and k′ ≥ 0.
A.3.1 Conjunction
Invoking C.and(C′) modifies C according to C′ by unsetting all the bits in C that are between packets in C′, leaving C′ unchanged.

If C = {}, then C is left unchanged; else, if C′ = {}, then C becomes {}.

Else (i.e., if k>0 and k′>0), C is modified by invoking:

- C.unset(0, l′_0); and,

- C.unset(u′_i, l′_{i+1}), for i=0 up to i=k′−2; and,

- C.unset(u′_{k′−1}, u_{k−1}).

Note that, in practice, when proceeding in the above order, as soon as the first argument of any unset(…) is greater than or equal to u_{k−1}, there is no need to perform that unsetting, nor to proceed any further.
A.3.2 Disjunction
Invoking C.or(C′) modifies C according to C′ by setting all the bits in C that are within packets in C′, leaving C′ unchanged.

If C′ = {}, then C is left unchanged; else, if C = {}, then C becomes (a copy of) C′.

Else (i.e., if k>0 and k′>0), C is modified by invoking:

- C.set(l′_i, u′_i), for i=0 up to i=k′−1.
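Conjunction and disjunction can equivalently be computed as interval intersection and union by a two-pointer merge over the two packet lists, rather than by the in-place unset/set sequences described above. A sketch (our formulation; it produces a fresh code instead of modifying C):

```java
import java.util.ArrayList;
import java.util.List;

public class BoolOps {
    // Intersection: keep the overlap of every pair of packets.
    static List<int[]> and(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int l = Math.max(a.get(i)[0], b.get(j)[0]);
            int u = Math.min(a.get(i)[1], b.get(j)[1]);
            if (l < u) out.add(new int[] { l, u });
            // advance whichever packet ends first
            if (a.get(i)[1] < b.get(j)[1]) i++; else j++;
        }
        return out;
    }

    // Union: sweep both packet lists in order of lower bound,
    // coalescing overlapping or adjacent runs into one packet.
    static List<int[]> or(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        int[] cur = null;
        while (i < a.size() || j < b.size()) {
            int[] next = (j >= b.size()
                    || (i < a.size() && a.get(i)[0] <= b.get(j)[0]))
                    ? a.get(i++) : b.get(j++);
            if (cur != null && next[0] <= cur[1])   // overlap or adjacency
                cur[1] = Math.max(cur[1], next[1]); // extend current packet
            else
                out.add(cur = new int[] { next[0], next[1] });
        }
        return out;
    }
}
```

Note that the union must coalesce adjacent packets (next[0] ≤ cur[1]) to preserve packet maximality, mirroring the adjacency tests of the in-place set(n,m).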
A.3.3 Negation
Since a bit vector is open-ended, its negation is defined only up to a given length n at least as large as the vector's length (i.e., greater than its highest 1-bit position). This operation is denoted C.not(n). Thus, {}.not(n) is undefined for any n ≥ 0.

Otherwise, for a non-empty code C = { 〈l_0,u_0〉, …, 〈l_{k−1},u_{k−1}〉 } and n ≥ u_{k−1}, C.not(n) modifies C to become:
$$\{~\langle{0,l_{0}}\rangle, \langle{u_{0},l_{1}}\rangle, \ldots, \langle{u_{k-2},l_{k-1}}\rangle, \langle{u_{k-1},n}\rangle~\}. $$

Again, following our convention, if l_0 = 0, the pair 〈0,l_0〉 being undefinable, the first element of C.not(n) is 〈u_0,l_1〉. Similarly, if n = u_{k−1}, then 〈u_{k−1},n〉 is undefinable and the last element of C.not(n) is 〈u_{k−2},l_{k−1}〉.
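A sketch of C.not(n) following this convention (our naming; pairs with an empty range are simply not emitted, and the input code is assumed non-empty with n ≥ u_{k−1}):

```java
import java.util.ArrayList;
import java.util.List;

public class NotOp {
    // Complement a compact code up to length n: emit the gaps between packets.
    static List<int[]> not(List<int[]> c, int n) {
        if (c.isEmpty())
            throw new IllegalArgumentException("{}.not(n) is undefined");
        List<int[]> out = new ArrayList<>();
        int start = 0;                        // candidate gap start: <0, l_0>
        for (int[] p : c) {
            if (start < p[0]) out.add(new int[] { start, p[0] });
            start = p[1];                     // next gap starts at u_i
        }
        if (start < n) out.add(new int[] { start, n });  // final gap <u_{k-1}, n>
        return out;
    }
}
```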
A.4 Implementation considerations
We need a data structure for representing a compact code that retains maximal efficiency in the bit-setting and unsetting operations, and hence in the boolean operations that rely on them.

The most frequently used operations on such a data structure C, for an integer n, are:

- C.packet(n): for n within a packet in C, returning that packet's number;

- C.prev(n): for n between packets in C, returning the upper index of the packet preceding n;

- C.next(n): for n between packets in C, returning the lower index of the packet following n;

- adding/removing a packet.
Because the elements of a code sequence are ranges rather than integers, one cannot expect hashed \(\mathcal {O}(1)\)-time access to find out whether a given integer lies within or between packets. So structures such as those defined by the Java classes java.util.HashSet or java.util.LinkedHashSet cannot be used.
In order to make these operations at most \(\mathcal {O}(\log (k))\) time for a compact code of size k, one way is to represent a compact code as a balanced binary tree of pairs of bit position spans 〈l i ,u i 〉, taking advantage of the ordering imposed by condition (1).
Thus, the java.util.TreeSet class looks like a convenient choice, since it offers the required data structure properties in addition to defining methods such as first, last, higher, lower, add, remove, etc., as well as an order-respecting iterator.
On the other hand, java.util.TreeSet lacks a replace method, which is critically needed for setting and unsetting bits. It also lacks an insert method for splicing a new sequence of packet pairs into an existing compact code, an operation that may also be needed often. One must instead resort to several add/remove invocations to replace or insert elements, each incurring a new search (and possibly intermediate rebalancing of the tree). This is wasteful, since replacing and inserting can be done in \(\mathcal {O}(1)\) time once the required elements have been found, with only one final \(\mathcal {O}(\log (k))\) tree rebalancing.
Hence, rather than relying on the ready-to-use java.util.TreeSet class, it may be more beneficial to implement a dedicated compact-code class combining a doubly-linked list with a balanced binary tree adapted to the specific nature of its pair elements. This would expose the double links of each pair element and ease replacement and insertion.Footnote 30
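Short of writing that custom class, one stock-JDK approximation (a sketch, not the authors' implementation) is to key a java.util.TreeMap on each packet's lower bound l_i, with u_i as value; floorEntry and ceilingKey then give the \(\mathcal {O}(\log (k))\) positional queries, although replacement still incurs the re-search cost discussed above:

```java
import java.util.Map;
import java.util.TreeMap;

public class TreeCode {
    // Packets <l, u>, keyed on l; the red-black tree keeps them ordered.
    final TreeMap<Integer, Integer> packets = new TreeMap<>();

    void add(int l, int u) { packets.put(l, u); }

    // True iff bit n lies within some packet <l, u>.
    boolean within(int n) {
        Map.Entry<Integer, Integer> e = packets.floorEntry(n); // greatest l <= n
        return e != null && n < e.getValue();
    }

    // For n between packets: u of the preceding packet, or -1 if none.
    int prev(int n) {
        Map.Entry<Integer, Integer> e = packets.floorEntry(n);
        return e == null ? -1 : e.getValue();
    }

    // For n between packets: l of the following packet, or -1 if none.
    int next(int n) {
        Integer l = packets.ceilingKey(n);   // least l >= n
        return l == null ? -1 : l;
    }
}
```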
A.5 Discussion
A similar data structure was proposed by researchers in data and knowledge bases in (Agrawal et al. 1989). However, the authors did not use that representation for lattice operations as we do here. Instead, they focused on using it to obtain more compact range-sequence codes for the transitive closure of the “is-a” relation of a taxonomy. That representation is equivalent to the one we specify here and to the one in (Aït-Kaci et al. 1989). Contrary to (Aït-Kaci et al. 1989), they define an element’s code as a union of index ranges taken from a post-order arrangement of a spanning tree of the “is-parent-of” relation of a taxonomy. Each concept in the taxonomy (i.e., each element in the poset) of post-order index j is then encoded as the interval 〈i,j〉, where i is the smallest post-order index of all its descendants. Although they did not do so, it is easy to show that their representation is equivalent to bit vectors. But they did not specify lattice operations on their data structures as we do for ours in this document. What they focused on was minimizing the total number of packets in range codes. To do so, they suggest generating codes from an “optimal” spanning tree yielding the most compact set of codes. The data structure and algorithm for what they call “compressed transitive closure” do not maintain the dynamic interval consistency caused by potential adjacency, as we do here.Footnote 31 Although they do cite (Aït-Kaci et al. 1989), Agrawal et al. do so only in the conclusion, as they had just noticed its publication. They suggest that their approach and that proposed in Aït-Kaci et al. (1989) might be combined for processing large taxonomies. As far as we know, no follow-up on this suggestion was carried out.
It is clear how the work of Agrawal et al. (1989), although orthogonal to ours, could be adapted to our needs in order to improve space consumption. Note, however, that their code-compaction method requires a topologically ordered poset. For very large taxonomies (over one million elements), the price of sorting the taxonomy may be worth paying only as once-and-for-all preprocessing prior to query time (Aït-Kaci and Amir 2013; Amir and Aït-Kaci 2013). Also, the question of incrementality is not addressed.
Finally, although this work was motivated by obtaining a compact representation of the binary codes encoding a partial order, it is evident that the data structure, and the operations on it, specified in this appendix can represent any set of integers (or integer-indexed set) viewed as a sequence of intervals. Set intersection is realized as the conjunction of Section A.3.1; set union as the disjunction of Section A.3.2; and set complementation as the negation of Section A.3.3. The structure can therefore readily be used for this purpose as well.
Cite this article
Aït-Kaci, H., Amir, S. Classifying and querying very large taxonomies with bit-vector encoding. J Intell Inf Syst 48, 1–25 (2017). https://doi.org/10.1007/s10844-015-0383-2