DOI: 10.1145/3018743.3018772
research-article

Grammar-aware Parallelization for Scalable XPath Querying

Published: 26 January 2017

ABSTRACT

Semi-structured data emerge in many domains, especially in web analytics and business intelligence. However, querying such data is inherently sequential due to the nested structure of input data. Existing solutions pessimistically enumerate all execution paths to circumvent dependencies, yielding sub-optimal performance and limited scalability.

This paper presents GAP, a parallelization scheme that, for the first time, leverages the grammar of the input data to boost parallelization efficiency. GAP uses static analysis to infer the feasible execution paths for specific contexts based on the grammar of the semi-structured data, so it can eliminate unnecessary paths without compromising correctness. In the absence of a pre-defined grammar, GAP switches into a speculative execution mode and uses a potentially incomplete grammar extracted from prior inputs. Together, the dual-mode GAP reduces the execution paths from all paths to a minimum, thereby maximizing parallelization efficiency and scalability. The benefits of path elimination go beyond reducing extra computation: it also enables the use of more efficient data structures, which further improves efficiency. An evaluation on a large set of standard benchmarks with diverse queries shows that GAP yields a significant efficiency increase and boosts the speedup of the state of the art from 2.9X to 17.6X on a 20-core machine for a set of 200 queries.
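
The path-elimination idea can be illustrated with a small sketch. This is not the GAP implementation; it is a toy model under stated assumptions. The grammar (`GRAMMAR`), the query (`QUERY`), and the helper names (`step`, `all_states`, `feasible_states`) are all hypothetical. The query is a plain child-axis path, and a worker is assumed to start its chunk at an opening tag. A pessimistic scheme would start that worker from every automaton state; the grammar-aware variant starts only from the states that some root-to-tag path permitted by the grammar can actually produce.

```python
from collections import deque

# Hypothetical toy grammar: each element maps to the children it may contain.
GRAMMAR = {
    "site": ["people", "items"],
    "people": ["person"],
    "person": ["name", "email"],
    "items": ["item"],
    "item": ["name", "price"],
}
ROOT = "site"

# Hypothetical query /site/people/person/name, one tag per step.
QUERY = ["site", "people", "person", "name"]
DEAD = -1  # state for paths that can no longer match


def step(state, tag):
    """Advance the query automaton on an opening tag."""
    if state != DEAD and state < len(QUERY) and QUERY[state] == tag:
        return state + 1
    return DEAD


def all_states():
    """Pessimistic choice: every automaton state is a possible start."""
    return set(range(len(QUERY) + 1)) | {DEAD}


def feasible_states(tag):
    """States the automaton may hold just before an opening `tag`,
    considering only root-to-tag paths that the grammar allows."""
    feasible = {0} if tag == ROOT else set()
    seen = set()
    queue = deque([(ROOT, step(0, ROOT))])
    while queue:
        parent, state = queue.popleft()
        if (parent, state) in seen:
            continue
        seen.add((parent, state))
        for child in GRAMMAR.get(parent, []):
            if child == tag:
                feasible.add(state)  # state held inside the parent element
            queue.append((child, step(state, child)))
    return feasible


if __name__ == "__main__":
    print(len(all_states()), "states without grammar knowledge")
    print(sorted(feasible_states("name")), "feasible states before <name>")
```

In this toy setting a chunk that begins at `<name>` needs two starting states instead of six, because `name` can appear only under `person` or `item`. A chunk that begins at `<person>` needs one.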


Published in

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2017
476 pages
ISBN: 9781450344937
DOI: 10.1145/3018743

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

PPoPP '17 paper acceptance rate: 29 of 132 submissions, 22%. Overall acceptance rate: 230 of 1,014 submissions, 23%.
