ABSTRACT
Semi-structured data emerge in many domains, especially in web analytics and business intelligence. However, querying such data is inherently sequential due to the nested structure of input data. Existing solutions pessimistically enumerate all execution paths to circumvent dependencies, yielding sub-optimal performance and limited scalability.
This paper presents GAP, a parallelization scheme that, for the first time, leverages the grammar of the input data to boost parallelization efficiency. GAP uses static analysis to infer the feasible execution paths for specific contexts based on the grammar of the semi-structured data, which lets it eliminate unnecessary paths without compromising correctness. In the absence of a pre-defined grammar, GAP switches into a speculative execution mode and takes a potentially incomplete grammar extracted from prior inputs. Together, the dual-mode GAP reduces the execution paths from all paths to a minimum, thereby maximizing parallelization efficiency and scalability. The benefits of path elimination go beyond reducing extra computation: it also enables the use of more efficient data structures, which further improves efficiency. An evaluation on a large set of standard benchmarks with diverse queries shows that GAP yields a significant efficiency increase and boosts the speedup of the state of the art from 2.9X to 17.6X on a 20-core machine for a set of 200 queries.
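To make the idea of grammar-based path elimination concrete, the following is a minimal sketch and not GAP's actual algorithm. It assumes a hypothetical, non-recursive toy grammar (`GRAMMAR`), a simple child-axis query (`QUERY`), and a query automaton whose state is just the number of query steps matched so far (with `DEAD` for a mismatch). All names here are illustrative assumptions. Given the first opening tag of an input chunk, it lists the automaton states the grammar allows at that point, so a worker would not need to enumerate every state.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy grammar: element name -> names of allowed child elements.
# Assumed to be non-recursive so that path enumeration terminates.
GRAMMAR: Dict[str, Set[str]] = {
    "dblp": {"article", "book"},
    "article": {"author", "title"},
    "book": {"author", "title", "publisher"},
    "author": set(),
    "title": set(),
    "publisher": set(),
}
ROOT = "dblp"

# Hypothetical query /dblp/article/title, written as a list of steps.
QUERY: List[str] = ["dblp", "article", "title"]
DEAD = -1  # state for element paths that cannot match the query


def paths_to(tag: str, grammar: Dict[str, Set[str]], root: str) -> List[Tuple[str, ...]]:
    """Return every root-to-`tag` element path that the grammar permits."""
    results: List[Tuple[str, ...]] = []

    def walk(path: Tuple[str, ...]) -> None:
        if path[-1] == tag:
            results.append(path)
        for child in sorted(grammar[path[-1]]):
            walk(path + (child,))

    walk((root,))
    return results


def state_for(path: Tuple[str, ...], query: Sequence[str]) -> int:
    """State after entering `path`: matched-prefix length, or DEAD."""
    n = len(path)
    if n <= len(query) and tuple(query[:n]) == path:
        return n
    return DEAD


def feasible_states(tag: str, query: Sequence[str]) -> Set[int]:
    """States that are possible right after a chunk's first opening tag."""
    return {state_for(p, query) for p in paths_to(tag, GRAMMAR, ROOT)}


if __name__ == "__main__":
    all_states = {DEAD, *range(len(QUERY) + 1)}
    for tag in ("title", "publisher", "article"):
        kept = feasible_states(tag, QUERY)
        print(f"<{tag}>: {len(kept)} of {len(all_states)} states remain: {sorted(kept)}")
```

In this toy setting, a chunk that begins at `<publisher>` can only be in the dead state, because the grammar places `publisher` under `book` alone, while a chunk that begins at `<title>` keeps two candidate states instead of five. A grammar learned from prior inputs could be plugged in the same way, but because it may be incomplete, the result would have to be treated as speculative and verified.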
Grammar-aware Parallelization for Scalable XPath Querying (PPoPP '17)