
HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing

Published: 01 December 2013

Abstract

Parallelizing HTML parsing is challenging because of the complexities of HTML documents and the inherent dependencies in the HTML parsing algorithm. As a result, despite numerous studies in parallel parsing, HTML parsing remains sequential today. It is one of the final barriers to fully parallelizing browser operations and thereby minimizing the browser's response time, an important variable for user experience, especially on portable devices. This article provides a comprehensive analysis of the special complexities of parallel HTML parsing and presents a systematic exploration of how to overcome those difficulties through specially designed speculative parallelizations. This work develops, to the best of our knowledge, the first pipelining and data-level parallel HTML parsers. The data-level parallel parser, named HPar, achieves up to 2.4× speedup on quad-core devices. This work demonstrates the feasibility of efficient, parallel HTML parsing for the first time and offers a set of novel insights for parallel HTML parsing.

      • Published in

        ACM Transactions on Architecture and Code Optimization  Volume 10, Issue 4
        December 2013
        1046 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/2541228
        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 December 2013
        • Accepted: 1 November 2013
        • Revised: 1 September 2013
        • Received: 1 June 2013

        Qualifiers

        • research-article
        • Research
        • Refereed
