Abstract
Parallelizing HTML parsing is challenging due to the complexity of HTML documents and the inherent dependencies in the HTML parsing algorithm. As a result, despite numerous studies of parallel parsing, HTML parsing remains sequential today. It is one of the final barriers to fully parallelizing browser operations and minimizing the browser’s response time, an important factor in user experience, especially on portable devices. This article provides a comprehensive analysis of the special complexities of parallel HTML parsing and presents a systematic exploration of how to overcome those difficulties through specially designed speculative parallelization. This work develops, to the best of our knowledge, the first pipelining and data-level parallel HTML parsers. The data-level parallel parser, named HPar, achieves up to 2.4× speedup on quad-core devices. This work demonstrates the feasibility of efficient, parallel HTML parsing for the first time and offers a set of novel insights for parallel HTML parsing.
Index Terms
- HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing