skip to main content
research-article

Using web corpus statistics for program analysis

Published:15 October 2014Publication History
Skip Abstract Section

Abstract

Several program analysis tools - such as plagiarism detection and bug finding - rely on knowing a piece of code's relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common a line of code is. Natural language processing researchers have encountered a similar problem and addressed it using an n-gram model of text frequency, derived from statistics computed over text corpora.

We propose and compute an n-gram model for programming languages, computed over a corpus of 2.8 million JavaScript programs we downloaded from the Web. In contrast to previous techniques, we describe a code n-gram as a subgraph of the program dependence graph that contains all nodes and edges reachable in n steps from the statement. We can count n-grams in a program and count the frequency of n-grams in the corpus, enabling us to compute tf-idf-style measures that capture the differing importance of different lines of code. We demonstrate the power of this approach by implementing a plagiarism detector with accuracy that beats previous techniques, and a bug-finding tool that discovered over a dozen previously unknown bugs in a collection of real deployed programs.

References

  1. A. V. Aho, M. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools: Second Edition. Addison-Wesley, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In EMNLP-CoNLL, pages 858--867, 2007.Google ScholarGoogle Scholar
  3. D. Cai and M. Kim. An empirical study of long-lived code clones. In FASE, pages 432--446, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. W. S. Evans, C. W. Fraser, and F. Ma. Clone detection via structural abstraction. Software Quality Journal, 17(4):309--330, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319--349, July 1987. ISSN 0164-0925. URL http://doi.acm.org/10.1145/24039. 24041. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In Software Engineering, 2008. ICSE'08. ACM/IEEE 30th International Conference on, pages 321--330. IEEE, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Green, P. C. Lane, A. Rainer, S. Bennett, and S.-B. Scholz. Same difference: Detecting collusion by finding unusual shared elements. In Proceedings of the Fifth International Plagiarism Conference, 2012.Google ScholarGoogle Scholar
  8. A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE '12, pages 837--847, Piscataway, NJ, USA, 2012. IEEE Press. ISBN 978-1-4673-1067-3. URL http://dl.acm.org/citation.cfm?id=2337223.2337322. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Islam and D. Inkpen. Real-word spelling correction using google web 1tn-gram data set. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM, pages 1689--1692, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-512-3. URL http://doi.acm.org/10.1145/1645953.1646205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on Software Engineering, ICSE '07, pages 96--105, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2828-7. URL http://dx.doi.org/10.1109/ICSE.2007.30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. E. Juergens, F. Deissenboeck, B. Hummel, and S.Wagner. Do code clones matter? In Proceedings of the 31st International Conference on Software Engineering, ICSE '09, pages 485--495, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3453-4. URL http://dx.doi.org/10.1109/ICSE.2009.5070547. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. W. M. Khoo, A. Mycroft, and R. Anderson. Rendezvous: a search engine for binary code. In Proceedings of the Tenth International Workshop on Mining Software Repositories, pages 329--338. IEEE Press, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In Static Analysis, pages 40--56. Springer, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Z. Li, S. Lu, S. Myagmar, and Y. Zhou. Cp-miner: finding copy-paste and related bugs in large-scale software code. Software Engineering, IEEE Transactions on, 32(3):176--192, 2006. ISSN 0098-5589. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Scalable statistical bug isolation. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, PLDI '05, pages 15--26, New York, NY, USA, 2005. ACM. ISBN 1-59593-056-6. URL http://doi.acm.org/10.1145/1065010.1065014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Liu, C. Chen, J. Han, and P. S. Yu. Gplag: detection of software plagiarism by program dependence graph analysis. In KDD, pages 872--881, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. Lucia, B. P. Wood, and L. Ceze. Isolating and understanding concurrency errors using reconstructed execution fragments. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI '11, pages 378--388, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0663-8. URL http://doi.acm.org/10.1145/1993498.1993543. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, 2008. Google ScholarGoogle ScholarCross RefCross Ref
  19. J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331:176--, Jan. 2011.Google ScholarGoogle Scholar
  20. S. S. Muchnick. Advanced compiler design and implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997. ISBN 1-55860-320-4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 532--542, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2237-9. URL http://doi.acm.org/10.1145/2491411.2491458. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with jplag. Journal of Universal Computer Science, 8:1016--1038, 2001.Google ScholarGoogle Scholar
  23. M. Ravallion. The two poverty enlightenments: Historical insights from digitized books spanning three centuries. Poverty and Public Policy, 3(2):1--46, 2011. ISSN 1944-2858.Google ScholarGoogle ScholarCross RefCross Ref
  24. V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, page 44. ACM, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In SIGMOD Conference, pages 76--85, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. In P.Willett, editor, Document retrieval systems, pages 132--142. Taylor Graham Publishing, London, UK, UK, 1988. ISBN 0-947568-21-2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. L. Thomas, S. Valluri, and K. Karlapalem. Margin: Maximal frequent subgraph mining. In Data Mining, 2006. ICDM '06. Sixth International Conference on, pages 1097--1101, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. Williams, J. Huan, and W. Wang. Graph database indexing using structured graph decomposition. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 976--985, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  29. X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, SIGMOD '04, pages 335--346, New York, NY, USA, 2004. ACM. ISBN 1-58113-859-8. URL http://doi.acm.org/10.1145/1007568.1007607. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. L. Zou, L. Chen, J. X. Yu, and Y. Lu. A novel spectral coding in a large graph database. In Proceedings of the 11th international conference on Extending database technology: Advances in database technology, EDBT '08, pages 181--192, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-926-5. URL http://doi.acm.org/10.1145/1353343.1353369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. L. Zou, L. Chen, andM. T. Özsu. Distance-join: pattern match query in a large graph database. Proc. VLDB Endow., 2(1): 886--897, Aug. 2009. ISSN 2150-8097. URL http://dl.acm.org/citation.cfm?id=1687627.1687727. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Using web corpus statistics for program analysis

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGPLAN Notices
            ACM SIGPLAN Notices  Volume 49, Issue 10
            OOPSLA '14
            October 2014
            907 pages
            ISSN:0362-1340
            EISSN:1558-1160
            DOI:10.1145/2714064
            • Editor:
            • Andy Gill
            Issue’s Table of Contents
            • cover image ACM Conferences
              OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications
              October 2014
              946 pages
              ISBN:9781450325851
              DOI:10.1145/2660193

            Copyright © 2014 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 15 October 2014

            Check for updates

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader