DOI: 10.1145/872757.872770
Article

Winnowing: local algorithms for document fingerprinting

Published: 09 June 2003

ABSTRACT

Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we give experimental results on Web data and report experience with MOSS, a widely used plagiarism detection service.
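The abstract names winnowing but does not describe its steps. As a rough illustration only, the sketch below follows the commonly described form of the technique: hash every k-character substring, slide a window of w consecutive hashes over the result, and record the minimum hash in each window, preferring the rightmost minimum on ties. The function names, the use of CRC-32 as the hash, and the values of `k` and `w` are assumptions made for this example and are not taken from the text above.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every k-byte substring (k-gram) of the text.

    CRC-32 is an arbitrary stand-in hash chosen for this sketch.
    """
    data = text.encode("utf-8")
    return [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]


def winnow(hashes, w):
    """Return (position, hash) fingerprints selected by a sliding window.

    For each window of w consecutive hashes, pick the minimum value,
    taking the rightmost occurrence when the minimum is repeated.
    A position is recorded only once even if several windows pick it.
    """
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:
            fingerprints.append((pos, smallest))
            last_pos = pos
    return fingerprints


if __name__ == "__main__":
    # Fingerprint a short string with illustrative parameters k=5, w=4.
    for pos, value in winnow(kgram_hashes("adorunrunrunadorunrun", k=5), w=4):
        print(pos, value)
```

Because the selection depends only on the contents of each window, two documents that share a sufficiently long substring select the same hashes from it, which is the sense in which such a scheme is local.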

Published in

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
