ABSTRACT
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we give experimental results on Web data and report experience with MOSS, a widely used plagiarism detection service.
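The abstract does not describe winnowing itself. As a rough illustration, the sketch below follows the commonly described form of the technique: hash every k-gram of the text, slide a window of w consecutive hashes, and keep the minimum hash of each window (the rightmost one on ties). The names `kgram_hashes` and `winnow`, the defaults `k=5` and `w=4`, and the use of CRC-32 as the hash function are assumptions made for this example, not details taken from the abstract.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every k-character substring (k-gram) of text.

    CRC-32 is an assumed stand-in for whatever hash a real system uses;
    it is chosen here only because it is deterministic across runs.
    """
    return [
        zlib.crc32(text[i:i + k].encode("utf-8"))
        for i in range(len(text) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return a set of (hash, position) fingerprints for text.

    For each window of w consecutive k-gram hashes, keep the minimum hash,
    breaking ties in favour of the rightmost occurrence. Texts with fewer
    than w k-grams yield an empty set in this sketch.
    """
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost position of the minimum within this window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


if __name__ == "__main__":
    a = "xxxx the quick brown fox yyyy"
    b = "completely different the quick brown fox text"
    shared = {h for h, _ in winnow(a)} & {h for h, _ in winnow(b)}
    print(len(shared) > 0)
```

Because the choice made in each window depends only on that window's contents, two documents that share a substring of at least `w + k - 1` characters will share at least one fingerprint hash under this sketch.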
Index Terms
- Winnowing: local algorithms for document fingerprinting