Differences and Identities in Document Retrieval in an Annotation Environment

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4777)

Abstract

Digital annotation of web pages raises two problems that are unknown to traditional annotation and that stem from the dynamic and open nature of the Web. The first is that a document may be replicated over multiple sites, so that the same content can be retrieved at different URLs or through different queries. A web page must therefore be associated with all the annotations pertaining to its content, even those created while accessing that content under a different URL. The second is that individual HTML pages change over time, typically through insertions, deletions, or movements of page segments. Annotations attached to portions that have moved within the page should still be retrieved and shown to the user. To reduce the impact of these phenomena on the usefulness of annotation, our annotation system MADCOW incorporates two algorithms: one assesses the identity of two pages under different URLs, and the other detects the differences between two versions of a page under the same URL. Each takes the appropriate action to retrieve all the pertaining annotations.
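The abstract does not say how the two algorithms work, so the following is only an illustrative sketch of the two tasks it names, not the MADCOW implementation. It assumes a word-shingle Jaccard resemblance (in the style of Broder's resemblance measure) for judging whether two URLs serve the same content, and a plain substring search for re-locating an annotated passage that has moved within a page. The function names `shingles`, `resemblance`, and `relocate`, and the shingle size `k`, are hypothetical choices made for this example.

```python
import re
from typing import Optional, Set, Tuple


def shingles(text: str, k: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty pages are trivially identical
    return len(sa & sb) / len(sa | sb)


def relocate(annotated: str, new_page: str) -> Optional[int]:
    """Return the offset of the annotated passage in the new page version,
    or None if the passage no longer occurs verbatim (assumed: exact match)."""
    pos = new_page.find(annotated)
    return pos if pos != -1 else None
```

Under these assumptions, two pages could be treated as the same document when `resemblance` exceeds a chosen threshold, and an annotation could be re-anchored at the offset returned by `relocate`. A real system would also need to handle passages that were edited as well as moved, which the exact-match search here does not attempt.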

Editor information

Subhash Bhalla

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bottoni, P., Cuomo, M., Levialdi, S., Panizzi, E., Passavanti, M., Trinchese, R. (2007). Differences and Identities in Document Retrieval in an Annotation Environment. In: Bhalla, S. (ed.) Databases in Networked Information Systems. DNIS 2007. Lecture Notes in Computer Science, vol 4777. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75512-8_11

  • DOI: https://doi.org/10.1007/978-3-540-75512-8_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75511-1

  • Online ISBN: 978-3-540-75512-8

  • eBook Packages: Computer Science, Computer Science (R0)
