Abstract
When presented with a retrieved document, users of a search engine are usually left with the task of pinning down the relevant information inside the document. This is often done through a time-consuming combination of skimming, scrolling, and Ctrl+F. In the setting of a digital library for scientific literature, the issue is especially urgent when dealing with reference works, such as surveys and handbooks, as these typically contain long documents. Our aim is to develop methods for providing a “go-read-here” type of retrieval functionality, which points the user to a segment where she can best start reading to find out about her topic of interest. We examine multiple query-independent ways of segmenting texts into coherent chunks that can be returned in response to a query. Most (experienced) authors use paragraph breaks to indicate topic shifts, thus providing us with one way of segmenting documents. We compare this structural method with semantic text segmentation methods, with respect to both topical focus and relevance. Our experimental evidence is based on manually segmented scientific documents and a set of queries against this corpus. Structural segmentation based on contiguous blocks of relevant paragraphs is shown to be a viable solution for our intended application of providing “go-read-here” functionality.
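The idea of returning contiguous blocks of relevant paragraphs can be illustrated with a minimal sketch. It is not the authors' implementation: the blank-line paragraph splitter, the term-overlap relevance test, the `min_overlap` threshold, and the rule of preferring the longest block are all assumptions made for illustration, and every function name is hypothetical.

```python
import re


def split_paragraphs(text):
    """Structural segmentation: treat blank lines as paragraph breaks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_relevant(paragraph, query, min_overlap=1):
    """Assumed relevance test: the paragraph shares enough terms with the query."""
    return len(tokenize(paragraph) & tokenize(query)) >= min_overlap


def relevant_blocks(paragraphs, query, min_overlap=1):
    """Group contiguous relevant paragraphs into (start, end) blocks, end inclusive."""
    blocks = []
    start = None
    for i, paragraph in enumerate(paragraphs):
        if is_relevant(paragraph, query, min_overlap):
            if start is None:
                start = i
        elif start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(paragraphs) - 1))
    return blocks


def go_read_here(text, query):
    """Return the (start, end) paragraph indices to start reading at, or None."""
    blocks = relevant_blocks(split_paragraphs(text), query)
    if not blocks:
        return None
    # Assumed selection rule: longest block, ties broken by earliest position.
    return max(blocks, key=lambda b: (b[1] - b[0], -b[0]))
```

For a four-paragraph text in which only the second and third paragraphs mention the query term, `go_read_here` returns `(1, 2)`, the block a reader would be pointed to. A real system would replace the term-overlap test with a proper retrieval model.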
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Caracciolo, C., de Rijke, M. (2006). Generating and Retrieving Text Segments for Focused Access to Scientific Documents. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_31
DOI: https://doi.org/10.1007/11735106_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)