Applying VSM and LCS to develop an integrated text retrieval mechanism

https://doi.org/10.1016/j.eswa.2011.09.039

Abstract

Text retrieval has received a lot of attention in computer science. In the text retrieval field, the most widely adopted similarity technique uses vector space models (VSM) to evaluate the weight of terms and the Cosine, Jaccard, or Dice measurement to calculate the similarity between the query and the texts. However, these similarity techniques do not consider the effect of the sequence of the information. In this paper, we propose an integrated text retrieval (ITR) mechanism that takes advantage of both VSM and the longest common subsequence (LCS) algorithm. The key idea of the ITR mechanism is to use LCS to re-evaluate the weight of terms, so that the sequence and weight relationships between the query and the texts can be considered simultaneously. The results of mathematical analysis show that the ITR mechanism can increase the similarity on the Jaccard and Dice similarity measurements when a sequential relationship exists between the query and the texts.

Introduction

Text retrieval is one of the most important issues in computer science (Delen and Crossland, 2008, Gupta and Lehal, 2009, Tan, 1999). It refers to the process of finding the stored texts that contain a specific query (e.g., a term or a sentence) (Blair & Maron, 1985). Because most information (over 80%) is now stored in electronic form (Gupta and Lehal, 2009), current information systems must be capable of retrieving a large amount of text data (Berry & Castellanos, 2007). Text retrieval technologies provide a promising way to enable these systems to retrieve information. Consequently, they have received much attention in the field of computer science.

Vector space models (VSM) and similarity measurement (SM) are two of the most significant techniques in text retrieval (Salton and Lesk, 1968, Salton and Yang, 1973, Zhang and Rasmussen, 2001). VSM is used to implement text representation, in which the query and the texts are segmented into numerical elements, each corresponding to a specific term. VSM employs a text collection to transform the query and the texts into numerical elements in a vector space, so that the query and each text can be represented as a vector. Once the query and the texts are represented in this way, the similarity between them can be calculated through SM, which determines whether they are related to the same topics (Zhang & Rasmussen, 2001). The most widely used measurements for this process include the Cosine, Dice, and Jaccard approaches (Manning & Schutze, 2001). Cosine is based on the angle between vectors, while Dice and Jaccard are based on the intersection and union between vectors. Thus, VSM and SM together determine whether the texts are relevant to a specific query (Zhang & Rasmussen, 2001).
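The three measurements can be sketched in a few lines of Python. This is a minimal illustration, not the formulation used in the paper: the excerpt does not give the formulas, so the sketch assumes the commonly used real-valued generalizations of Dice and Jaccard, and the function names and sample vectors are hypothetical.

```python
import math

def dot(x, y):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    # Based on the angle between the vectors; ignores their lengths.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def dice(x, y):
    # Assumed real-valued form: 2 * overlap / (sum of squared norms).
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))

def jaccard(x, y):
    # Assumed real-valued form: overlap / (union-like term).
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Hypothetical term weights for a query and a text over three terms.
query = [1.0, 2.0, 0.0]
text = [2.0, 4.0, 0.0]

print(cosine(query, text), dice(query, text), jaccard(query, text))
```

With these sample vectors the text is simply the query scaled by two, so Cosine is approximately 1 while Dice and Jaccard are lower (0.8 and about 0.667). The sketch does not guard against all-zero vectors, which would cause a division by zero.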

Although VSM and SM are accepted as useful ways of retrieving text data, a significant problem remains with their current applications. VSM uses the frequency of occurrence of each term (abbreviated as weight information for the rest of this paper) to transform the query and the texts into vectors, and SM then uses these vectors to calculate the similarity between the texts and the query. This means that the sequence of occurrence of the terms (abbreviated as sequence information hereafter) is not used as a basis for the similarity judgment. This is a significant problem, because the sequence in which terms occur plays an important role in the process of text retrieval. For example, users may want to retrieve a book whose content follows a specific sequence. However, current text retrieval systems cannot provide users with such a service, so users need to check books one by one in order to find the requisite one. It is thus of considerable interest to find a way to overcome this problem in text retrieval applications.
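A small sketch makes the problem concrete. It uses raw term counts as the weight information (a simplifying assumption; the sample texts and the helper name are hypothetical) and shows that two texts with the same terms in a different order map to the same vector.

```python
from collections import Counter

def term_frequency_vector(terms, vocabulary):
    """Count how often each vocabulary term occurs in a list of terms."""
    counts = Counter(terms)
    return [counts[t] for t in vocabulary]

# Two hypothetical texts with identical terms in a different order.
text_a = "install the driver then restart the system".split()
text_b = "restart the system then install the driver".split()

vocabulary = sorted(set(text_a) | set(text_b))
vec_a = term_frequency_vector(text_a, vocabulary)
vec_b = term_frequency_vector(text_b, vocabulary)

print(vec_a == vec_b)    # the weight vectors are identical
print(text_a == text_b)  # although the term sequences differ
```

Because the two vectors are equal, any measurement computed from them alone cannot distinguish the two texts, whatever order the user asked for.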

In this paper, an integrated text retrieval (ITR) mechanism is proposed which takes both the weight and the sequence information into account. In the ITR mechanism, VSM is adopted to deal with the weight information, and the longest common subsequence (LCS) concept is used to cope with the sequence information. LCS finds the longest common subsequence of two strings, which means that it can be used to evaluate the sequential relationship between the query and the texts (Huang et al., 2008, Xiao et al., 2005). In the proposed mechanism, VSM is used to transform the query and the texts into numerical elements in a vector space, and then LCS is used to adjust the numerical elements according to the sequential relationship between the text and the query. Once the numerical elements are produced by VSM and LCS, SM is used to calculate the similarity between the text and the query. In this manner, both the weight and the sequence information serve as the basis of the similarity judgment. To explore the feasibility of the ITR mechanism, a set of numerical analyses was conducted. The results show that our proposed mechanism can increase the similarity on the Dice and Jaccard measurements if a sequential relationship exists between the text and the query.
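For readers unfamiliar with LCS, the following sketch shows the classic dynamic-programming formulation applied to lists of terms rather than characters. It only illustrates how a sequential relationship can be detected; the rule the ITR mechanism uses to adjust the weights is not given in this excerpt and is not reproduced here. The function name and sample terms are hypothetical.

```python
def lcs_terms(a, b):
    """Return one longest common subsequence of two term lists."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the subsequence itself.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

query = ["vector", "space", "model"]
same_order = ["vector", "model", "space", "model"]
reversed_order = ["model", "space", "vector"]

print(lcs_terms(query, same_order))           # ['vector', 'space', 'model']
print(len(lcs_terms(query, reversed_order)))  # 1
```

Both texts contain every query term, but only the first preserves the query's order, which is reflected in the longer common subsequence.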

The rest of this paper is organized as follows. The background and related studies are described in Section 2. In Section 3, we describe the ITR mechanism. Section 4 shows the results of the numerical analysis. Finally, a brief conclusion to this work is given in Section 5.

Section snippets

Background and related studies

Similarity measurement is one of the most important processes in text retrieval, and this section describes the current approaches (Cunningham, 2009, Manning et al., 2008). In this work, we focus on the vector-based and the sequence-based methods, and the details of their classification are shown in Fig. 1. Vector-based schemes include the Cosine, Jaccard, and Dice approaches (Kim & Choi, 1999), while sequence-based ones include the WF, NW, and SW algorithms (Lavenier &

Integrated text retrieval mechanism

This study adopts the weight and sequence information to develop an integrated text retrieval (ITR) mechanism. Fig. 2 shows the flow diagram of the ITR mechanism, which is composed of five stages. First, the terms are extracted from the query and the texts in the text preprocessing stage. Second, tfidf is used to transform the terms of the query and the texts into tfidf weights in the text representation stage. Third, LCS is used to calculate the sequential relationship between the query

Proof of Cosine

This section proves that the Cosine measurement is not suitable for the ITR mechanism, because it uses the angle between vectors to calculate the similarity. The assumptions used in this proof are listed below.

  • Assume that v1, v2 are vectors.

  • Assume that v1 is equal to (100, 20).

  • Assume that v2 is equal to (200, 60).

  • Assume that v1′ is the result of text re-representation of v1.
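For the two vectors given above, the Cosine value can be worked out directly. The derivation assumes the standard definition of the Cosine measurement (inner product divided by the product of the vector lengths), since the formula itself falls outside this excerpt.

```latex
\cos(v_1, v_2)
  = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
  = \frac{100 \cdot 200 + 20 \cdot 60}{\sqrt{100^2 + 20^2}\,\sqrt{200^2 + 60^2}}
  = \frac{21200}{\sqrt{10400}\,\sqrt{43600}}
  \approx \frac{21200}{21294.13}
  \approx 0.996
```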

The result of the Cosine measurement between v1 and v2 is 0.996. After the text re-representation, v1′ is equal to (190,

Conclusions

Text retrieval is a critical technology in information systems. However, previous studies have not considered the effect of the sequence of the information. In this paper, we integrated VSM and LCS to develop an ITR mechanism, which is used to deal with the weight and sequence information. First of all, VSM was used to evaluate the weight information between the query and the texts, and then LCS was used to evaluate the sequential information between the query and the texts. Afterward, the

Acknowledgements

The authors thank the National Science Council of the Republic of China for financially supporting this research under Contract No. NSC 97-2511-S-006-001-MY3, NSC 98-2631-S-024-001, and NSC 99-2631-S-006-001.

References (29)

  • A.K. Brodzik. Phase-only filtering for the masses (of DNA data): A new approach to sequence alignment. IEEE Transactions on Signal Processing (2006)

  • T.H. Cormen et al. Introduction to Algorithms (2001)

  • P. Cunningham. A taxonomy of similarity mechanisms for case-based reasoning. IEEE Transactions on Knowledge and Data Engineering (2009)

  • V. Gupta et al. A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence (2009)