Applying VSM and LCS to develop an integrated text retrieval mechanism
Introduction
Text retrieval is one of the most important issues in computer science (Delen and Crossland, 2008, Gupta and Lehal, 2009, Tan, 1999). It refers to the process of finding the stored texts that contain a specific query (e.g., a term or a sentence) (Blair & Maron, 1985). Because most information (over 80%) is now stored in electronic form (Gupta and Lehal, 2009), current information systems must be capable of retrieving large amounts of text data (Berry & Castellanos, 2007). Text retrieval technologies provide a promising way to enable these systems to retrieve information, and consequently they have received much attention in the field of computer science.
Vector space models (VSM) and similarity measurement (SM) are two of the most significant techniques in text retrieval (Salton and Lesk, 1968, Salton and Yang, 1973, Zhang and Rasmussen, 2001). VSM is used to implement text representation, in which the query and the texts are segmented into numerical elements, each corresponding to a specific term. VSM employs a text collection to transform the query and the texts into such numerical elements in a vector space, so that the query and each text can be represented as a vector. Once the query and the texts are represented in this way, the similarity between them can be calculated through SM, which determines whether they are related to the same topics (Zhang & Rasmussen, 2001). The most widely used measurements for this process include the Cosine, Dice, and Jaccard approaches (Manning & Schutze, 2001). Cosine is based on the angle between vectors, while Dice and Jaccard are based on the intersection and union between vectors. Thus, the texts that are relevant to a specific query can be retrieved through VSM and SM (Zhang & Rasmussen, 2001).
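The three measurements named above can be illustrated with a minimal sketch. It assumes the common real-valued generalizations of Dice and Jaccard (dot product in place of intersection, squared norms in place of set sizes); the exact formulas used in this paper are not shown in this excerpt, and the function names and the sample vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length term-weight vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between a and b (undefined for zero vectors)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def dice(a, b):
    """Extended Dice coefficient: 2(a.b) / (|a|^2 + |b|^2)."""
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))


def jaccard(a, b):
    """Extended Jaccard coefficient: (a.b) / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)


# Hypothetical term-weight vectors for a query and a text over three terms.
query = [1.0, 2.0, 0.0]
text = [2.0, 1.0, 1.0]

if __name__ == "__main__":
    print(cosine(query, text), dice(query, text), jaccard(query, text))
```

Under these definitions Cosine depends only on the direction of the vectors, whereas Dice and Jaccard also change when the magnitudes of the weights change.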
Although VSM and SM are accepted as useful ways of retrieving text data, a significant problem remains with their current applications. VSM uses the frequency of occurrence of each term (referred to as the weight information in the rest of this paper) to transform the query and the texts into vectors, and SM then uses these vectors to calculate the similarity between the texts and the query. Consequently, the sequence of occurrence of the terms (hereafter, the sequence information) is not used as a basis for the similarity judgment. This is a significant problem, because the sequence in which terms occur plays an important role in text retrieval. For example, a user may want to retrieve a book whose content follows a specific sequence. Current text retrieval systems cannot provide such a service, so the user must check books one by one to find the required one. It is thus of considerable interest to find a way to overcome this problem in text retrieval applications.
In this paper, an integrated text retrieval (ITR) mechanism is proposed that takes both the weight and the sequence information into account. In the ITR mechanism, VSM is adopted to deal with the weight information, and the longest common subsequence (LCS) concept is used to cope with the sequence information. LCS finds the longest subsequence common to two strings, so it can be used to evaluate the sequential relationship between the query and the texts (Huang et al., 2008, Xiao et al., 2005). In the proposed mechanism, VSM is used to transform the query and the texts into numerical elements in a vector space, and then LCS is used to adjust the numerical elements according to the sequential relationship between the text and the query. Once the numerical elements are produced by VSM and LCS, SM is used to calculate the similarity between the text and the query. In this manner, both the weight and the sequence information serve as the basis of the similarity judgment. To explore the feasibility of the ITR mechanism, a set of numerical analyses was conducted. The results show that the proposed mechanism can increase the similarity on the Dice and Jaccard measurements if a sequential relationship exists between the text and the query.
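The LCS concept mentioned above can be sketched with the standard dynamic-programming algorithm. This is only an illustration of how the length of the longest common subsequence between a query and a text might be computed over term sequences; how the paper uses that value to adjust the numerical elements is not shown in this excerpt, and the function name and sample terms are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Uses the classic dynamic-programming recurrence, keeping only the
    previous row of the table, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching items extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical term sequences for a query and a text.
query_terms = ["text", "retrieval", "with", "vsm"]
text_terms = ["vsm", "text", "and", "retrieval"]

if __name__ == "__main__":
    print(lcs_length(query_terms, text_terms))
```

Because a subsequence need not be contiguous, the terms "text" and "retrieval" still count as a common ordered pair here even though another term separates them in the text.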
The rest of this paper is organized as follows. The background and related studies are described in Section 2. In Section 3, we describe the ITR mechanism. Section 4 shows the results of the numerical analysis. Finally, a brief conclusion to this work is given in Section 5.
Background and related studies
Similarity measurement is one of the most important processes in text retrieval, and this section describes the current approaches (Cunningham, 2009, Manning et al., 2008). In this work, we focus on the vector- and sequence-based methods; their classification is shown in Fig. 1. Vector-based schemes include the Cosine, Jaccard, and Dice approaches (Kim & Choi, 1999), while sequence-based ones include the WF, NW, and SW algorithms (Lavenier &
Integrated text retrieval mechanism
This study adopts the weight and sequence information to develop an integrated text retrieval (ITR) mechanism. Fig. 2 shows the flow diagram of the ITR mechanism, which is composed of five stages. First, the terms are extracted from the query and the texts in the text preprocessing stage. Second, tf–idf is used to transform the terms of the query and the texts into tf–idf weights in the text representation stage. Third, LCS is used to calculate the sequential relationship between the query
Proof of Cosine
This section proves that the Cosine measurement is not suitable for the ITR mechanism, because it uses the angle between vectors to calculate the similarity. The assumptions used in this proof are listed below.
- Assume that v1, v2 are vectors.
- Assume that v1 is equal to (100, 20).
- Assume that v2 is equal to (200, 60).
- Assume that v1′ is the result of text re-representation of v1.
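As a check on the assumptions above, the standard Cosine formula can be applied to v1 and v2. The worked calculation below uses only the two vectors given in the list; the formula itself is the usual definition of the cosine of the angle between two vectors and is assumed to be the one intended here.

```latex
\cos(v_1, v_2)
  = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
  = \frac{100 \cdot 200 + 20 \cdot 60}{\sqrt{100^2 + 20^2}\,\sqrt{200^2 + 60^2}}
  = \frac{21200}{\sqrt{10400}\,\sqrt{43600}}
  \approx 0.996
```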
The result of the Cosine measurement between v1 and v2 is 0.996. After the text re-representation, v1′ is equal to (190,
Conclusions
Text retrieval is a critical technology in information systems. However, previous studies have not considered the effect of the sequence of the information. In this paper, we integrated VSM and LCS to develop an ITR mechanism, which is used to deal with the weight and sequence information. First of all, VSM was used to evaluate the weight information between the query and the texts, and then LCS was used to evaluate the sequential information between the query and the texts. Afterward, the
Acknowledgements
The authors thank the National Science Council of the Republic of China for financially supporting this research under Contract No. NSC 97-2511-S-006-001-MY3, NSC 98-2631-S-024-001, and NSC 99-2631-S-006-001.
References (29)
- et al. Seeding the survey and analysis of research literature with text mining. Expert Systems with Applications (2008)
- et al. A comparison of collocation-based similarity measures in query expansion. Information Processing & Management (1999)
- et al. Generalized Needleman–Wunsch algorithm for the recognition of T-cell epitopes. Expert Systems with Applications (2008)
- et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology (1970)
- et al. Term-weighting approaches in automatic text retrieval. Information Processing and Management (1988)
- et al. Identification of common molecular subsequences. Journal of Molecular Biology (1981)
- et al. Sequence alignment and penalty choice: Review of concepts, case studies and implications. Journal of Molecular Biology (1994)
- et al. Developing a new similarity measure from two different perspectives. Information Processing & Management (2001)
- et al. Survey of Text Mining: Clustering, Classification, and Retrieval (2007)
- et al. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM (1985)
- Phase-only filtering for the masses (of DNA data): A new approach to sequence alignment. IEEE Transactions on Signal Processing
- Introduction to Algorithms
- A taxonomy of similarity mechanisms for case-based reasoning. IEEE Transactions on Knowledge and Data Engineering
- A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence