An enhanced Boolean retrieval model for efficient searching

A large amount of information from every domain is available online as hypertext in web pages. People from different domains consult different websites to fetch the information they need, but it is difficult to remember the names of the websites relevant to a specific domain. A search engine is a system that mines information from the World Wide Web and presents it to the user according to the user's query. An information retrieval system (IRs) working behind a search engine arranges the web documents systematically and retrieves results according to the user query. This paper proposes an efficient Boolean retrieval model that retrieves results according to the Boolean operations specified between the terms of the search query. The proposed model is also capable of storing large indexes.

Information retrieval (IR) is fast becoming the dominant form of information access. IR can also cover kinds of data and information problems beyond its core task of retrieving unstructured documents that satisfy an information need. The term "unstructured data" refers to data that does not have a clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly "unstructured". This is definitely true of all text data if you count the latent linguistic structure of human languages. But even accepting that the intended notion of structure is overt structure, most text has structure, such as headings, paragraphs, and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). IR is also used to facilitate "semi-structured" search, such as finding a document where the title contains Java and the body contains threading.

The field of information retrieval also covers supporting users in browsing or filtering document collections, or in further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which classes, if any, each of a set of documents belongs to. It is often approached by first manually classifying some documents and then hoping to be able to classify new documents automatically.

Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are the need to gather documents for indexing, the ability to build systems that work efficiently at this enormous scale, and the handling of particular aspects of the web, such as exploiting hypertext and not being fooled by site providers who manipulate page content in an attempt to boost their search engine rankings, given the commercial importance of the web.

At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval (such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search). Email programs usually provide not only search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer, and making the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner.

In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation's internal documents, a database of patents, or research articles on biochemistry. In this case, the documents will typically be stored on centralized file systems, and one or a handful of dedicated machines will provide search over the collection.

The figure below shows the retrieval process. IRs perform the following activities to achieve their goal:
1. Indexing, in which the documents are arranged with respect to the terms they contain.
2. Removal of unnecessary words, that is, frequently used words that contribute little to the weightage of a document with respect to its terms.
3. Fetching of documents according to the user query.

Boolean retrieval model
The Boolean retrieval model is a model for information retrieval in which any query can be posed that is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words. The basic assumptions of information retrieval are that the collection is a fixed set of documents and that the goal is to retrieve documents with information that is relevant to the user's information need and helps the user complete a task. The retrieval model considers each document as either relevant or irrelevant to the user query. The figure below visualizes the Boolean retrieval model over three sets of documents.
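Because the model treats each document as a set of words, the three operators correspond to set intersection, union, and complement over document IDs. The following is a minimal sketch of that correspondence, not the implementation of the proposed model; the three toy documents and the helper name `matching` are assumptions made for illustration.

```python
# Toy collection (hypothetical): each document is viewed as just a set of words.
docs = {
    1: "brutus killed caesar",
    2: "caesar was ambitious",
    3: "brutus and calpurnia spoke",
}

all_ids = set(docs)


def matching(term):
    """Return the IDs of the documents whose word set contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in set(text.split())}


# AND is set intersection, OR is set union, NOT is complement against all IDs.
brutus_and_caesar = matching("brutus") & matching("caesar")
brutus_or_caesar = matching("brutus") | matching("caesar")
brutus_not_caesar = matching("brutus") & (all_ids - matching("caesar"))

print(brutus_and_caesar, brutus_or_caesar, brutus_not_caesar)
```

Each document is either in the result set or not, which reflects the relevant/irrelevant decision the model makes for every document.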

Index creation in the Boolean retrieval model
Let us now consider a realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book. We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2-3 book pages). If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. Since each document contains only about 1000 of the 500,000 distinct terms, it is far more practical to record, for each term, only the documents in which it actually occurs. This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval.
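The size estimate above can be checked with a few lines of arithmetic. This sketch uses only the figures stated in the text; the variable names are illustrative, and 1 GB is taken as 10^9 bytes.

```python
# Figures taken from the scenario above.
N_DOCS = 1_000_000        # N, number of documents
WORDS_PER_DOC = 1000      # average document length in words
BYTES_PER_WORD = 6        # average bytes per word, incl. spaces and punctuation

total_tokens = N_DOCS * WORDS_PER_DOC
collection_bytes = total_tokens * BYTES_PER_WORD
collection_gb = collection_bytes / 10**9   # assuming 1 GB = 10^9 bytes

print(total_tokens, collection_gb)
```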

Token sequence
This step generates a sequence of (modified token, document ID) pairs.

Sort by terms
The sequence is sorted alphabetically by term.

Dictionary and postings
Multiple entries of the same term in a single document are merged. The result is split into a dictionary and postings, and document frequency information is added. The process is illustrated in Figure 2.2.3.
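The three steps (token sequence, sorting, and splitting into dictionary and postings) can be sketched as follows. This is an illustrative outline under simple assumptions: tokens are obtained by whitespace splitting, the "modification" of a token is only lowercasing, and the two sample documents are made up for the example.

```python
from collections import defaultdict

# Hypothetical two-document collection used only for illustration.
docs = {
    1: "I did enact Julius Caesar",
    2: "So let it be with Caesar",
}


def normalize(token):
    """Assumed token modification: lowercase only."""
    return token.lower()


# Step 1: token sequence of (modified token, document ID) pairs.
pairs = [(normalize(tok), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term (ties are broken by document ID).
pairs.sort()

# Step 3: merge repeated entries of a term within a document, then split the
# result into postings lists and a dictionary holding document frequencies.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

dictionary = {term: len(doc_ids) for term, doc_ids in postings.items()}

print(dictionary["caesar"], postings["caesar"])
```

Because the pairs are sorted before merging, every postings list comes out ordered by document ID, which is what makes the later merge step efficient.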

Query processing
After the postings lists have been created for the terms, the query is processed to find the resulting documents from the postings. For example, consider processing the query Brutus AND Caesar:
1. Locate Brutus in the dictionary and retrieve its postings.
2. Locate Caesar in the dictionary and retrieve its postings.
3. "Merge" (intersect) the two postings lists to obtain the documents that contain both terms.
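The merge step for an AND query can be sketched as a linear walk over two postings lists sorted by document ID. The function name `intersect` and the sample postings below are assumptions for illustration; they are not taken from the proposed system.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller document ID
        else:
            j += 1
    return answer


# Illustrative postings lists (made-up document IDs).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]

result = intersect(brutus, caesar)
print(result)
```

Each pointer only moves forward, so the merge takes time proportional to the combined length of the two lists rather than to their product.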

Architecture of proposed model
After the user supplies a query, the query is processed and unnecessary terms are removed from it, so that the resulting query contains only the keywords and the proper Boolean operators. The documents available in the repository in HTML form are converted into text documents and preprocessed to remove stop words and other meaningless words. Converting a document to text also reduces its size, because the unnecessary tags of the HTML document are removed. The documents are then fetched one by one from the text repository, and the inverted indexes of their terms are created. The resulting postings are stored in an Excel file: the postings lists grow as the number of documents increases, and the proposed model treats the spreadsheet as a more efficient store for them than other data structures. Finally, the postings are merged and the resulting documents are generated according to the query.
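The preprocessing stages of this architecture (HTML-to-text conversion, removal of unnecessary words, and query cleaning) might look roughly like the sketch below. It relies on Python's standard `html.parser` module; the stop-word list, the class `TextExtractor`, and the functions `html_to_text`, `index_terms`, and `clean_query` are hypothetical names introduced here, and the sketch does not filter out script or style content.

```python
from html.parser import HTMLParser

# Assumed, deliberately tiny stop-word list; a real system would use a larger one.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "to"}
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Convert an HTML document to plain text with normalized whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def index_terms(text):
    """Lowercase the text and drop stop words before indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def clean_query(query):
    """Keep uppercase Boolean operators and non-stop-word keywords."""
    kept = []
    for word in query.split():
        if word in BOOLEAN_OPERATORS:
            kept.append(word)
        elif word.lower() not in STOP_WORDS:
            kept.append(word.lower())
    return kept


page = "<html><body><h1>The death of Caesar</h1><p>Brutus is in Rome</p></body></html>"
text = html_to_text(page)
terms = index_terms(text)
query = clean_query("the Brutus AND Caesar")
print(text, terms, query)
```

The cleaned query keeps operators in uppercase so that they can be told apart from keywords when the postings are merged.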

Storage of indexes
The created indexes are stored in an Excel sheet using the Java Excel API, which provides large storage capacity for the indexes.
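The idea of keeping one spreadsheet row per term can be illustrated with a small round trip. Note that this sketch deliberately swaps in Python's standard `csv` module and an in-memory buffer in place of the Java Excel API and a real Excel file, so it shows only the row layout, not the storage mechanism used by the proposed model. The column names and the functions `save_postings` and `load_postings` are assumptions.

```python
import csv
import io


def save_postings(postings, handle):
    """Write one row per term: term, document frequency, space-separated IDs."""
    writer = csv.writer(handle)
    writer.writerow(["term", "doc_frequency", "postings"])
    for term in sorted(postings):
        doc_ids = postings[term]
        writer.writerow([term, len(doc_ids), " ".join(str(d) for d in doc_ids)])


def load_postings(handle):
    """Read the rows back into a term -> list of document IDs mapping."""
    reader = csv.DictReader(handle)
    return {row["term"]: [int(d) for d in row["postings"].split()]
            for row in reader}


# Illustrative index with made-up document IDs.
postings = {"brutus": [1, 2, 4], "caesar": [1, 2, 4, 5]}

buffer = io.StringIO()          # stands in for a spreadsheet file on disk
save_postings(postings, buffer)
buffer.seek(0)
restored = load_postings(buffer)
print(restored)
```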

Conclusion
In conclusion, information retrieval is the process of finding and fetching knowledge-based information from a cluster or collection of documents. The Boolean retrieval model used for fetching information is more accurate than other retrieval models, in that every returned document satisfies the Boolean expression of the query exactly. The model creates inverted indexes of terms and documents, on which Boolean operations can be applied easily to produce accurate results.