Document Clustering Approach for Meta Search Engine

The World Wide Web is growing exponentially as technology changes. As a result, a search returns a huge amount of information in the form of a long list of URLs, and a user cannot visit each page individually. If page ranking algorithms are applied properly, the user's search space can be restricted to a few pages of the returned results. However, the available literature shows that no single search system can provide good-quality results across all domains. This paper addresses the problem by introducing a new meta search engine that determines the relevancy of each web page to the query and clusters the results accordingly. The proposed approach reduces user effort and improves both the quality of the results and the performance of the meta search engine.


INTRODUCTION
The web is gigantic [1], and the number of web pages is growing exponentially [2]. Information in such a repository can only be retrieved with tools such as search engines. According to [3], every Search Engine (SE) has a limited search space and area of expertise [4], which may restrict the number of relevant web pages returned to the user. Furthermore, the studies in [5] and [6] indicate that the coverage and precision of different SEs vary and are limited. A single SE therefore cannot fulfil all the requirements of end users across all domains. This is the main reason why technology developers use the concept of a meta search engine (multiple SEs behind a single interface) [7], [8]. A Meta Search Engine (MSE) retrieves information from many SEs concurrently [8], [9]. It receives the URLs from the different SEs, deletes the redundant results and presents the remainder to its users [8]. The efficiency of any retrieval system depends on the relevancy and presentation of the results to the end user [5], and this is where existing MSEs fail. This paper therefore proposes an architecture to overcome the problems of relevancy and presentation of results. The proposed architecture is intended to reduce the effort the end user spends searching through the results.

RELATED WORK
An MSE called Helios was proposed in [10]. Helios was initially implemented using eighteen SEs, a number that can be increased on demand. Its HTTP Retriever module was responsible for handling network communication. The approach was implemented on a dual PIV 2.60 GHz machine with 1.5 GB of RAM and a 100 Mbps internet connection. The performance of Helios was compared with wget: wget retrieved nearly 600 results in 12.4 seconds, whereas Helios retrieved the same number of results in 4.6 seconds. According to the authors, this approach can be used to build highly engineered open-source parallel meta-search engines that are suitable for industrial environments.
An MSE was proposed in [11] in which three SEs (Google, Yahoo and Baidu) were used in the implementation. The positions of the words and the snippets of the web page were used to calculate the similarity of the web page. The top twenty results were selected to test the proposed approach. The results were checked manually against predefined criteria, and TREC-style average precision was used to evaluate them. The authors claimed that the most relevant results appeared at the top of the returned result list.
The authors of [12] proposed a multi-domain MSE for effective presentation of results. It allows the user to select specialized SEs. Relevancy, reliability, redundancy and accessibility of the search results were considered for performance measurement. The search results were shown only in the window of the corresponding search engine. The authors showed that the performance of the proposed MSE is better than that of the individual SEs.
An MSE that learns from query logs by predicting user requirements was proposed in [13]. A query similarity function was used to measure the similarity of a web page with respect to the given query. The authors used 7 queries and 5 functions in their tests. These functions were named: (i) keyword similarity (simkeyword), (ii) similarity using document clicks (simclick), (iii) similarity using both keywords and document clicks (simcombined), (iv) query clustering and (v) rank updater. Similarity based on query keywords was used for similarity calculation and for clustering the results. When calculating similarity, the authors also considered the clicked URLs and a bipartite graph of the query log. Finally, the combination of the proposed similarity measure and a clustering algorithm was used to cluster the queries.
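For illustration only, keyword-based query similarity of the kind named in (i) can be sketched as the overlap between the keyword sets of two queries. The Jaccard measure, the whitespace tokenization and the function name keyword_similarity are assumptions made here; they are not the definition used in [13].

```python
def keyword_similarity(query_a: str, query_b: str) -> float:
    """Jaccard overlap of the keyword sets of two queries (assumed measure)."""
    # Assumption: keywords are lower-cased, whitespace-separated tokens.
    keywords_a = set(query_a.lower().split())
    keywords_b = set(query_b.lower().split())
    if not keywords_a or not keywords_b:
        return 0.0
    shared = keywords_a & keywords_b
    combined = keywords_a | keywords_b
    return len(shared) / len(combined)
```

Under these assumptions, the queries "meta search engine" and "web search engine" share two of four distinct keywords, which gives a similarity of 0.5.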

PROBLEM FORMULATION
The major problems of MSEs [1], [14], [15], [16], [17] are discussed below:
a) Almost every MSE simply receives the results from multiple search engines and does nothing about how they are presented. The results are shown on a first-come, first-served basis. An MSE that presents results to the user more effectively is therefore needed.
b) The MSE proposed in [17] has shown good results, but its relevancy calculation may take considerable time, which limits the usefulness of the returned results.
c) Some approaches in the literature present results using positional ranking and a count function, but this type of ranking fails to provide relevant results to the user.
d) Some MSEs decide the number of clusters to be generated in advance, but provide no method for the case where more URLs are returned than expected.
e) Several MSEs, such as Clusty, generate clusters and name them after the query keywords that occur most often in the documents. When users browse these clusters, however, they find that the contents do not match the cluster names. Cluster naming is therefore also a problem.

PROPOSED ARCHITECTURE
The proposed architecture is shown in Fig. 1. It uses both ranking and clustering to organize and present web search results. Its modules are described below:

i) Consumer Interface (CI): CI is the point of interaction with the outside world, where the user submits a search query and receives the desired results.
ii) Relevancy Calculator (RC): RC assigns a relevancy value to each URL received from the SEs. Several methods, such as VSM, OSM and CDR, can be used to calculate the relevancy of a returned web page, but their calculations appear complex and can be simplified to some extent. For example, Naresh Kumar and R. Nath [17] use VSM, which relies on the number of terms, the length of the document and some logarithmic calculations. The same work can be done by counting the number of times the query terms occur in the document. This method is expected to return almost the same results while reducing time and space complexity.
iii) Cluster Originator (CO): CO creates the desired number of clusters. The rest of the cluster generation process remains the same as in the Cluster Generation module explained in [17].

iv) Web Page Setter (WPS): The main task of WPS is to remove duplicate web pages from the results and to send the ranked web pages to the related cluster. The rest of the WPS process remains the same as that of the WPA explained in [17].

v) Selection of Search Engine (SSE): SSE provides the list of search engines that can be used to search the user query. The user can select any number of the listed search engines, but selecting a larger number of search engines may affect the performance of the MSE. Before selecting the search engines to be used, the user should therefore know each search engine's domain of expertise, which helps the user obtain results effectively and efficiently. It also speeds up the processing of the MSE.
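The count-based relevancy calculation of RC and the duplicate removal and ranking of WPS can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the proposed MSE: the function names, the tokenization by word characters, and the rule that two URLs are duplicates when they differ only in scheme, a leading "www.", or a trailing slash are all hypothetical choices made for illustration. Assignment of pages to clusters follows [17] and is not shown.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def relevancy(query: str, document: str) -> int:
    """Count how many times the query terms occur in the document."""
    # Assumption: terms are lower-cased runs of word characters.
    terms = set(re.findall(r"\w+", query.lower()))
    counts = Counter(re.findall(r"\w+", document.lower()))
    return sum(counts[term] for term in terms)


def normalize_url(url: str) -> str:
    """Hypothetical duplicate key: host, path and query string only."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def rank_results(query, results):
    """Remove duplicate pages and rank the rest by relevancy.

    results is an iterable of (url, text) pairs gathered from several SEs.
    Returns a list of (url, score) pairs sorted by descending score.
    """
    best = {}
    for url, text in results:
        key = normalize_url(url)
        score = relevancy(query, text)
        # Keep the highest-scoring copy of each duplicate page.
        if key not in best or score > best[key][1]:
            best[key] = (url, score)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)
```

For example, with the query "meta search", a page whose text is "meta search engine search" receives a score of 3, and a second copy of the same page returned by another SE under "https://example.com/a" instead of "http://www.example.com/a/" is dropped. A single pass over the results with one dictionary lookup per URL keeps the cost linear in the number of returned pages, which is in line with the stated aim of reducing time and space complexity.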

CONCLUSION
An MSE architecture for better presentation of results has been proposed in this paper. The paper discussed the problems of existing MSEs (such as Clusty) as well as of MSEs described in the literature. A simpler relevancy calculation has also been suggested, which may reduce both time and space complexity. The proposed relevancy calculation is also expected to help detect more relevant results and to improve the speed of the overall system.

FUTURE WORK
The author is currently developing the proposed MSE architecture. The author is also investigating other issues that need further improvement, such as reducing the network load when an MSE uses multiple SEs, improving the presentation of results, and the naming convention for clusters. This work is expected to be completed in the near future. The results and conclusions will be compared with the other techniques used by existing MSEs.