Machine Learning Technique Based Annotation in Web Database Search Result Records with Aid of Modified K-Means Clustering (MKMC)

To reduce memory usage and increase the speed of access in web databases, this study introduces a machine learning based annotation technique that uses a modified K-means clustering algorithm to speed up the retrieval of search result records. The proposed AI based annotation method includes four stages: an alignment phase, score calculation, an annotation phase and an annotation wrapper generation phase. Each of these four stages is described in this study. The proposed technique aims to reduce memory consumption and increase the speed of access in a website. The proposed method is implemented in Java and the results are analyzed.


INTRODUCTION
The Internet plays a major role in the day-to-day life of human beings and has become an essential part of our lives. Techniques that help extract the data present on the web are therefore an interesting area of research (Buchner et al., 1999; Brin and Page, 1998). These techniques help to extract information from Web data, wherein at least one of structure or usage data is used in the mining process (Borges and Levene, 1998). Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
With the explosive growth of information sources available on the World Wide Web and the rapidly growing pace of adoption of Internet commerce, the Internet has evolved into a gold mine that contains or dynamically generates information that is beneficial to E-businesses (Cooley et al., 1997). A web site is the most direct link a company has to its current and potential customers. Companies can study visitors' activities through web analysis and find patterns in visitors' behavior (Cooley et al., 1999). Web analysis yields rich results for a company's data warehouses and offers great opportunities for the near future.
Web mining is divided into three categories: Web usage mining, Web content mining and Web structure mining (Yadav and Mittal, 2013). Web usage mining is the process of extracting useful information from server logs, i.e., users' history. It is the process of determining what users are looking for on the Internet. Some users might be looking at only textual data, while others might be interested in multimedia data. These technologies concentrate on the use of web technologies that could bring improvement (Masand et al., 2002; Spiliopoulou, 1999). Web structure mining is used to identify the relationship between Web pages linked by information or direct link connection. Spiders scan the Web sites to complete this task (Srivastava et al., 2005). The hyperlink hierarchy relates the information within the sites to competitor links and to connections through search engines and third party co-links (Srivastava and Mobasher, 1997). Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page contents (Kosala and Blockeel, 2000).
Web mining is the application of data mining techniques to automatically discover and extract useful information from World Wide Web documents and services (Etzioni, 1996). Although Web mining is deeply rooted in data mining, it is not equivalent to data mining. The unstructured nature of Web data makes Web mining more complex. Web mining research is a converging area of several research communities, such as Database, Information Retrieval and Artificial Intelligence (Mobasher et al., 2000), as well as psychology and statistics. The business benefits that web mining affords to digital service providers include personalization, collaborative filtering, enhanced customer support, product and service strategy definition, particle marketing and fraud detection (Abbott et al., 1998). In short, it gives providers the ability to understand their customers' needs and to deliver the best and most appropriate service to those individual customers at any given moment (Ting and Wu, 2009).
The requirement for predicting user needs in order to improve the usability and user retention of a Website can be addressed by personalizing it (Eirinaki and Vazirgiannis, 2003). Web personalization is defined as any action that adapts the information or services provided by a Web site to the needs of a particular user or a set of users, taking advantage of the knowledge gained from the users' navigational behavior and individual interests, in combination with the content and the structure of the Web site. The objective of a Web personalization system is to provide users with the information they want or need, without expecting them to ask for it explicitly. Web data are those that can be collected and used in the context of Web personalization (Mulvenna et al., 2000; Srivastava et al., 2000).
A Web mining approach to detect users accessing terrorist related information by processing all ISP traffic has been suggested (Elovici et al., 2004). Pages in a website whose location is different from where visitors expect to find them can also be detected automatically (Srikant and Yang, 2001). The key insight is that visitors will backtrack if they do not find the information where they expect it: the point from where they backtrack is the expected location for the page.
Objective of the study: A system used for storing information that can be accessed through a website is referred to as a 'web database'. Web databases serve a wide range of purposes. It is therefore important to design the database properly, which involves choosing the correct data type for each field in order to reduce memory consumption and to increase the speed of access. While small databases do not cause any significant problems, large web databases can grow to millions of entries and hence need to be well designed to work effectively. Thus the motive of our research is to reduce memory usage and increase the speed of access in a web database by developing a new annotation method.

LITERATURE REVIEW
Some recent work related to web mining is listed as follows.

Alkhattabia et al. (2011) proposed an evaluation model for information quality in e-learning systems based on a quality framework. In earlier work they had proposed a framework consisting of 14 quality dimensions grouped into three quality factors: intrinsic, contextual representation and accessibility. They implemented a goal-question-metrics approach to develop a set of quality metrics for the identified quality attributes within the proposed framework. The proposed metrics were computed to produce a numerical rating indicating the overall quality of the information published in a particular e-learning system. The data collection and evaluation processes were automated using a web data extraction technique and results on a case study were discussed. The assessment model could be useful to e-learning system designers, providers and users, as it provides a comprehensive indication of the quality of information in such systems.

Stevanovic et al. (2012) inspected the effects of applying seven well recognized data mining classification algorithms to static web server logs. The effects were examined for the purpose of classifying user sessions as belonging to either automated web crawlers or human visitors, and of identifying which of the automated web crawlers exhibit 'malicious' behavior and are potential participants in a Distributed Denial of Service (DDoS) attack. The classification performance was evaluated in terms of classification accuracy, recall, precision and F1 score. Seven of the nine vector features were borrowed from earlier studies on the classification of user sessions as belonging to web crawlers. Two novel web session features were introduced, i.e., the successive sequential request ratio and the standard deviation of page request depth. The effectiveness of the new features was evaluated in terms of the information gain and the gain ratio metrics. The experimental results showed the potential of the new features to improve the accuracy of data mining classifiers in identifying malicious and well-behaved web crawler sessions.

Velasquez (2013) presented an integrative approach based on the distinctive attributes of web mining in order to determine which techniques and uses are harmful. Within the legal framework applicable to privacy affairs between private parties, contractual remedies were considered the most adequate method of protecting users. The contract structure was suited to the specific characteristics of the mining project. Two basic categories of web mining projects were defined for illustration: projects based on the mining of web logs with the intention of improving the navigation experience within a certain web site, and the use of mining tools upon web data in order to make more complex inferences about an individual's attributes. For the first case, the publication of a clear privacy policy detailing both the purposes and the pattern extraction techniques was recommended. For the second case, the recommendations were substantially different. Finally, the web miner should take care not to use the available technology to obtain automated decisions about individuals regarding topics with a high social impact.

Arbelaitz et al. (2013) proposed a system that combines web usage and content mining techniques with three principal objectives: creating user navigation profiles to be used for link prediction; enriching the profiles with semantic information to diversify them, which offers the Destination Marketing Organization (DMO) a tool to introduce links that match the users' taste; and obtaining global and language dependent user interest profiles, which provide the DMO staff with important information for future web designs and allow them to design future marketing campaigns for specific targets. The system performed successfully: the obtained profiles matched the real user navigation sequences in more than 60% of cases and the user interests in more than 90% of cases. In addition, the automatically extracted semantic structure of the website and the interest profiles were validated by the Bidasoa Turismo Website (BTW) DMO staff.

Lu et al. (2013) presented an automatic annotation approach for web mining applications. The approach first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. An annotation wrapper for the search site is then automatically constructed and used to annotate new result pages from the same web database. Their experimental results showed the approach to be highly effective.

MATERIALS AND METHODS
In our proposed method, a set of SRRs (Search Result Records) is first extracted from a result page of a WDB (Web Database). Once the SRRs are extracted, the similarity of data units (a data unit in this study is a piece of text that represents a concept of an entity) is found for all search records based on five features (data content, presentation style, data type, tag path and adjacency). As soon as the similarity is found for every data unit, the data units of the same concept are aligned into one group. The alignment is done by the modified K-means algorithm. Once the data units of the same concept are arranged into one group, labels are assigned to each data unit using score values based on the title calculation, content based calculation, domain calculation and position calculation, and the best label for each group is selected by the ANN (Artificial Neural Network) method. Finally, an annotation wrapper phase is carried out. An annotation wrapper is simply a set of rules designed for each concept, which describes how to extract the data units of the same concept from the result page and what the appropriate semantic label for them is. The structural diagram of our proposed method is presented in Fig. 1.
Given below are some of the search results for the book query "robots".

Alignment phase:
The processes carried out in the alignment phase are data feature extraction and data clustering, which are described in the sections below.

Fig. 1: Block diagram of the proposed method

Data features extraction: Data unit similarity:
The data unit similarity is computed over the search results obtained so that data units of the same concept can be aligned into a single group. The similarity between two data units $du_1$ and $du_2$ is found from five features (data content, data type, tag path, adjacency and presentation style). The data content similarity between two data units is given by:

$$S_c(du_1, du_2) = \frac{FV_{du_1} \cdot FV_{du_2}}{\lVert FV_{du_1} \rVert \, \lVert FV_{du_2} \rVert} \quad (2)$$

In the above equation, $FV_{du}$ is the frequency vector of the terms of data unit $du$, $\lVert FV_{du} \rVert$ is the length of $FV_{du}$ and the numerator is the inner product of the two vectors.
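As a rough illustration of the data content similarity, the following Python sketch computes the cosine similarity between the term frequency vectors of two data units. The whitespace tokenization, the lower-casing and the function name `content_similarity` are assumptions made for this example; the study does not specify how terms are extracted.

```python
from collections import Counter
from math import sqrt


def content_similarity(text1, text2):
    """Cosine similarity between the term frequency vectors of two data units."""
    # Assumption: terms are lower-cased, whitespace-separated tokens.
    fv1 = Counter(text1.lower().split())
    fv2 = Counter(text2.lower().split())
    # Inner product of the two frequency vectors (the numerator).
    dot = sum(fv1[term] * fv2[term] for term in fv1)
    # Lengths (Euclidean norms) of the two frequency vectors.
    norm1 = sqrt(sum(c * c for c in fv1.values()))
    norm2 = sqrt(sum(c * c for c in fv2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty data unit shares no terms with anything
    return dot / (norm1 * norm2)
```

Two data units with the same terms score close to 1, and units with no term in common score 0.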

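Data type Similarity (Sd):
The data type similarity between two data units $du_1$ and $du_2$ is computed from the data types of the two units.

The presentation style similarity between two data units $du_1$ and $du_2$ is computed from the scores $MS_i$, where $MS_i$ is the score of the $i$-th style feature, obtained by comparing the $i$-th style feature of each data unit $du$.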
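The presentation style comparison can be illustrated with a small Python sketch. It assumes, as a simplification that the surviving text does not confirm, that each of the six style features (font weight, font size, font color, font face, text decoration and italic font) scores 1 when the two data units agree and 0 otherwise, and that the similarity is the mean of these scores. The dictionary keys and the function name are hypothetical.

```python
# Hypothetical keys for the six style features named in the study.
STYLE_FEATURES = (
    "font_weight",
    "font_size",
    "font_color",
    "font_face",
    "text_decoration",
    "italic",
)


def style_similarity(style1, style2):
    """Fraction of the six style features on which two data units agree.

    Assumption: the score of a feature is 1 on an exact match and 0 otherwise.
    """
    scores = [1 if style1.get(f) == style2.get(f) else 0 for f in STYLE_FEATURES]
    return sum(scores) / len(STYLE_FEATURES)
```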

Tag path Similarity (St):
The tag path similarity between two data units $du_1$ and $du_2$ is computed from the tag paths of the two units. For the adjacency feature, $du^p$ and $du^s$ denote the preceding and succeeding data units of $du$.

MODIFIED K-MEANS CLUSTERING ALGORITHM
Once the data unit similarity is obtained for all data units using the data unit similarity formula, the data units of the same concept are clustered into a group by means of the modified K-means clustering algorithm. The modified K-means clustering algorithm proceeds as follows:
• Assume that n data units are given to the clustering process. The data units to be clustered are denoted $N_i$ and the randomly chosen centroids are denoted $K_j$
• The difference between each centroid and every data unit is computed and compared with the user defined threshold value
• The data units whose distance values are less than the defined threshold value are stored in the cluster of the corresponding centroid
• The positions of the k centroids are recalculated
• Steps 3 and 4 are repeated until the centroids become fixed
At the end of this process, the data units of the same concept are arranged into a group.
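The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: data units are represented as numeric feature vectors, the distance is Euclidean, a unit within the threshold of several centroids goes to the nearest one, and units farther than the threshold from every centroid stay unassigned. The optional `initial_centroids` parameter is a hypothetical addition that makes runs reproducible.

```python
import math
import random


def modified_kmeans(points, k, threshold, initial_centroids=None, max_iter=100, seed=0):
    """Threshold-based K-means sketch on numeric feature vectors (tuples of floats)."""
    rng = random.Random(seed)
    # Randomly chosen centroids K_j, unless the caller fixes them.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Distance from this data unit to every centroid.
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(k), key=dists.__getitem__)
            # Keep the unit only if it is closer than the user defined threshold.
            if dists[j] < threshold:
                clusters[j].append(p)
        # Recalculate each centroid as the mean of its members (empty clusters keep theirs).
        new_centroids = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # the centroids have become fixed
        centroids = new_centroids
    return centroids, clusters
```

Because of the threshold, an outlying data unit is left out of every cluster instead of pulling a centroid toward it, which is the main behavioral difference from plain K-means in this sketch.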

Title based calculation:
For each document (link) there must be a title, on which the calculation is based as detailed below. After separating the query words and finding the meanings of each of them, we compare them with the titles of the unique links separately to find the frequency of the words, which is shown in Table 1. Table 1 is explained as follows: $du_1, du_2, \ldots, du_n$ represent the separated words of the given query, where n is the number of separated words in the query; $du_1 N_1$ is the first meaning N of the word $du_1$ and $du_n N_b$ is the b-th meaning N of the n-th separated word du of the query. $TE_1 du_1$ represents the number of times the first separated word is present in the title of the first unique link, whereas $TE_1 du_1 N_1$ represents the number of times the first meaning of the first separated word is present in the title of the first unique link.

The title based value of each unique link is calculated from the following quantities: $TE_s du_i$ is the number of occurrences of the i-th query word in the title TE of the s-th link; $\max(TE\,du_i)$ is the maximum number of occurrences of the i-th query word in the titles of all unique links; $TE_s du_i N_j$ is the number of occurrences of the j-th meaning of the i-th query word in the title TE of the s-th link; $\max(TE\,du_i N_j)$ is the maximum number of occurrences of the j-th meaning of the i-th query word in the titles of all unique links; n is the total number of query words; b is the total number of meanings of the i-th query word; $w_Q$ is the weight value of the query word and $w_N$ is the weight value of the meaning word of the query word.
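One way the quantities defined above could be combined is sketched below in Python. The exact equation is not recoverable from the text, so the form used here is an assumption: each frequency is normalized by its maximum over all unique links, query words are weighted by $w_Q$, meanings by $w_N$, and the terms are summed. The function and parameter names are hypothetical.

```python
def title_score(word_counts, meaning_counts, max_word, max_meaning, w_q, w_n):
    """Assumed title based value of one link from normalized, weighted frequencies.

    word_counts[i]       -- occurrences of the i-th query word in this link's title
    meaning_counts[i][j] -- occurrences of the j-th meaning of the i-th query word
    max_word[i]          -- maximum of word_counts[i] over all unique links
    max_meaning[i][j]    -- maximum of meaning_counts[i][j] over all unique links
    """
    score = 0.0
    for i, count in enumerate(word_counts):
        if max_word[i]:  # skip words that occur in no title at all
            score += w_q * count / max_word[i]
        for j, m_count in enumerate(meaning_counts[i]):
            if max_meaning[i][j]:
                score += w_n * m_count / max_meaning[i][j]
    return score
```

The same shape would apply to a content based score, with counts taken from the link contents instead of the titles.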

Content based calculation:
In the content based calculation we compare the contents of each link with the separated query words and their synonyms to count the occurrences of the separated query words and their synonyms in the contents of each link. Table 2 shows the number of occurrences of the query words and their synonyms in the contents of each link. Table 2 is explained as follows: $CE_1, CE_2, \ldots, CE_s$ represent the contents of the unique links, $CE_s du_n$ represents the number of occurrences of the n-th query word in the content of the s-th unique link and $CE_s du_n N_b$ the number of occurrences of the b-th synonym of the n-th query word in the content of the s-th unique link. The content based value is calculated from the following quantities: $CE_s du_i$ is the number of occurrences of the i-th query word du in the content of the s-th unique link; $\max(CE\,du_i)$ is the maximum number of occurrences of the i-th query word du in the content of the s-th unique link; $CE_s du_i N_j$ is the number of occurrences of the j-th synonym of the i-th query word du in the content of the s-th unique link; and $\max(CE\,du_i N_j)$ is the maximum number of occurrences of the j-th synonym of the i-th query word du in the content of the s-th unique link.

Domain calculation: Each link obtained from the different search engines invariably comes under a specific domain name. An example of such a domain name is 'Wikipedia'. We calculate the domain value for each unique link using the domain name found for each link in the different search engines. In the equation for the domain value, $DO_s(p)$ represents the calculated domain value of the s-th unique link, m the number of search engines used and $acc_s$ the number of unique links with the same domain name. For example, if we have ten unique links of which five are from the same domain, then for any one of those five links the $acc_s$ value is five.
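The count $acc_s$ from the example above can be obtained with a short Python sketch. It only covers the counting step, not the full domain value, and it assumes that the domain name of a link is the host part of its URL; the function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def same_domain_counts(links):
    """For each unique link, count the unique links that share its domain name."""
    # Assumption: the domain name is the URL host, compared case-insensitively.
    domains = [urlparse(link).netloc.lower() for link in links]
    frequency = Counter(domains)
    return [frequency[d] for d in domains]
```

With ten unique links of which five share one domain, each of those five links gets the value five, as in the example given in the text.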
Position calculation: This calculation is based on the ranking of the link in the different search engines we have used, i.e., the position at which the link appears in each search engine chosen for our process. In the formula for the position of a link, $PS_s(p)$ represents the position value of the link, m the number of search engines used, k the number of links taken for our process from each search engine and $PS(p)$ the rank of a link in a particular search engine.
Annotation phase: The annotation phase is carried out by a neural network training process, which is detailed in the section below.

Neural network training:
Once the title based calculation, content based calculation, domain calculation and position calculation have been carried out, labels are assigned using these score values and the appropriate label is selected using the ANN method. Fig. 2 shows the neural network of our process. In Figure 4, the output of node (neuron) K is formed from the neurons I and J. Let K be the output layer and I, J the hidden layers. First we calculate the error in the output from K, where $er_K$ represents the error from the node K, $O_K$ the output from the node K and T the target value. After obtaining the error for the hidden layer, we find the new weight values between the input layer and the hidden layer. By repeating this method we train the neural network. Subsequently, we give the query to the system, which merges the unique links from the different search engines and ranks them based on the trained neural network, using the score generated by the neural network for each unique link. Figure 4 shows the algorithm of our proposed technique.
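A single training step of this kind can be sketched in Python as follows. The sketch uses the textbook back propagation rule for sigmoid units, which is an assumption: the study's own error and weight update equations are not recoverable from the text. Under that assumption the output error is $O_K(1 - O_K)(T - O_K)$ and each hidden error is the hidden output times one minus itself, times the output error, times the connecting weight. The learning rate `lr`, the network size and all names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One back propagation step for a network with one hidden layer and one output."""
    # Forward pass: hidden outputs, then the single output O_K.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    # Error at the output node K (sigmoid derivative times target difference).
    err_out = out * (1.0 - out) * (target - out)
    # Errors at the hidden nodes, propagated back through the current output weights.
    err_hidden = [h * (1.0 - h) * err_out * w for h, w in zip(hidden, w_out)]
    # New weights between hidden and output layer, then between input and hidden layer.
    new_w_out = [w + lr * err_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w + lr * eh * xi for w, xi in zip(ws, x)]
        for ws, eh in zip(w_hidden, err_hidden)
    ]
    return new_w_hidden, new_w_out, out
```

Here the input `x` could hold the four scores of a link (title, content, domain and position), and repeating `train_step` moves the output toward the target value.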
Annotation wrapper: After the best label for the given query in a WDB has been selected by the ANN process, the annotation wrapper process is carried out. An annotation wrapper is a set of annotation rules for all the attributes on the result page, ordered to correspond to the ordered data unit groups. Each annotation rule specifies a label together with the prefix and suffix that identify the data units of that concept. To use the wrapper to annotate a new result page, for each data unit in an SRR, the annotation rules are applied one by one in the order in which they appear in the wrapper. If the data unit has the same prefix and suffix as specified in a rule, the rule is matched and the unit is labeled with the label given in the rule. The annotation wrapper is created so that new search result records can be annotated without reapplying the entire annotation process.
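Applying a wrapper to a data unit can be sketched in Python as below. The sketch assumes that a rule consists of exactly a label, a prefix and a suffix and that matching is exact string equality; the class names, field names and sample values are hypothetical.

```python
from typing import List, NamedTuple, Optional


class AnnotationRule(NamedTuple):
    label: str
    prefix: str
    suffix: str


class DataUnit(NamedTuple):
    text: str
    prefix: str  # text or markup immediately before the unit in the SRR
    suffix: str  # text or markup immediately after the unit in the SRR


def annotate(unit: DataUnit, wrapper: List[AnnotationRule]) -> Optional[str]:
    """Apply the rules in wrapper order and return the label of the first match."""
    for rule in wrapper:
        if unit.prefix == rule.prefix and unit.suffix == rule.suffix:
            return rule.label
    return None  # no rule matched, so the unit stays unlabeled
```

A new result page is annotated by calling `annotate` on each data unit of each SRR, so the alignment and scoring phases do not need to be run again.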

RESULTS AND DISCUSSION
The proposed method is implemented in Java. The performance of the proposed method is compared against the performance of the existing method. The performance of both methods is evaluated for various domains (music, job, book, game and movie) using various annotators and various calculations. The results given in Tables 3 and 4 allow the performance of the proposed method to be analyzed.
Discussions: Tables 3 and 4 illustrate the performance of the existing method and the proposed method. In Table 3, FA, QA, IA and CA represent the frequency annotator, query annotator, in-text prefix/suffix annotator and common knowledge annotator. In Table 4, TC, DC, PS and CC represent the title based calculation, domain based calculation, position based calculation and content based calculation. Table 3 gives the average performance of the four annotators, which is compared against the performance of the four calculations of the proposed method. From the values given in the tables, two graphs are plotted to compare the precision and recall of the proposed method and the existing method. From the graphs given in Fig. 5 and 6, it is clear that the precision and recall of our proposed method are higher than those of the existing method in all cases. Thus the proposed method performs better than the existing method. Since its precision and recall are higher than those of the existing method, the proposed method is well capable of labeling the records in a search engine and thus of reducing the time consumed in searching for a particular file.
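For reference, precision and recall for an annotation task can be computed as in the Python sketch below. The definitions used are the common ones and are an assumption here, since the study does not state its own: precision is the share of assigned labels that are correct, and recall is the share of data units with a known correct label that received it. The function name and data layout are hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted labels against gold labels.

    predicted -- dict mapping data unit id to assigned label (None if unlabeled)
    gold      -- dict mapping data unit id to the correct label
    """
    assigned = {unit: label for unit, label in predicted.items() if label is not None}
    correct = sum(1 for unit, label in assigned.items() if gold.get(unit) == label)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```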

CONCLUSION
The main aim of our study is to provide a better annotation method for web database records. Although various annotation methods exist in the literature, better annotation performance is needed, since current web databases hold millions of entries and the number is increasing day by day. Hence, we have proposed a new annotation method based on an AI technique that performs the annotation with different numbers of training sites. The SRRs (Search Result Records) from different websites are taken and annotated using the proposed method and the existing method. The precision and recall of the results are taken as the output for both methods and their performance is analyzed. As seen from the results, in most cases the performance of the proposed method is better than that of the existing method for both precision and recall. Thus, we can conclude that the proposed method is well capable of annotating web database records.
Presentation Style (Sp): Presentation style consists of six style features: font weight, font size, font color, font face, text decoration and italic font. The presentation style similarity between two data units $du_1$ and $du_2$ is computed over these six features.

Fig. 2: Example search results from amazon.com

In Figure 3, $W_{xy}$ represents the weight values between the input layer and the hidden layer, $W_{yz}$ the weight values between the hidden layer and the output layer and O the output of the neural network. The neural network is trained based on the weight values, which are adjusted according to the error obtained. The error is calculated as the difference between the target value and the output obtained from the neural network. The target value is based on the user ranked list and the weight values are adjusted by the back propagation algorithm. It is explained as follows: initially the weights in the neural network are random numbers and the output of the neural network for a given input is based on these weight values. Figure 3 shows a sample connection in the neural network for learning the back propagation algorithm.

Table 1: Frequency of query words and meanings of query words in the title