Performance of Using Tag-based Feature Sets in Web Page Classification

As the Web is a large collection of data that grows daily, an automatic Web page classification mechanism is needed to reach useful information effectively. The majority of Web pages are HTML documents; therefore, the aim of this study is to explore the effect of HTML tags on the classification process and to determine the most valuable HTML tags for feature extraction. To achieve this goal, we employ 13 different datasets and 5 popular classifiers: SVM, naive Bayes (NB), kNN, C4.5, and OneR. The statistical analysis shows that the features extracted solely from the anchor, title, or text tags can be used as an alternative to the features extracted from the whole Web page. SVM is the best among the classifiers used in this study. Using the HTML tags for feature extraction improves classification accuracy.


Introduction
The Web is a large collection of documents of various kinds. Many people use the Internet to find and gather information on certain topics. However, it is not easy to reach the desired information by using standard search engines. Possible reasons for this problem are [1]:
1. The number of Web pages is increasing exponentially; hence, it is difficult to keep the index of search engines up to date.
2. When a user seeks information on a search engine, too many irrelevant pages containing the search terms are presented.
In order to overcome these search problems, accurate classifiers which can assign correct class labels to Web pages are needed [2].
Nowadays most Web pages are written in HTML (HyperText Markup Language), which consists of tags indicating the structure of texts. Those pages include not only plain text but also hyperlinks and multimedia information (e.g., images, animations, sounds). Because of this complex structure, Web page classification confronts more difficulties and challenges than text classification [3]. In this study our aim is to investigate the effects of using HTML tags on the classification performance of Web pages. The majority of the previous studies on Web page classification have ignored HTML tags and treated the problem as plain text classification. Only some of the studies [1, 2, 4-16, 24-29] have used feature extraction methods which involve HTML tags.
Although HTML tags are considered during feature extraction, none of the previous studies have made an extensive analysis on the effect of each HTML tag separately.
There are in principle three kinds of HTML tags: logical, physical, and meta-tags [4]. The physical tags are related to the formatting of the text, such as bold or italic; the logical tags carry richer semantic meaning, like headlines or anchors; and the meta-tags give information about a document [4,5]. Thus, as a whole, these tags provide information about the content of a document. Unfortunately, the HTML tags are omitted in many studies [17,18,19].
Those studies count only the frequencies of terms in Web pages, without making any distinction with respect to the HTML tags; this feature extraction approach is called "bag-of-words" [6].
In this work, we use only logical and physical tags and omit meta-tags, because in the majority of Web pages meta-tags are either left empty or include terms that are unrelated to the content of the page in order to increase the ranking score assigned by search engines [20,21]. We focus on the text content of Web pages and do not consider hyperlink structure or multimedia information. We extract features only from a set of logical and physical HTML tags by using each stemmed term together with its associated HTML tag as a feature; therefore, identical terms in different tags are deemed different features. This kind of feature extraction is named the "tagged-terms" method in [6]. In this study, by using the tagged-terms method, we investigate the effect of each tag on Web page classification accuracy. We compare the performance of the Naïve Bayes (NB), decision tree (C4.5), k-nearest neighbour (kNN), rule based (OneR), and support vector machine (SVM) classifiers on feature sets that are extracted by using different HTML tags, and repeat the experiments on different datasets to find out the effects of different HTML tags on classification accuracy.
Web page classification (or categorization) is the process of assigning a class label from a set of predefined categories to a Web page [22]. Web page classification is a kind of text classification task; however, it has been demonstrated in many studies that using the information derived from HTML tags can increase the classifier's accuracy. In an early study, Golub and Ardo [7] determined the significance of different parts of a Web page for automated classification. They used four elements of a Web page: title, headings, metadata, and main text. The experimental analysis showed that using all of these elements is necessary for automated Web page classification, since only some of these elements occur at the same time on Web pages.
Later, Ru and Horowitz [23] presented a method for automated classification of HTML forms. Algorithms were developed for automatic feature generation from HTML forms, and a neural network was applied for classification. The <form> tag was used for feature extraction, and high classification accuracy was observed.
Another study that involves HTML tags for Web page classification belongs to Yang, Slattery, and Ghani [8], who concluded that the HTML tags in hypertext pages improve classification performance when considered jointly with the text contained in the Web pages. In [24], it is demonstrated that an SVM classifier using the text on the target page, the page title, and the anchor text from parent pages can improve classification compared with a pure text classifier.
Fresno, Martinez, Montalvo and Casillas [9] proposed an NB based Web page classification system which uses HTML mark-up information to find the term relevance in a Web page. The experiments showed that Gaussian models give better accuracy than event models when enriched representations are considered.
According to [2], a new feature set based on the hierarchical structure of the headings appearing in a Web page enhances classification performance. The weights for the words appearing in the heading tags are assigned according to their level in the hierarchy. As a result, it was found that the hierarchical structure of headings has a high impact on classification performance.
Kim and Zhang [5] proposed a method to learn the internal structure of HTML documents by using genetic algorithms. The proposed algorithm learns the important factors of the HTML tags, which are then used to re-rank the documents retrieved by standard weighting schemes. The results indicate that the proposed approach significantly improves retrieval accuracy.
Xue, Bao, Huang and Lu [3] studied several key aspects of the SVM for Web page classification. For feature extraction, a set of commonly used features of Web pages, such as body, title, headings, and meta-tags, is used. They concluded that a composite of plain text and HTML structure gives better classification performance.
Werner, Böttcher and Beckmann [4] presented an approach which uses the HTML tags to improve the quality of the classification. The developed classification system uses changes in the typographical style of an HTML document; therefore, one can detect the parts of the document that are emphasized by the HTML page developer. These emphasized parts are given greater weight, which leads to a significant improvement in the classification of documents.
In another study [6], a genetic algorithm (GA) based Web page classification system was developed which uses both the HTML tags and the stemmed terms belonging to each tag as features for classification. The proposed system learns the best weights for each feature by the GA, and the experimental evaluation showed that using the HTML tagged-terms as features increases the classification accuracy with respect to using terms alone.
Belmouhcine et al. [10] proposed an approach which classifies Web pages by using plain text and the text between the HTML tags. In the first step of the method, an SVM implementation is used to generate a reduced vector representation based on plain text and text from the HTML tags. In the second step, the NB algorithm is used to determine the class of the Web page. The experiments showed that using the combination of HTML tags with plain text increases the performance of the NB classifier.
Saraç and Özel [11] used the firefly algorithm to find the best features for Web page classification. The features are extracted from the URL and the <title> tag, and the Web pages are classified without loss of accuracy.
In another study of Saraç and Özel [12], the ant colony optimization algorithm was applied to reduce the number of features used for Web page classification. After the experimental evaluations it is concluded that using the URL and <title> tags for feature extraction gives a good classification performance with respect to that of the bag-of-words method.
In [13], Meshkizadeh and Rahmani illustrated that using the HTML tags and URL features of a Web page along with features of sibling pages, with NB as the classifier, could increase the classification accuracy. Jeong et al. [14] developed a method for extracting the title of a Web page by using anchor tags. They verified that the accuracy of the classifier increases when anchor tag information is used.
Bhalla and Kumar [24] employed HTML tags to extract features from Web pages and applied SVM for classification. Experimental evaluation showed that the tag based feature extraction gives satisfactory performance.
Navadiay, Parikh and Patel [25] focused on Web page classification based on a combination of the content and the structure of a Web page. They used the same feature extraction algorithm as that used by Özel [6]. The results indicated that the NB performs well for Web page classification when the combination of HTML tag and term is used as a feature.
In [26], Sarhan, Hamissa and Elbehiry proposed two algorithms, which they called the "Important HTML tags only algorithm" and the "Weighted Important HTML tags only algorithm". They compared these algorithms with the traditional feature selection approach (i.e., using bag-of-words). They used two well-known classifiers, SVM and NB, to classify the Web pages by using the features selected by these algorithms. As a result, they showed that using the proposed algorithms improves the accuracy of the classifiers.
In our recent study [15] we used 6 HTML tag sets in the tagged-terms feature extraction method and performed experiments on 9 different datasets using 4 classifiers. We concluded that C4.5 ...
In a more recent study [27], Thanasopon et al. focused on text mining and aimed to detect the most popular online trends. While extracting the topics, they used TF-IDF and an HTML score. They assumed that words in certain tags are more related to the main concept than the others. For this purpose, the weights of words in tags such as <h1> and <b> are increased. By using this term extraction method, they conducted experiments on a popular discussion forum and concluded that the SVM classifier outperformed the other classifiers.
As summarized above, we have reviewed most of the previous studies on Web page classification that involve HTML tags. Although physical HTML tags are generally used to form the appearance of text in a browser, they provide important clues about the topic, theme, and genre of the Web page, as shown in the previous studies. Therefore, utilizing HTML tags for the classification of Web pages improves the accuracy of the classifiers, as shown in the previous studies [4,6,13,26]. However, except for our recent study, none of the previous studies has made a comparison among HTML tags to use in feature extraction. In this study, our aim is to make an extensive experimental evaluation of the effect of each HTML tag on Web page classification and to determine which HTML tag(s) should be considered for feature extraction. To reach our goals we use 13 distinct datasets, whereas the other studies have used only a few datasets. We investigate the effects of each of the <title>, <h1>, <h2>, <h3>, <a href=…>, <em>, <strong>, <b>, <i>, <p>, and <li> tags and compare them with the traditional bag-of-words and tagged-terms methods. We repeat our experiments with five classifiers (SVM, NB, C4.5, kNN, and OneR) to also show the combined effects of classifiers and HTML tags, whereas previous studies have used only a few classifiers. We perform statistical analysis to determine the best methods. To our knowledge, no one has applied such statistical methods in their studies. Therefore, our study will be helpful to researchers and practitioners who work in the area of Web page classification and information extraction from Web pages, by indicating which HTML tags can give more valuable features for classification, which classifier performs better, and how feature extraction methods and classifiers interact.
The rest of the paper is organized as follows. In the following section, we describe our feature extraction method, the datasets used in the experiments, and the evaluation metrics. Section 3 presents the experimental results and a discussion of them. Finally, Section 4 gives conclusions and some future work that we plan to perform.

Material and Method
The block diagram of the methods applied in this study is presented in Figure 1. As shown in the figure, each dataset is first partitioned into training and test sets. We use the training set to extract features and learn a classification model. The extracted features are then used to compute document vectors for each Web page in the training and test sets. After these steps, the documents in the test set are assigned class labels by the learned classifier. Finally, the accuracy of the classification task is computed. These steps are repeated four times for each feature extraction method, classifier, and dataset, as we apply 4-fold cross validation. After that, we apply statistical analysis to show the effects of using HTML tags in classification. The methods and datasets used in this study are explained in detail in the following subsections.
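The 4-fold partitioning step can be sketched as follows. This is a minimal illustration only: the function name, the random shuffling, and the seed are assumptions, and the folds actually used in the study (for example, those defined by the WebKB project) may be formed differently.

```python
import random


def four_fold_splits(items, k=4, seed=0):
    """Yield (train, test) lists for k-fold cross validation.

    Assumption: folds are formed by a seeded random shuffle; the study
    does not state how pages were assigned to folds for every dataset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


pages = [f"page{i}" for i in range(10)]  # hypothetical page identifiers
for train, test in four_fold_splits(pages):
    print(len(train), len(test))
```

Each page appears in exactly one test fold, so every page is classified once per feature extraction method and classifier.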

Datasets
In this study we perform binary classification, as it is used by many focused crawlers to improve the search performance of search engines. Binary classification tries to determine whether a Web page is in the class of interest or not. Therefore we prepare 13 binary classification datasets from the publicly available WebKB, Benchmark, and SyskillWebert datasets, as well as a manually collected Conference dataset. We apply 4-fold cross validation. The number of instances in the first fold of each dataset is listed in Table 1. These numbers are very similar for the other folds, and to save space they are not listed in this paper. The details of each dataset are given in the subsections below.

Conference dataset
The Conference dataset consists of Computer Science related conference homepages. This dataset was manually collected and used in [6,28]. The names of the conferences in the dataset are obtained from the DBLP Computer Science Bibliography (http://www.informatik.uni-trier.de/~ley/db/), and these names are then queried by using the Google search engine (http://www.google.com). The conference homepages in the query results are labelled as positive documents, and the pages that include information similar to conference homepages but are irrelevant are labelled as negative documents. Then, all the positive and negative documents are randomly distributed among the training and test sets. The Conference dataset contains 2369 Web pages in total (824 positive and 1545 negative documents).

WebKB dataset
The WebKB dataset was prepared by the WebKB project at CMU [29]. The dataset consists of Web pages collected from Cornell, Texas, Washington, and Wisconsin Universities, and the pages are classified into seven categories. We use a subset of the WebKB dataset (i.e., only the student, faculty, course, and project category pages) because these categories have more instances than the remaining ones. For each category, we generate a binary classification dataset; therefore we obtain the Course, Student, Faculty, and Project datasets. For each dataset, we use the "others" category of the WebKB dataset as negative class instances. As an example, the Course dataset contains Computer Science related course homepages and some irrelevant Web pages from the "others" category of WebKB, and has 4694 Web documents in total. 4-fold cross validation is applied as described on the WebKB project Web site [30].

Benchmark dataset
The Benchmark [31] is a dataset of 11,000 Web documents pre-classified into 11 equally-sized categories, each containing 1,000 Web documents. It was generated by Sinka and Corne, with the main aim of proposing a general dataset for Web document clustering and similar experiments. The Benchmark dataset consists of four main themes, namely "Banking & Finance", "Programming Languages", "Science", and "Sport". From each theme, we chose one class. These are "Commercial Banks", "C/C++", "Biology", and "Motor Sport". Negative pages are selected randomly from the remaining seven classes. Therefore we obtain the "Commercial Banks", "Programs", "Biology", and "Motor Sport" datasets, each containing 4500 documents in total. Then, we apply 4-fold cross validation.

SyskillWebert dataset
The SyskillWebert dataset [32] has a structure similar to that of the WebKB dataset. It contains the HTML source of Web pages. The Web pages are on four separate subjects: Bands (recording artists), Goats, Sheep, and Biomedical. All four subjects are included in our study, and 4-fold cross validation is applied.

Proposed feature extraction methods
As our aim is to evaluate the effect of each HTML tag on the performance of Web page classification and to determine which HTML tag covers valuable features, we use the terms that are surrounded by HTML tags as features, and propose 8 feature extraction methods.
We use the <title>, <h1>, <h2>, <h3>, <a href=…>, <em>, <strong>, <b>, <i>, <p>, and <li> HTML tags, as well as the text content, to extract features. We choose these tags because in [1, 2, 4-10, 13, 14] it is observed that these tags include the most useful information. We group some of the related tags given above in order to reduce the feature space. The tags <h1>, <h2>, <h3> are grouped together as "header"; <b>, <strong>, <i>, <em> are grouped as "bold"; and <p> and the text content are grouped as "text" features. We take the <a href=…>, <li>, and <title> tags separately and call them "anchor", "list", and "title" features, respectively. Therefore we have 6 HTML tags (or tag groups) that are used for feature extraction. For each of these HTML tags or tag groups, all the terms that belong to the tag or tag group are taken; the stopwords are removed from the extracted terms; Porter's stemming algorithm [33] is applied; and each stemmed term for each tag or tag group forms a feature. Therefore, we collect the anchor, bold, header, title, list, and text feature sets for each dataset, and use each feature set separately.
In the seventh feature extraction method, we use these term-tag pairs to form another feature set named tagged-terms. Rather than using each tag group alone, we use all the terms from all the tag groups, such that a term can be in the feature list several times because every term is used with its corresponding tag (e.g., the word "course" in the <title>, <li>, and <b> tags is considered as three different features).
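The tag grouping and the tagged-terms idea can be illustrated with a short sketch built on Python's standard `html.parser`. This is not the implementation used in the study: the class name, the tiny stopword list, and the handling of unclosed or nested tags are assumptions, and Porter stemming is omitted to keep the example free of external libraries.

```python
from html.parser import HTMLParser
import re

# Tag-to-group mapping taken from the grouping described in the text.
TAG_GROUPS = {
    "title": "title", "a": "anchor", "li": "list",
    "h1": "header", "h2": "header", "h3": "header",
    "b": "bold", "strong": "bold", "i": "bold", "em": "bold",
    "p": "text",
}
# Tiny illustrative stopword list (an assumption; the study's list is not given).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}


class TaggedTermExtractor(HTMLParser):
    """Collects (tag group, term) pairs; text outside listed tags counts as "text"."""

    def __init__(self):
        super().__init__()
        self.open_groups = []  # tag groups that are currently open
        self.features = []     # extracted (group, term) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not any(name == "href" for name, _ in attrs):
            return  # only <a href=...> anchors are used
        if tag in TAG_GROUPS:
            self.open_groups.append(TAG_GROUPS[tag])

    def handle_endtag(self, tag):
        group = TAG_GROUPS.get(tag)
        if group in self.open_groups:
            # Close the most recently opened tag of this group.
            last = len(self.open_groups) - 1 - self.open_groups[::-1].index(group)
            del self.open_groups[last]

    def handle_data(self, data):
        group = self.open_groups[-1] if self.open_groups else "text"
        for term in re.findall(r"[a-z]+", data.lower()):
            if term not in STOPWORDS:
                self.features.append((group, term))


def extract_tagged_terms(html):
    parser = TaggedTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


page = ('<title>Course Home</title><h1>Course</h1>'
        '<p>Read the <b>course</b> notes and '
        '<a href="syllabus.html">syllabus</a></p>')
tagged = extract_tagged_terms(page)          # tagged-terms features
bag_of_words = [term for _, term in tagged]  # same terms with tags dropped
title_only = [t for g, t in tagged if g == "title"]
```

In this sketch the word "course" yields three distinct tagged-terms features (title, header, and bold), while the bag-of-words view sees the same term three times. Filtering `tagged` by group gives the single-tag feature sets.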
Finally, we use the bag-of-words method to form a different set of features. In the bag-of-words method all the HTML tags are removed and the remaining pure text is used. In this method there is no distinction between the words with respect to HTML tags. As in the tagged-terms feature extraction method, the stopwords are removed, and the remaining terms are stemmed according to Porter's stemming algorithm [33].
All of the above mentioned feature extraction methods are applied to the positive instances in the training part of each dataset. As most of the datasets used are not balanced and have a higher number of negative instances, extracting features from the positive instances gives higher accuracy and produces a lower number of features, as we observed in our previous study [6]. After extracting the features as described above, document vectors for the training and test sets are created by using the term frequencies in the associated Web page. Then these document vectors are normalized according to the document lengths.
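A minimal sketch of building the feature list from positive training pages and computing length-normalized term-frequency vectors is given below. The function names are hypothetical, and the normalization (dividing each term frequency by the number of terms in the document) is an assumption, since the text does not specify the exact formula.

```python
from collections import Counter


def features_from_positives(train_docs):
    """train_docs: list of (terms, label) pairs, with label 1 for positive pages."""
    return sorted({t for terms, label in train_docs if label == 1 for t in terms})


def document_vector(doc_terms, feature_list):
    """Term frequencies over feature_list, divided by the document length.

    Assumption: "document length" is the total number of terms in the page.
    """
    counts = Counter(doc_terms)
    length = len(doc_terms)
    if length == 0:
        return [0.0] * len(feature_list)
    return [counts[f] / length for f in feature_list]


# Hypothetical, already stemmed toy documents.
train = [(["cours", "exam", "cours"], 1), (["goat", "cours"], 0)]
features = features_from_positives(train)
vector = document_vector(["cours", "cours", "exam", "room"], features)
```

Terms that occur only in negative pages (here "goat") never enter the feature list, which is how restricting extraction to positive instances keeps the feature space small.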

Feature reduction
The number of features obtained by using the proposed methods is very large for some datasets. As an example, we extract approximately 50000 features for the Student dataset when the tagged-terms feature extraction is applied. Therefore we apply document frequency filtering to reduce the feature space. According to Salton [34], the most useful terms are the ones having document frequencies between 1% and 10%, because low document frequency terms are generally misspelled ones and high document frequency terms are often stopwords. For this reason, we remove features having a document frequency of less than 2% for each dataset to eliminate misspelled terms. We determined this threshold experimentally.
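The document frequency filter can be sketched as follows. The function name is hypothetical, and the sketch assumes that document frequency is measured as the fraction of documents in the given collection that contain the feature; the text does not state which document collection the percentage refers to.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_fraction=0.02):
    """Keep features that occur in at least min_fraction of the documents."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each feature once per document
    return sorted(t for t, count in df.items() if count / n_docs >= min_fraction)


# Toy collection of 100 documents with a misspelled term in a single page.
docs = [["course"] if i < 50 else ["other"] for i in range(100)]
docs[0].append("cuorse")  # document frequency 1%: removed
docs[1].append("exam")
docs[2].append("exam")    # document frequency 2%: kept
kept = filter_by_document_frequency(docs)
```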
The numbers of features obtained by each feature extraction method after the document frequency filtering is applied are presented in Table 2. As an example, the "Title" column in Table 2 gives the number of features having a document frequency greater than 2% that are extracted only from the <title> tags of the Web documents in the training set of each dataset. The values given in the table belong to the first fold of each dataset. The numbers of features obtained for the other folds are similar, and to save space, they are not included in this paper.
As our aim is to measure the effect of each tag separately, we classify each dataset by using the features extracted from the above mentioned 8 feature extraction methods, and compare the results.

Classifiers
In our study, five different classifiers, namely Naïve Bayes, decision tree, k-nearest neighbor (kNN), rule based, and support vector machine (SVM), are used to show the effect of the HTML tags. For implementation, we use the WEKA package [35]. As the Naïve Bayes classifier we employ Naïve Bayes Multinomial (NBM), since it performs better than the ordinary Naïve Bayes model for document classification [36]. We use the LibSVM package for the SVM classifier, choose a linear kernel as we have a high dimensional feature space [37], and use the default parameters for each dataset.
For the decision tree classifier, we apply J48, which is an implementation of the C4.5 algorithm. For kNN, we employ IBk with k=1 for each dataset; and finally, for the rule based classifier we use OneR from the WEKA package. Among the classifiers used, NB and SVM have been employed in the majority of the Web page classification studies, such as [3,15,19]. In [6,11,12,28] it has been shown that kNN, decision tree, and rule based classifiers also have high accuracy for Web page classification. Therefore we include all five classifiers in our study.

Evaluation metric
In our experiments the F-measure, which is a commonly used metric [1,2,6,38,39], is employed for performance evaluation. The F-measure is the harmonic mean of the precision and the recall, and it is defined as:

$$F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

where precision and recall are computed as:

$$\text{precision} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}, \qquad \text{recall} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$

Given two classes, positive documents are the documents of the main class of interest (e.g., class C1), and negative documents are the documents that do not belong to the main class of interest (e.g., Not C1). According to these definitions, "TruePositives" are the positive documents that are correctly labelled by the classifier, and "FalsePositives" ("FalseNegatives") are the negative (positive) documents that are incorrectly labelled [39].
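The harmonic-mean definition of the F-measure can be sketched in a few lines. The function name and the convention of returning 0 when a denominator is zero are assumptions made for this illustration, and the counts in the example are arbitrary.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for the positive class."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when nothing is correctly labelled
    return 2 * precision * recall / (precision + recall)


# Arbitrary counts: precision = 60/80 = 0.75, recall = 60/100 = 0.6.
score = f_measure(true_positives=60, false_positives=20, false_negatives=40)
```

With these counts the F-measure is 2 x 0.75 x 0.6 / 1.35, which is approximately 0.667 and lies between the recall and the precision, closer to the lower of the two.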

Statistical analysis
The statistical analyses are performed by using SPSS. The F-measure values for the methods are summarized as mean and standard deviation. Repeated measures analysis is used for comparing the F-measure values of the methods. To assess the effect of using the HTML tags on classification accuracy, analysis of variance (ANOVA) is used. The well-known Bonferroni test is applied for pairwise comparisons. A p value below 0.05 is accepted as statistically significant.
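As a rough illustration of the Bonferroni adjustment for pairwise comparisons, the sketch below multiplies each raw p value by the number of comparisons and caps the result at 1. It is not the SPSS procedure used in the study; the function name is hypothetical, and the raw p values are arbitrary numbers rather than results from the experiments.

```python
from itertools import combinations


def bonferroni_adjust(p_values):
    """Multiply each raw p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


classifiers = ["LibSVM", "NBM", "IBk", "J48", "OneR"]
pairs = list(combinations(classifiers, 2))  # 10 pairwise comparisons

raw = [0.001, 0.02, 0.2]  # arbitrary illustrative p values, not study results
adjusted = bonferroni_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

In this toy example the second p value is below 0.05 before the adjustment but not after it, which shows why the correction matters when many classifier or feature set pairs are compared.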

Results
In this section, experimental results that include the effects of feature extraction methods and classifiers are presented.From the statistical analyses applied, we try to conclude which classifier and feature extraction method have the best performance for each dataset in specific and for all datasets in general.

Time to build and test the classification models
We measure the total time required to train and test the classification models for each dataset. Table 3 gives the average running time over the 4 folds for the Conference dataset. Experiments were run on a machine with 8 GB of RAM and an Intel® Core™ i7-2600 3.80GHz processor. To save space, average running times for only one dataset are presented in this subsection. Similar trends were observed for the remaining datasets.
As seen in Table 3, the running times change depending on the number of features and the classifier used. As expected, when one employs a feature extraction method that yields a small number of features, the running time decreases sharply. We should also point out that among the classifiers we have tested, IBk is the slowest, since it is a lazy method.

F-Measure values of classifiers for each dataset
For each feature extraction method and classifier, the average F-measure values for 4-fold cross validation on each dataset are given in Figure 2, where the x-axis shows the feature extraction methods and the y-axis gives the F-measure values.
As shown in Figure 2 ("i" through "xiii"), the best feature extraction method and the best classifier can change for different datasets; however, the anchor, title, text, bag-of-words, and tagged-terms feature extraction methods produce the best classification performance, as shown by the statistical analyses given in the subsequent subsections.

Comparison of classifiers
When the mean of all the F-measure values obtained from the different feature sets is taken into consideration, the average F-measure values for the classifiers can be calculated. The results are given in Table 4. Based on the p value given in Table 4, there is a significant difference among the classifiers (p<0.001). According to the pairwise comparisons between the classifiers, the LibSVM classifier performs better than the IBk, OneR, and NBM classifiers.
We used the NBM classifier since it performs better than the NB implementation in WEKA for document classification [36]. The results of our experiments, in which both the NB and NBM classifiers of WEKA were applied to all the datasets, support that conclusion. We obtained an overall classification accuracy of 0.755 (±0.127) for the NB classifier, whereas, as seen in Table 4, it is 0.841 (±0.085) for the NBM classifier. This result is also compatible with that of [40], which compares the NB with the SVM for text classification and applies some corrections to improve the performance of the NB classifier.
Although the corrections applied to the NB classifier improved its text classification accuracy, the corrected version still had worse classification performance than the SVM [40], as we observe in this experiment.
When we examine the classification accuracy of the rule based classifier (OneR), we observe that it has the best performance with the data obtained from the WebKB dataset (see Table 5). This result occurs because in the WebKB dataset, class specific terms like "course", "student", "faculty", and "project" occur within the HTML tags as well as in the text, so OneR can easily find these terms and generate rules which involve class specific terms to classify the Web pages. However, as we repeat the experiments for the other datasets, the overall performance of the OneR classifier decreases and becomes worse than that of the LibSVM, J48, and NBM. The J48 has been found to be the second best performing classifier, and this conclusion is compatible with our previous experiments [11,12,15,28], where we found that the J48 performs better than the NB and IBk.
The IBk has good performance in approximately 50% of the datasets; however, as it is a lazy approach it has a high testing time, as shown in Table 3, and its overall classification accuracy is not as good as that of the LibSVM, J48, OneR, and NBM. Since the LibSVM was run with its default parameters, better results may be obtained with the LibSVM under different parameter settings. Alternatively, the J48 can be used instead of the LibSVM, as can be seen from Table 4.
On the other hand, when one compares the running times of the classifiers, it is seen that the LibSVM has running times similar to those of the J48, and both are much faster than the IBk (Table 3).

Effect of classifiers and feature extraction methods on each dataset
Table 5 summarizes, for each dataset, the classifiers and the feature extraction methods that give the best F-measure values. According to the analysis presented in Table 5, for most of the datasets the bag-of-words or the tagged-terms methods give the highest F-measure values when used with the LibSVM or the J48 classifiers.
For the Course, Project, and Student datasets, the features extracted from the anchor, bold, and title tags give the highest F-measure values when used with the OneR and IBk classifiers, as these tags include class specific terms for these datasets.

Effect of feature extraction methods on classification accuracy
The mean of all the F-measure values obtained by the different classifiers for each feature extraction method is given in Table 6. The p value given in the table indicates that there are significant differences among the feature extraction methods (p<0.001). According to the pairwise comparisons between the methods, one can conclude that the title, anchor, text, bag-of-words, and tagged-terms feature extraction methods perform better than the bold, header, and list feature extraction methods (see Table 6). The feature sets formed by the bag-of-words or the tagged-terms methods have a large number of features (see Table 2). Thus, using these features decreases the runtime performance of the classifiers (see Table 3). Therefore, choosing tag based feature sets can be more appropriate for large datasets. According to the pairwise comparisons between the feature extraction methods, the feature sets formed by using only the anchor tag, text tag, or title tag can be an alternative to the feature sets of the bag-of-words and tagged-terms methods for large datasets.

Effect of feature extraction methods on classifiers
For the tagged-terms and the bag-of-words methods, the differences between the F-measure values of the IBk and the LibSVM classifiers are statistically significant (p=0.029 and p=0.019, respectively). If these methods are used in classification, then one will get better classification accuracy from the LibSVM classifier than from the IBk classifier (see Figure 3).
For the remaining feature extraction methods, the differences between the F-measure values of the classifiers are not statistically significant. When these feature extraction methods are used, there is no difference between using the IBk, J48, NBM, OneR, or LibSVM from a statistical point of view. However, for each feature extraction method one or two classifiers can be chosen numerically according to their F-measure values (see Figure 3). The feature extraction methods and the classifiers that best suit each method are given in Table 7.

Discussion and Conclusion
In this study we used both the HTML tags and the stemmed terms that belong to each tag, and also all the terms from the Web pages as classification features.We performed our experiments on 13 datasets with 8 feature extraction methods and repeated the experiments with 5 different types of classifiers using 4-fold cross validation to explore the effects of using HTML tag based features on classification accuracy.First of all, we compared classification performances of classifiers.When all the F-measure values are taken into consideration, the SVM classifier seems to be the best choice in terms of classification accuracy and time.
The results of the statistical analysis show that different pairs of feature set and classifier give the highest classification accuracy for different datasets. However, we have also observed that, in most of the datasets, the bag-of-words or the tagged-terms methods give the highest classification accuracy when used with the SVM or the decision tree classifiers.
According to the pairwise comparisons of the feature sets, the anchor, tagged-terms, title, text, and bag-of-words feature sets perform better than the feature sets formed by using the bold, header, and list tags. The tag-based feature sets (the anchor, bold, header, list, title, and text feature sets) have fewer features than the tagged-terms and bag-of-words feature sets, and thus using them improves the runtime performance of the classifiers. The pairwise comparisons also indicate that using only the anchor tag, text tag, or title tag can be an alternative to the bag-of-words and tagged-terms methods.
When the effect of tags on the classification accuracy is examined, the features extracted by the bag-of-words and the tagged-terms methods mostly give better results with the SVM classifier than with the kNN classifier. On the other hand, when the other feature sets are used, there is no statistically significant difference among the kNN, decision tree, NB, rule-based, and SVM classifiers.
Unlike earlier work on Web page classification [1,3,9,17,18,19,23], this study identifies which tags give better performance when used in a feature extraction method. Our results are consistent with the studies that used HTML tags and tried to show their impact [2, 4 -8, 10 -15, 24, 25, 28, 29]. Our study has also shown the positive impact of using HTML tags on classification accuracy. The title, anchor, text, and tagged-terms feature sets give better performance in many cases than the bag-of-words feature set.
As future work, we plan to examine the combined effects of the HTML tag sets and compare them with the results of this study. Furthermore, the experiments may be repeated for multi-class classification, and other classifiers such as Random Forests and Maximum Entropy may be applied.
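The difference in size between tag-based feature sets and the bag-of-words or tagged-terms sets can be illustrated with a small extraction sketch built on Python's standard html.parser module. Everything specific here is an assumption made for illustration: the TAG_GROUPS mapping, the TagTermCollector class, the extract_feature_sets helper, and the sample page are hypothetical. Stemming and feature reduction are omitted, and the text feature set is left out because its exact definition is not restated in this section.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import re

# Assumed mapping from HTML tags to feature-set names (illustrative only).
TAG_GROUPS = {
    "a": "anchor",
    "title": "title",
    "b": "bold", "strong": "bold",
    "h1": "header", "h2": "header", "h3": "header",
    "li": "list",
}


class TagTermCollector(HTMLParser):
    """Counts lower-cased terms per feature set while parsing a page."""

    def __init__(self):
        super().__init__()
        self.open_groups = []                 # groups of currently open tags
        self.features = defaultdict(Counter)  # feature set -> term counts

    def handle_starttag(self, tag, attrs):
        if tag in TAG_GROUPS:
            self.open_groups.append(TAG_GROUPS[tag])

    def handle_endtag(self, tag):
        group = TAG_GROUPS.get(tag)
        if group in self.open_groups:
            # Close the innermost open tag of this group.
            reversed_pos = self.open_groups[::-1].index(group)
            del self.open_groups[len(self.open_groups) - 1 - reversed_pos]

    def handle_data(self, data):
        terms = re.findall(r"[a-z]+", data.lower())
        self.features["bag-of-words"].update(terms)  # every term on the page
        for group in set(self.open_groups):
            self.features[group].update(terms)       # terms inside one tag type
            self.features["tagged-terms"].update(f"{group}:{t}" for t in terms)


def extract_feature_sets(html_text):
    """Return a dict mapping feature-set name to a Counter of terms."""
    collector = TagTermCollector()
    collector.feed(html_text)
    collector.close()
    return dict(collector.features)


sample_page = (
    "<html><head><title>Graduate Course</title></head><body>"
    "<h1>Syllabus</h1><p>Read the <a href='notes.html'>course notes</a> "
    "and <b>submit</b> homework.</p></body></html>"
)
feature_sets = extract_feature_sets(sample_page)
for name in sorted(feature_sets):
    print(name, len(feature_sets[name]))
```

The sketch assumes well-formed markup: an unclosed tag such as a bare li element would keep its group open until the end of the page.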

Figure 1. Block diagram of the applied methods.

Figure 2. F-measure values for each feature extraction method and classifier for all datasets. Panels: i) Course dataset, ii) Faculty dataset, iii) Project dataset, iv) Student dataset, v) Biology dataset, vi) Commercial Banks dataset, vii) Programs dataset, viii) Motorsport dataset, ix) Bands dataset, x) Biomedical dataset, xi) Goats dataset, xii) Sheep dataset, xiii) Conference dataset.

Figure 3. Effect of tags on classifiers.

Table 1. Number of documents in the datasets.

Table 2. Number of features after reduction for each dataset.

Table 3. Average running time for Conference dataset.

Table 4, one can conclude that the LibSVM with linear kernel and default parameter settings is the best among the classifiers used in the experiments. Moreover, if you apply optimal

Table 4. F-measure values of classifiers.

Table 5. Datasets, the corresponding classifiers, and feature extraction methods giving the best F-measure values. * TT = Tagged-terms, BW = Bag-of-words.

Table 6. F-measure values according to feature extraction methods.

Table 7. Feature extraction methods and the best corresponding classifiers.