
A novel feature selection framework for automatic web page classification

  • Regular Papers
International Journal of Automation and Computing

Abstract

The number of Internet users and the number of web pages added to the World Wide Web increase dramatically every day. Web pages therefore need to be classified into web directories automatically and efficiently, which helps search engines return relevant results quickly. Because web pages are represented by thousands of features, feature selection helps web page classifiers cope with this high dimensionality. This paper proposes a new feature selection method based on Ward’s minimum variance measure. The measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features helps minimize resource utilization during classification. The proposed method is compared with other common feature selection methods. Experiments on a benchmark data set, WebKB, show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
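The cluster-then-prune idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how features are encoded, how many clusters are formed, or how the "best representative" of a cluster is chosen. The sketch assumes that each feature is a column of term counts over pages, that the number of clusters is supplied by the caller, and that the representative is the feature with the highest absolute correlation with the class label. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = a["size"], b["size"]
    sq_dist = sum((p - q) ** 2 for p, q in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * sq_dist


def cluster_features(columns, n_clusters):
    """Agglomerate feature columns with Ward's criterion until n_clusters remain."""
    clusters = [{"members": [j], "size": 1, "centroid": list(col)}
                for j, col in enumerate(columns)]
    while len(clusters) > n_clusters:
        # Merge the pair whose union increases the total variance the least.
        i, k = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]))
        a, b = clusters[i], clusters[k]
        size = a["size"] + b["size"]
        merged = {
            "members": a["members"] + b["members"],
            "size": size,
            "centroid": [(a["size"] * p + b["size"] * q) / size
                         for p, q in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, k)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


def pick_representatives(columns, labels, clusters):
    """Keep one feature per cluster. The criterion (absolute correlation
    with the class label) is an assumption, not taken from the paper."""
    def relevance(col):
        n = len(col)
        mx, my = sum(col) / n, sum(labels) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(col, labels))
        vx = sum((x - mx) ** 2 for x in col)
        vy = sum((y - my) ** 2 for y in labels)
        return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0
    return sorted(max(c, key=lambda j: relevance(columns[j])) for c in clusters)


# Toy data: four term-count features over six pages (values are made up).
columns = [
    [3, 2, 3, 0, 0, 0],  # feature 0
    [3, 2, 2, 0, 0, 1],  # feature 1, nearly redundant with feature 0
    [0, 1, 0, 2, 0, 2],  # feature 2
    [0, 1, 0, 2, 0, 2],  # feature 3, a duplicate of feature 2
]
labels = [1, 1, 1, 0, 0, 0]  # hypothetical binary page classes

groups = cluster_features(columns, n_clusters=2)
kept = pick_representatives(columns, labels, groups)
print(groups, kept)
```

The merge cost used here is the standard form of Ward's criterion: joining clusters A and B raises the within-cluster sum of squares by |A||B| / (|A| + |B|) times the squared distance between their centroids. The pairwise search is cubic in the number of features, so a data set with thousands of features would call for an optimized library routine instead.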



Author information

Corresponding author

Correspondence to J. Alamelu Mangai.

Additional information

J. Alamelu Mangai graduated from Annamalai University, India in 2005. She is a Ph.D. candidate at BITS Pilani, Dubai Campus, UAE, where she has been working as a senior lecturer in the Department of Computer Science.

Her research interests include data mining algorithms and text and web mining.

V. Santhosh Kumar received his Ph.D. degree from the Indian Institute of Science, Bangalore, India. He is currently working as an assistant professor at BITS Pilani, Dubai Campus, UAE.

His research interests include data mining and performance evaluation of computer systems.

S. Appavu alias Balamurugan received his Ph.D. degree from Anna University Chennai, Chennai, India. He is currently working as an assistant professor in the Department of Information Technology at Thiagarajar College of Engineering, Madurai, India.

His research interests include pattern recognition, data mining and informatics.


About this article

Cite this article

Alamelu Mangai, J., Santhosh Kumar, V. & Appavu alias Balamurugan, S. A novel feature selection framework for automatic web page classification. Int. J. Autom. Comput. 9, 442–448 (2012). https://doi.org/10.1007/s11633-012-0665-x

  • DOI: https://doi.org/10.1007/s11633-012-0665-x
