Character Variable Numeralization Based on Dimension Expanding and its Application on Text Classification

Xu, Li-xun; Yu, Xu; Wang, Yong; Feng, Yun-xia

doi:10.1007/978-981-10-2053-7_22

Part of the book series: Communications in Computer and Information Science (CCIS, volume 623)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators

Abstract

Discrete numeralization of character variables destroys their unordered nature, because mapping the values to numbers imposes an order that the values do not have. Since text classification problems contain many character variables, discrete numeralization adversely affects the performance of classifiers. In this paper, we propose a character variable numeralization algorithm based on dimension expanding. First, the algorithm computes the number m of distinct values that the character variable takes. Then it replaces each original value with one of the natural basis vectors of m-dimensional Euclidean space. Although the algorithm expands the dimension, it preserves the unordered nature of character variables because the natural basis vectors do not differ in size, which makes it a better numeralization algorithm for character variables. Experiments on text classification data sets show that, although the proposed algorithm costs slightly more running time, its classification performance is better.
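The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`expand_dimension`, `discrete_numeralize`, `squared_distance`), the sorted ordering of values, and the toy `colors` data are assumptions made only for illustration. The sketch contrasts integer coding with replacement by natural basis vectors.

```python
def expand_dimension(values):
    """Replace each value with a natural basis vector of R^m (hypothetical helper)."""
    # Step 1: find the distinct values; m is how many there are.
    # Sorting is an assumption, used only to make the output deterministic.
    categories = sorted(set(values))
    m = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Step 2: replace each original value with the basis vector e_i.
    encoded = []
    for v in values:
        vec = [0] * m
        vec[index[v]] = 1
        encoded.append(vec)
    return categories, encoded


def discrete_numeralize(values):
    """Baseline for comparison: map each distinct value to an integer 1..m."""
    categories = sorted(set(values))
    index = {c: i + 1 for i, c in enumerate(categories)}
    return [index[v] for v in values]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy data, invented for this example.
colors = ["red", "green", "blue", "green"]
categories, encoded = expand_dimension(colors)
codes = discrete_numeralize(colors)

# Integer codes make "red" closer to "green" than to "blue".
# Basis vectors keep every pair of distinct values equally far apart.
red, green, blue = encoded[0], encoded[1], encoded[2]
print(codes)
print(squared_distance(red, green), squared_distance(red, blue))
```

In this sketch every pair of distinct values ends up at the same distance after dimension expanding, whereas the integer codes place some values closer together than others. The cost is that one variable becomes m columns.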

Acknowledgement

This work is sponsored by the National Natural Science Foundation of China (Nos. 61402246, 61402126, 61370083, 61370086, 61303193, and 61572268), a Project of Shandong Province Higher Educational Science and Technology Program (No. J15LN38), Qingdao indigenous innovation program (No. 15-9-1-47-jch), the National Research Foundation for the Doctoral Program of Higher Education of China (No. 20122304110012), the Natural Science Foundation of Heilongjiang Province of China (No. F201101), the Science and Technology Research Project Foundation of Heilongjiang Province Education Department (No. 12531105), Heilongjiang Province Postdoctoral Research Start Foundation (No. LBH-Q13092), and the National Key Technology R&D Program of the Ministry of Science and Technology under Grant No. 2012BAH81F02.

Author information

Authors and Affiliations

Sino-German Faculty, Qingdao University of Science and Technology, Qingdao, 266061, China
Li-xun Xu
School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
Xu Yu & Yun-xia Feng
College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, China
Yong Wang

Corresponding author

Correspondence to Xu Yu.

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Wanxiang Che
Harbin Engineering University, Harbin, China
Qilong Han
Harbin Institute of Technology, Harbin, China
Hongzhi Wang
Northeast Forestry University, Harbin, China
Weipeng Jing
National University of Defense Technology, Changsha, China
Shaoliang Peng
Harbin Engineering University, Harbin, China
Junyu Lin
Harbin Univ. of Science and Technology, Harbin, China
Guanglu Sun
Harbin Univ. of Science and Technology, Harbin, China
Xianhua Song
Harbin Engineering University, Harbin, China
Hongtao Song
Harbin Sea of Clouds & Computer Tech., Harbin, China
Zeguang Lu

About this paper

Cite this paper

Xu, Lx., Yu, X., Wang, Y., Feng, Yx. (2016). Character Variable Numeralization Based on Dimension Expanding and its Application on Text Classification. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-10-2053-7_22

DOI: https://doi.org/10.1007/978-981-10-2053-7_22
Published: 31 July 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2052-0
Online ISBN: 978-981-10-2053-7
eBook Packages: Computer Science, Computer Science (R0)