KSII Transactions on Internet and Information Systems
Monthly Online Journal (eISSN: 1976-7277)

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

Vol. 18, No. 3, March 31, 2024
DOI: 10.3837/tiis.2024.03.004

Abstract

This study examined all combinations of preprocessing methods in terms of their effects on reducing the number of words, shortening processing time, and classification success, using balanced datasets and datasets imbalanced at different ratios. The reductions in word count and in processing time achieved by preprocessing were interrelated. Classification was more successful on Turkish datasets, and English datasets were more affected by whether the dataset was balanced. In highly imbalanced datasets, misclassified documents from classes with few documents were assigned to a topically close class in Turkish datasets and to a class with many documents in English datasets. In terms of average scores, the highest classification success was obtained in Turkish datasets by not applying lowercase conversion, applying stemming, and removing stop words, and in English datasets by applying lowercase conversion and stemming and removing stop words. Stemming was the most important preprocessing method for increasing success in Turkish datasets, whereas stop word removal was the most important in English datasets. The maximum scores showed that feature selection, feature size, and classifier affect classification success more than preprocessing does. It was concluded that preprocessing is necessary for text classification because it shortens processing time and can achieve high classification success, that a given preprocessing method does not have the same effect in all languages, and that different preprocessing methods are more successful for different languages.
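
To make the idea of testing every preprocessing combination concrete, the following is a minimal Python sketch, not the author's pipeline. It enumerates the eight on/off settings of lowercase conversion, stemming, and stop word removal and counts the distinct words left after each. The stop word list, the `toy_stem` suffix stripper, and the sample documents are illustrative assumptions; a real experiment would substitute a full stop word list and a proper stemmer for the target language.

```python
from itertools import product

# Toy stand-ins (assumptions): a real study would use a complete stop-word
# list and a language-specific stemmer instead of these.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def toy_stem(word):
    """Strip a few common English suffixes; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text, lowercase, stem, remove_stop_words):
    """Apply the selected preprocessing steps to a whitespace-tokenized text."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        # Match case-insensitively so this step works without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def vocabulary_sizes(documents):
    """Map each (lowercase, stem, remove_stop_words) setting to its distinct-word count."""
    sizes = {}
    for flags in product([False, True], repeat=3):
        vocab = set()
        for doc in documents:
            vocab.update(preprocess(doc, *flags))
        sizes[flags] = len(vocab)
    return sizes


docs = ["The cats chased the Cat", "A cat is chasing cats in the garden"]
for flags, size in vocabulary_sizes(docs).items():
    print(flags, size)
```

Each key is a tuple of flags in the order (lowercase, stem, remove stop words), so the settings can be compared side by side. Fewer distinct words generally means fewer features for the classifier to handle, which is one plausible reason the abstract reports word count and processing time as interrelated.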


Cite this article

[IEEE Style]
M. F. Karaca, "Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets," KSII Transactions on Internet and Information Systems, vol. 18, no. 3, pp. 591-609, 2024. DOI: 10.3837/tiis.2024.03.004.

[ACM Style]
Mehmet F. Karaca. 2024. Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets. KSII Transactions on Internet and Information Systems, 18, 3, (2024), 591-609. DOI: 10.3837/tiis.2024.03.004.

[BibTeX Style]
@article{tiis:90671, title="Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets", author="Mehmet F. Karaca", journal="KSII Transactions on Internet and Information Systems", DOI={10.3837/tiis.2024.03.004}, volume={18}, number={3}, year="2024", month={March}, pages={591-609}}