A Comparison of Different Approaches to Document Representation in Turkish Language

Abstract: Recently, deep learning methods have demonstrated state-of-the-art performance in numerous complex Natural Language Processing (NLP) problems. Easy accessibility of high-performance computing resources and open-source libraries makes Artificial Intelligence (AI) approaches more applicable for researchers. This sudden growth of available techniques has shaped and improved standards in the field of NLP. Thus, we find an opportunity to compare different approaches to document representation, owing to various open-source libraries and a large amount of research. We evaluate four different paradigms to represent documents: traditional bag-of-words approaches, topic modeling, embedding-based approaches and deep learning. As the main contribution of this article, we aim at evaluating all these representation approaches with suitable machine learning algorithms for the document categorization problem in the Turkish language. The supervised architecture uses a benchmark dataset specifically prepared for this language. Within the architecture, we evaluate the representation approaches with corresponding machine learning algorithms such as the Support Vector Machine (SVM), the multinomial Naive Bayes algorithm (mNB) and so forth. We conduct a variety of experiments and present successful results for Turkish document categorization. We also observed that traditional approaches still yield results comparable to those of neural network models in terms of document classification.


Introduction
Words need to be represented in vector space models (VSM) for Natural Language Processing (NLP) problems. For years, VSM has been used in the field of NLP to compute semantic similarity. First, [24] represented a word as a real-valued vector by using co-occurrence statistics to measure semantic similarity. It is based on the idea that if two documents or words share similar neighboring words, they are considered similar. The similarity between the vectors of the words is simply computed by cosine similarity and other metrics. The approach uses a fixed-length representation of a document within a document-term matrix. It is also called the bag-of-words (BoW) model, where the bag contains the words of a document while ignoring word order.
The main drawback of the traditional BoW approach is high dimensionality. A variety of approaches have been applied to dimensionality reduction. The common way is feature selection by exploiting some selectional criteria. Other paradigms, namely topic modeling, have also been used for reduction. Latent Semantic Indexing (LSI) (or Latent Semantic Analysis, LSA) has mostly been applied to such a document-term matrix to reduce the dimension [9]. Another popular and widely used topic model is Latent Dirichlet Allocation (LDA), a newer LSI-like paradigm designed especially for textual data [7]. It can be considered another document representation model in which a stochastic gradient optimization algorithm clusters the documents based on word co-occurrence statistics, which yields a new representation for textual data. Probabilistic Latent Semantic Analysis (PLSA) employs a probabilistic method rather than matrix decomposition.
Recently, neural network language models (NNLM) have demonstrated promising performance by reducing time complexity and have successfully solved many NLP problems [18]. They effectively generate dense and short embeddings, namely word embeddings [22,23]. For document embeddings, averaging the word embeddings in a document is one possible representation. Alternatively, [17] proposed a method to produce document vectors that is similar to word embeddings. The method directly produces document embeddings along with the word embeddings. They found document embeddings very effective for the problems of sentiment analysis and document retrieval [17]. There are also deep learning architectures such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) [19]. They implicitly map the documents layer by layer and classify them.
The motivation of this study is to evaluate these representation models for the problem of document classification in the Turkish language. The study can be considered the first attempt that compares these approaches for the Turkish document classification problem. We design a supervised architecture using a benchmark dataset in which the documents are collected from a news corpus and labeled with categories. The different representation techniques, from NNLM to traditional approaches, are applied to the documents. Then, the architecture trains models using these new representations as predictors and the corresponding categories as the target class. We measure how effectively each paradigm represents the documents for the problem. We report our results with a detailed analysis as described in the following sections.

Turkish Document Classification in Literature
There exist a variety of studies on the Turkish document classification problem. These studies have examined topic, authorship and gender detection. As one of the earliest studies, [4] handled the document classification problem using n-gram and BoW approaches. In order to identify authorship, genre and gender, four different machine learning algorithms were applied and achieved successful results. This study can be considered one of the first attempts for the Turkish language. In another early study, [29] applied classification methods to 18 different authors, with 35 documents per author. They compared 5 different classification algorithms using n-gram and BoW approaches and noted that using n-grams together with other features highly improves system performance. They found the Artificial Neural Network (ANN) and Support Vector Machine (SVM) better than the other algorithms. The most comprehensive study examining the traditional methods was conducted by [3]. They evaluated the contribution of document representation techniques through six different text classification problems: detection of four different moods from a given text, three different sentiments from movie reviews, authorship attribution among 18 journalists, gender identification, detection of 5 different news categories, and identification of the poet of a given poem out of 4 poets.
Recently, [16] constructed a benchmark dataset, namely TTC-3600, including 6 different classes in order to study the document classification problem. They used BoW and n-gram approaches together with a feature selection method on the dataset. Six different machine learning algorithms were evaluated. Stemming and Attribute ranking-based Feature Selection (ARFS) were utilized to improve performance. They concluded that the Random Forest (RF) algorithm with stemming and ARFS performed better than the other configurations. Some studies specifically examined the preprocessing phase to evaluate its contribution to the models. [27] measured the contribution of using the longest and shortest stems suggested by the Zemberek library to classifier performance, where the classifiers are RF and NB. They concluded that the former is better than the latter. Another study measured the contribution of stemming to model performance [2]. However, the numbers of examples and classes were not at a sufficient level. They noted that they did not find a statistically significant contribution from stemming. On the other hand, they alternatively applied a very simple and different stemming approach, Fixed Prefix Stemming (FPS), which takes only the first K characters of a given word. Interestingly, they noted that FPS outperformed traditional stemming, with the optimum K found to be 5, where the machine learning algorithms are SVM and NB. Another important study examining the contribution of the preprocessing phase was conducted by [28]. Stemming, stopword elimination and feature weighting were among the techniques applied to Turkish text classification. They applied the Zemberek and FPS stemming approaches and concluded that stemming is very useful for information retrieval but not for text classification; they did not observe any contribution to the latter problem. [30] examines the preprocessing phase and its contribution with respect to domain, natural language and dimensionality reduction. Two different languages and different domain information were used within a well-designed architecture. They concluded that an improperly designed system might lead to underestimation and poor results. [31], examining the stemming phase, used information gain and the Naive Bayes (NB) classifier. They found that the contribution of stemming to text classification was quite limited.
Other than the traditional approaches, there exist some alternative approaches designed for the Turkish language.
[1] utilized Hidden Dirichlet Analysis (HDA), LSA and LDA and discussed the effect of dimensionality reduction on the problem. They compiled an annotated corpus including academic articles written in Turkish. The system was tested on two different datasets containing 18 and 34 classes, respectively. Although they found stemming useful with the SVM, NB and RF classifiers, we observed that their results are very low compared to other studies on the Turkish language. In a different study, [10] utilized word semantic analysis for the text classification problem. The study constructed a new representation model using semantic similarity between words and solved the text classification problem within the semantic space. The words were clustered using Euclidean distance, and the cluster membership information was used to create new dimensions for the documents. Consequently, the documents are represented in a smaller dimension. With a dimension size of 100, they obtained a 92.5% success rate using the Logistic Regression (LR) classifier.
Ensemble learning is considered an important learning approach in the field of machine learning, even though it suffers from over-fitting in some cases. [5] compared ensemble learning with traditional machine learning approaches using two different 6-class datasets for the text classification problem. In order to reduce running time complexity, they utilized pruning techniques. They pruned the forest at a rate of 90% without losing accuracy.

Document Representation and The Methodology
Supervised learning models require a tabular data format from which they induce a function between the independent variables and the dependent variable. Thus, the documents first need to be represented as fixed-size vectors along with their categories, such as economy, sports and so forth. There are a variety of ways to represent a document as a fixed-size vector. We applied all the techniques listed below to our document classification problem in the Turkish language.

BoW model
Vector space models are used to represent a document or a word by embedding it into a vector space. For years, VSM has been used in the field of NLP to compute semantic similarity. First, [24] represented a word as a vector by counting co-occurrence statistics with other words, where the dimension of the matrix is equal to the vocabulary size. Fixing the size of the vectors makes it easier to compute similarity and other metrics. To compare two documents or words, the cosine similarity function is applied to two vectors of the same size. This fixed-size approach creates a document-term matrix where rows indicate documents and columns indicate words. It is also called the bag-of-words (BoW) model, where the bag contains the words of a document while ignoring word order.
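As a minimal sketch of this representation, the following hypothetical example builds a document-term matrix from toy Turkish snippets with scikit-learn and compares documents by cosine similarity (the snippets and tokens are illustrative, not from the benchmark dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "ekonomi piyasa borsa dolar",   # two hypothetical economy snippets
    "borsa dolar faiz ekonomi",
    "futbol mac gol takim",         # a hypothetical sports snippet
]
X = CountVectorizer().fit_transform(docs)   # document-term matrix (rows: docs, columns: terms)
sims = cosine_similarity(X[0], X)           # compare the first document with all documents
print(sims.round(2))                        # the two economy snippets are most similar
```

The first two rows share three of four terms, so their cosine similarity is high, while the sports snippet shares none and scores zero.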
The main disadvantage of the BoW representation is the enormous number of terms, which is equal to the size of the vocabulary. As the dimension of the vector increases, so does the computational complexity of the designed system. The widely applied solution is feature elimination in the preparation step. It discards non-informative terms based on metrics computed from corpus statistics. [25] pointed out that frequent terms can be informative. Some selectional criteria such as chi-square (χ2) are found to be very effective for finding informative terms in a corpus [20,21,25]. Some studies reported that the most effective selectional criterion is Information Gain (IG) [21,25]. It measures how many bits of information the presence or absence of a word in a document contributes to model accuracy. χ2 is another widely used statistic in many fields. It tests the lack of independence between a word and a category using the document-term table. Other selectional criteria such as pointwise mutual information (PMI) and DICE are also used.

Topic modeling
As one of the dimensionality reduction approaches, Latent Semantic Indexing (or Latent Semantic Analysis) has been widely applied to the document-term matrix to reduce the dimension and produce informative, short latent dimensions. LSI uses Singular Value Decomposition (SVD) as a method for building significant dimensions derived from a document-term matrix [9]. It belongs to a family of methods that can approximate an N-dimensional matrix using fewer dimensions, including Principal Component Analysis (PCA), Factor Analysis, etc. [14,15,26]. Some latent indexing approaches do not use matrix decomposition. For instance, PLSA employs a probabilistic method instead; it is also called probabilistic latent semantic indexing.
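LSI can be approximated in scikit-learn by applying TruncatedSVD to a TF-IDF document-term matrix; a minimal sketch with a toy corpus and two latent dimensions (the corpus and dimension size are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["borsa dolar faiz ekonomi", "dolar ekonomi kriz borsa",
        "gol mac takim futbol", "futbol mac gol lig"]
X = TfidfVectorizer().fit_transform(docs)      # sparse document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)                       # dense 2-dimensional latent document vectors
print(Z.shape)                                 # (4, 2)
```

Each document is now a short dense vector in the latent space instead of a sparse vocabulary-sized row.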
LDA can be considered another document representation model where a stochastic algorithm clusters the documents based on co-occurrence statistics [7]. It represents documents as a list of discovered topics. Topics and their probabilities are learned as discrete distributions, where each topic consists of a set of words. The model takes the number of topics and the number of words per topic as parameters before the training phase. The documents are then represented by means of latent semantic structures, the topics. Contrary to feature selection models, LSI and LDA do not explicitly use the words as dimensions but instead create the most informative latent dimensions by word composition.
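A corresponding LDA sketch: each document is transformed into a distribution over a fixed number of topics (here 2 topics on a toy corpus; the experiments in this paper use larger sizes):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["borsa dolar faiz ekonomi dolar", "dolar ekonomi kriz borsa faiz",
        "gol mac takim futbol gol", "futbol mac gol takim lig"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)     # per-document topic distribution
print(theta.shape)               # (4, 2); each row sums to 1
```

The rows of `theta` serve directly as fixed-size numerical document representations for a downstream classifier.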

Document and word embeddings
Recently, NNLM have gained great attention and demonstrated promising performance by reducing time complexity. The most important characteristic of NNLM is their capacity for generating dense and short embeddings, namely word embeddings [22,23]. In the neural network architecture, each word is initially associated with a random vector. As a two-layer neural network processes the textual corpus, the vectors are iteratively updated by applying stochastic gradient descent (SGD), where the gradient is computed by back-propagation. The objective is to guess the last word from a given word sequence. Thus, the prediction task is similar to multi-class classification, where the softmax function is used to compute class probability estimates. Upon convergence, the network has learned embeddings for all words appearing in the corpus.
As one of the most popular word embedding models, the word2vec model showed how word embeddings can be efficiently trained within two different architectures, namely Continuous Bag of Words (CBoW) and Skip-gram (SG) [22]. The architecture achieved both minimal computational time complexity and high model accuracy. As a second model, [23] proposed another word embedding model, namely GloVe. It is based on matrix factorization and a new global log-bilinear regression model that combines the advantages of the two important model families in the literature: global matrix factorization and local context window methods. These two popular word embedding models also proved that embeddings are very good at capturing syntactic and semantic regularities, using the vector offsets between word pairs. Naturally, word embeddings also help to improve document representation. Averaging all word embeddings in a document is considered a possible representation of the document. On the other hand, [17] presented another neural network based approach to train document embeddings, called the paragraph vector. Learning paragraph vectors is highly inspired by the neural network architecture of word embeddings, word2vec. The architecture trains the vectors by a process that predicts the last word using the other words in a given context. The network uses a fixed-length context obtained by a sliding window of size K. The paragraph vectors are learned in a similar manner, where each paragraph is initially associated with a random vector that is added to the head position of each contextual window. The architecture then tries to predict the last word using all the vectors of the words in the context plus the paragraph vector. The architectures use either averaging or concatenation of the vectors.
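Averaging word vectors into a document vector can be sketched as follows; the 4-dimensional vectors below are hypothetical stand-ins for real pre-trained embeddings (typically 300-dimensional), and out-of-vocabulary tokens are simply skipped:

```python
import numpy as np

# Hypothetical pre-trained word vectors (real models are ~300-dimensional)
embeddings = {
    "borsa":  np.array([0.9, 0.1, 0.0, 0.2]),
    "dolar":  np.array([0.8, 0.2, 0.1, 0.1]),
    "futbol": np.array([0.1, 0.9, 0.8, 0.0]),
}

def document_vector(tokens, emb, dim=4):
    """Average the vectors of the in-vocabulary tokens of a document."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

doc_vec = document_vector(["borsa", "dolar", "kriz"], embeddings)
print(doc_vec)   # mean of the "borsa" and "dolar" vectors; "kriz" is out of vocabulary
```

The result is a fixed-size dense vector regardless of document length, which is what the downstream classifiers require.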
The paragraph token is shared across all contexts generated from the corresponding paragraph but not across paragraphs. However, the word vectors are shared across paragraphs. As the paragraph token acts as a memory and its vector is always added to each context, this model is called the Distributed Memory Model of Paragraph Vectors (PV-DM). The paragraph vectors and word vectors are trained using SGD, and the gradient is obtained via back-propagation. At every step of SGD, the error gradient is computed via the NNLM and the parameters of the model are updated using the gradient. An alternative approach ignores the contextual window and instead randomly selects a fixed number of words from the paragraph at hand. At each iteration of SGD, a random word is selected as the target class, and the remaining sampled words together with the paragraph token are used as predictors, as in a multi-class classification task. Since it is very similar to the bag-of-words approach, it is called the Distributed Bag of Words version of Paragraph Vectors (PV-DBoW). Moreover, it can be considered the counterpart of the skip-gram (SG) model of the word2vec implementation.

Deep learning
Word embedding methods learn a real-valued vector representation over a fixed-size vocabulary. Word2vec or GloVe is a two-layer shallow neural network that takes a textual corpus and produces a set of word vectors. Therefore, word2vec itself is not considered an example of deep learning. But such methods can turn documents into real-valued vectors that deep learning networks can process. The CNN and RNN, which are the two main types of deep neural network architectures, are utilized to address various NLP tasks [19]. The CNN is a feed-forward network equipped with convolution layers interleaved with pooling layers. Max pooling is used to reduce the number of parameters within the model and generalizes the results of a convolutional filter. The RNN is used for modeling units in sequence and temporal dependencies [12]. While CNNs are described as hierarchical architectures, RNNs are sequential ones [32]. However, simple RNNs have difficulty capturing long-term dependencies because of vanishing or exploding gradients [6,13]. Gating mechanisms such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) were developed to alleviate these limitations of the standard RNN and resolve the vanishing and exploding gradient problems [13], [8]. We applied all these deep learning algorithms using the Python Keras library with default parameters.
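As an illustration of the CNN branch, a minimal Keras text classifier could be sketched as below. This is a hypothetical configuration, not the exact architecture used in the experiments; the vocabulary size, sequence length and filter sizes are illustrative:

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len, num_classes = 10000, 200, 7

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 100),          # learn 100-dim word embeddings
    layers.Conv1D(128, 5, activation="relu"),   # convolution over word windows of size 5
    layers.GlobalMaxPooling1D(),                # max pooling over the sequence
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.randint(0, vocab_size, size=(2, seq_len))   # two dummy token-id sequences
probs = model.predict(x, verbose=0)
print(probs.shape)   # (2, 7): one probability per category
```

The convolution-plus-max-pooling pair implements the hierarchical feature extraction described above, while the final softmax layer outputs a probability for each of the seven categories.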

Data
In this study, we used a benchmark dataset compiled from Turkish newswire documents under seven different categories: world, economy, culture-art, health, politics, sports and technology. There are 700 documents under each category and 4900 documents in total. The total number of tokens in the corpus is over 1.3 M. The data was obtained from a publicly available website. The whitespace characters, all digits and punctuation marks were removed. Some collocations were identified and annotated, such as Istanbul_Belediyesi. The collocation phrases are captured by the log-likelihood ratio. We only used the surface forms of the terms, and the number of unique terms is about 100 thousand.
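The log-likelihood ratio used for collocation detection can be sketched with Dunning's formula over a 2x2 contingency table of bigram counts; the counts below are hypothetical:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 bigram contingency table.

    k11: count of the bigram (w1, w2); k12: w1 without w2;
    k21: w2 without w1; k22: all remaining bigrams in the corpus.
    """
    def h(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

# Hypothetical counts for a strongly associated pair such as "istanbul belediyesi"
score = llr_2x2(80, 20, 10, 99890)
print(score > 100)   # True: a high score marks the pair as a collocation
```

A near-independent pair (observed count close to its expected count) scores near zero, so ranking bigrams by this statistic surfaces genuine collocations.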

Representations
Contrary to the other paradigms, feature selection based BoW approaches require three separate datasets: a development set, a training set and a test set. We selected one third of the documents as the development set. The remaining data is divided by 10-fold cross validation. The development set is exploited to specify informative words in order to reduce dimensionality; the feature selection metrics use this set to rank and eliminate words. We observed and report that IG and χ2 are the most powerful selectional criteria. All documents are then represented by those selected informative words. The second set is used to train the classifiers, and the test set is used to measure model performance. This is similar to the usual validation methodology of machine learning, such as K-fold cross validation.
Topic modeling simply represents the documents in a given number of dimensions. Both LSI and LDA require the dimension size, that is, the number of topics, to train the model. We set the dimension size to 200 and 400, respectively. Each dimension is in fact a probabilistic composition of the most contributing words; we set the number of words per topic to 10. Topic modeling thus produces numerical variables for the representation. Averaging word embeddings is another possible way of representation. We exploited two important models for word embeddings, word2vec and GloVe, where the dimension size is set to 300; the documents are represented by averaging the vectors of their words. There is also an alternative and more effective way of obtaining document-specific embeddings, namely doc2vec. We prepared a doc2vec document embedding architecture that consists of two different settings: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vectors (PV-DBoW). These two models suggest different document representations, and the concatenation of the two can be used as a third. Finally, we applied three representations of document embeddings: PV-DM, PV-DBoW and PV-DM+DBoW. The embedding based models, word embeddings averaging and document embeddings, also produce numerical variables for the representation.

ML algorithms
Especially for the BoW representation, mNB has been found to be an effective classification method for text classification. It has been found competitive with state-of-the-art algorithms in many studies [21]. We applied mNB with the smoothing parameter alpha set to 1, its default. Since mNB requires non-negative variables, we can only apply it to the BoW representation and not to the other representation models, since those might produce negative values in the vectors. SVM has also been found to be a popular machine learning algorithm for many data mining problems. Many researchers have shown that the performance of the linear kernel of SVM is better than that of the radial and Gaussian kernels, especially for the text categorization problem. It is quite similar to the perceptron.
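A sketch of mNB with alpha=1 on a toy BoW matrix (the documents and labels are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["borsa dolar faiz", "dolar ekonomi kriz",   # hypothetical economy docs
        "gol mac takim", "futbol mac gol"]          # hypothetical sports docs
labels = ["economy", "economy", "sport", "sport"]

vec = CountVectorizer().fit(docs)
X = vec.transform(docs)                             # non-negative term counts, as mNB requires
clf = MultinomialNB(alpha=1.0).fit(X, labels)       # alpha=1: Laplace smoothing
pred = clf.predict(vec.transform(["dolar borsa"]))
print(pred)   # ['economy']
```

Because the features are raw counts, the non-negativity requirement is satisfied automatically, which is exactly why mNB pairs naturally with the BoW representation.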
While the perceptron minimizes misclassification errors, the objective of SVM is to maximize the margin, which is the distance between the separating planes. The objective function maximizes the margin between the decision boundaries. When SVM is used with a linear separator, it is called linear SVM. There also exists a non-linear alternative, called kernel SVM. We used SVM with a linear kernel, where the regularization is L2, the penalty parameter C is set to 1 and the loss function is squared hinge. These are mostly the default parameters of the Python sklearn library. Another appropriate algorithm is Logistic Regression (LR). This algorithm is derived from linear regression by the application of the sigmoid function. It is found very useful since it also predicts class probability estimates. While it applies linear separation, there are also non-linear variants of logistic regression. We chose L2 penalization and set the C parameter to 1 for the logistic regression algorithm. The Decision Tree (DT) is another widely used machine learning algorithm because of its interpretability. It applies a divide-and-conquer mechanism to derive a decision tree. It learns a list of questions using the features. In an iterative process, the data is recursively divided into two or more subsets until the leaves are pure; that is, new nodes are created until the examples in each subset belong to the same class. The features need to be ranked according to their importance, for which there are two popular impurity measures, Gini and information gain. We selected Gini as the impurity measure, and the minimum number of samples for a leaf is set to 2.
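The linear SVM configuration described above (L2 regularization, C=1, squared hinge loss) corresponds to scikit-learn's LinearSVC defaults; a toy sketch with hypothetical documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["borsa dolar faiz", "dolar ekonomi kriz",
        "gol mac takim", "futbol mac gol"]
labels = ["economy", "economy", "sport", "sport"]

vec = TfidfVectorizer().fit(docs)
X = vec.transform(docs)
# L2 regularization, C=1 and squared hinge loss, as stated in the text
clf = LinearSVC(penalty="l2", C=1.0, loss="squared_hinge").fit(X, labels)
pred = clf.predict(vec.transform(["mac takim"]))
print(pred)   # ['sport']
```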
The last algorithm, KNN, is selected from the lazy learners. The algorithm does not induce a model. Instead, at decision time, it compares a given instance to the remaining dataset. Therefore, it has no training time but a high testing cost. The algorithm takes the first K neighbors and decides the class based on the majority class of the neighbors. For our experiments, we set K to 3 and selected the Euclidean distance function.
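The KNN configuration (K=3, Euclidean distance) can be sketched on toy TF-IDF vectors; the documents are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["borsa dolar faiz", "dolar ekonomi kriz", "borsa ekonomi dolar",
        "gol mac takim", "futbol mac gol", "takim gol lig"]
labels = ["economy"] * 3 + ["sport"] * 3

vec = TfidfVectorizer().fit(docs)
X = vec.transform(docs).toarray()   # dense vectors for Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, labels)
pred = knn.predict(vec.transform(["dolar ekonomi"]).toarray())
print(pred)   # ['economy']
```

No model is induced at fit time; the classifier simply stores the training vectors and votes among the 3 nearest at prediction time.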

Implementations
All the implementations based on the four different paradigms discussed above were applied to the same benchmark Turkish dataset, in which there are 4900 documents in total labeled with seven different categories. The experiments for word embeddings, doc2vec, GloVe, LSI, LDA, CNN and other functions were mostly conducted with the Python programming language and its libraries: gensim, nltk, sklearn, keras and libsvm. The R and Weka platforms were also used for cross validation, data preparation and some other analyses. In order to fairly compare the results with other studies, the comparison requires the same configuration and the same degree of difficulty. For Turkish, [4] achieved a 93% success rate for three-class genre detection, where each class consists of only 200 examples. [11] obtained at best 95.8% for six-class category detection, where there are only 100 documents under each category. There exists another study whose configuration is roughly equal to that of our study, with 6 classes and 600 documents per category [16]. They achieved 90.1% and 91.3% success rates with Random Forest and the Zemberek stemmer; the second model applied ARFS and obtained better results. When we use only 6 classes of our dataset, we obtain results comparable to theirs.

Conclusions
In this study, we applied different document representation approaches to Turkish document categorization. We categorized the recent studies under four different paradigms and applied them to document categorization in the Turkish language. To give a fair comparison of these approaches, we prepared a benchmark dataset with seven document categories. The methods were tested within a supervised learning architecture, where popular machine learning techniques such as SVM and Naive Bayes were applied to the generated representations. We demonstrated that document embedding models and traditional bag-of-words approaches achieved equally successful results. Although word embeddings, topic modeling and deep learning approaches have been successfully applied to word semantics, the document embeddings and traditional methods outperformed them for document representation. Interestingly, traditional BoW approaches still showed comparable performance for the representation. Our architecture achieved successful results with an F1 score of 90 for the Turkish language.

Table 4. Deep learning results (F1 scores).
The deep learning models still have good capacity. [17] addressed that the NNLM-based paragraph model outperforms traditional models for the problem of sentiment analysis and some IR-related problems other than document classification. For the Turkish text classification problem, however, the paragraph vector and BoW approaches obtain similar scores.