Designing an information system for the electronic document management of a university: automatic classification of documents

To ensure the effective functioning of the university educational environment, the automation of document flow processes, including automatic document classification, is of great importance. The article considers the task of classifying university documents with machine learning methods in order to improve classification quality. Document preprocessing was carried out, which made it possible to extract the significant words in documents and thereby increase classification accuracy. The TF and TF-IDF feature extraction methods, which identify keywords by the frequency of words in a document, are described. A modification of the TF-IDF method is proposed in which the importance of a word depends on its part of speech. This improved classification quality by highlighting only the important and significant words in documents. A classification algorithm is proposed that uses the support vector machine to reduce the number of documents involved in classification and the k-nearest neighbors method for the classification itself. The advantage of this algorithm over the described analogues is shown: it decreases the number of mistakenly classified documents.


Introduction
The effective functioning of the educational system is largely determined by the methods and means used to automate document flow. Existing electronic document management systems mainly solve a narrow range of tasks and do not contain tools for integration with other systems. A university typically uses several systems for document management, which makes searching for information time-consuming for an employee. Therefore, the problem of reducing the costs of storing and retrieving the information needed to organize university education is relevant. The task of creating a unified information system for data storage is of great importance, as is the task of automatically transferring documents from existing systems to a new information system. Automatic document transfer is closely related to the text classification task, for which machine learning methods are used [1]. One of the main machine learning tasks is to improve document classification accuracy. There are methods [1][2][3][4] that have proven effective in solving the keyword classification problem. In this case, classification is performed on constructed document vectors containing numerical estimates of the words included in the document. To improve classification accuracy, some researchers [3,4] use methods to reduce the dimension of document vectors, while others combine classification methods [5,6]. It has been noted that the combination of the k-nearest neighbors method and the support vector machine is effective [2,7]. The essence of the k-nearest neighbors method is that a document is assigned to the class to which most of the nearest documents belong. The main goal of the support vector machine is to construct a hyperplane that divides documents into classes.
In the works considered, classification is reduced to finding the optimal number of a document's nearest neighbors, which is set equal to the number of support vectors. This solution works well for collections in which the number of documents in each class is the same; for collections with different numbers of documents per class, classification accuracy will be lower. To improve classification accuracy, we propose to make the word weight depend on the part of speech when constructing document vectors. This reduces the vector size and assigns more weight to the most significant words in the document. We propose to implement the university document classification procedure as an algorithm that combines the support vector machine and the k-nearest neighbors method: first the support vector machine reduces the set of documents involved in classification, then the k-nearest neighbors method classifies the documents. This improves classification accuracy by reducing the number of documents involved in the classification process.

Document preprocessing
Document preprocessing is an important step in the document classification process because it clears document texts of words that carry no semantic load. Consider the task of classifying university documents using the SibADI document collection, consisting of 1778 documents. Each document in the collection belongs to one of four classes: organizational documents (146 documents), long-term administrative documents (671 documents), information and reference documents (316 documents), and short-term administrative documents (645 documents). We carried out preprocessing of the documents in the following stages: all characters of the document texts are converted to lower case, the texts are split into words, noise words (words that carry no semantic load) are removed, and the remaining words are reduced to their normal form. Before preprocessing, the university document collection contained 37451 unique words; after preprocessing, the number of unique words was reduced to 13460 (table 1). As a result of preprocessing, all significant words found in a document are extracted as its characteristics. The number of unique words was reduced by almost a factor of three, which reduces the number of classification errors.
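The preprocessing stages above can be sketched in a few lines. This is a minimal illustrative sketch: the STOP_WORDS set and the normalize function below are simplified stand-ins for the full stop-word list and the morphological normalization applied to the real collection.

```python
import re

# Toy stop-word list standing in for the real noise-word dictionary.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}

def normalize(word):
    # Crude suffix-stripping stand-in for morphological normalization.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Stage 1: lower-case; stage 2: split into words;
    # stage 3: drop noise words; stage 4: bring words to normal form.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [normalize(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The orders were signed and registered in the system"))
```

Applied to a whole collection, this step is what shrinks the vocabulary (here from 37451 to 13460 unique words) before vectors are built.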

Creating a numeric document model
Each document is represented as a vector of word weights $w = (w_1, \ldots, w_s)$, where $s$ is the number of unique words in the entire document collection. Let us consider the feature extraction methods used to calculate document word weights. The TF (term frequency) method evaluates the importance of a word within a single document [10]. According to the TF method, the weight of the $k$-th word is its frequency within the document:

$$w_k = \frac{n_k}{n},$$

where $n_k$ is the number of occurrences of the $k$-th word in the document and $n$ is the total number of words in the document. Another method, TF-IDF (term frequency-inverse document frequency), uses the inverse document frequency to reduce the weights of commonly used words [10]. According to the TF-IDF method, words that are frequent in a particular document and rare in the rest receive more weight:

$$w_k = \frac{n_k}{n} \cdot \log\frac{|D|}{D_k},$$

where $|D|$ is the number of documents in the dataset and $D_k$ is the number of documents in which the $k$-th word occurs.
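The TF and TF-IDF weights described above can be computed directly. The three-document collection below is a toy example, not the SibADI data:

```python
import math

# Toy collection: each document is a list of preprocessed words.
docs = [["order", "sign", "order"], ["provision", "order"], ["provision", "report"]]

def tf(word, doc):
    # Frequency of the word within a single document: n_k / n.
    return doc.count(word) / len(doc)

def tf_idf(word, doc, docs):
    # TF scaled by the log inverse document frequency log(|D| / D_k).
    d_k = sum(1 for d in docs if word in d)
    return tf(word, doc) * math.log(len(docs) / d_k)

print(round(tf("order", docs[0]), 3))              # 0.667
print(round(tf_idf("report", docs[2], docs), 3))   # 0.549
```

Note that a word occurring in every document gets a TF-IDF weight of zero, which is exactly the mechanism that suppresses commonly used words.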
The methods discussed are good at distinguishing keywords in documents. However, they make the most common and rarest words in documents more significant, which may not reflect the essence of the document and reduces classification quality. To solve this problem, we propose to additionally check the part of speech of a word when calculating its weight. We upgrade the TF-IDF method by making the word weight depend on the part of speech: in the proposed TF-IDF-PS (part of speech) method, the TF-IDF weight of a word is scaled by a coefficient determined by its part of speech, so that words of non-informative parts of speech receive zero weight. We remove from the matrix W words with a zero weight in all documents to reduce the dimension of the document vectors. Table 2 shows the number of words in the BoW model after reducing the dimension of the document vectors. Before dimensionality reduction, document vectors contained the weights of 13460 words. The dimension of the vectors did not decrease after applying the TF and TF-IDF methods; using the TF-IDF-PS method, document vectors were reduced to 9116 word weights. With the proposed TF-IDF-PS method, noise words receive the least weight in a document and do not affect the classification result. Thus, the number of noise words missed at the document preprocessing step is also reduced at the step of constructing the numerical document model. This improves classification quality by ensuring that the classifier focuses only on important and significant words in documents.

Building a classifier
Building a classifier is an important part of the document classification process, because the classification result depends on the chosen machine learning method. We build a classifier based on the SVM (support vector machine) method and the kNN (k-nearest neighbors) method. Let us consider the SVM method. The essence of the method is to build a hyperplane that separates documents into classes. To illustrate how the method works, consider the two-dimensional case with two classes of documents (Figure 1). To build a hyperplane, we took five long-term administrative documents and five organizational documents that contain the words «provision» and «order». The frequency with which a word occurs in a document is taken as its weight. The figure shows the separating plane, indicated by a solid line; it is drawn so that all documents of one class lie on one side of it and those of the other class on the other side. Then, to classify unknown documents, it is enough to see on which side of the line they fall. In general, an infinite set of planes separating the documents into classes can be drawn; however, it is better to select a plane as far away from the documents as possible. The SVM method maximizes the distance between the plane and the document nearest to it. When this condition is met, the plane is called the optimal plane, and the nearest points are called support vectors; in the figure, dashed lines pass through the support vectors. Since document classes are often not linearly separable, a kernel function is used; the kernel value is greater when documents are closer. The result of separating documents into classes is shown in Figure 2. The SVM method using the radial basis function divided the space into areas: the area in the form of a closed figure corresponds to long-term administrative documents, the rest to organizational documents. The class of a document depends on the area into which it falls. SVM gives high classification rates on data with a large feature space.
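The radial basis function kernel mentioned above has the standard form K(x, y) = exp(-γ·‖x − y‖²); a minimal sketch, with an illustrative value of the free parameter gamma:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2); gamma here is an assumed value.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# The kernel value grows as documents get closer in feature space:
near = rbf_kernel([1.0, 2.0], [1.0, 2.1])
far = rbf_kernel([1.0, 2.0], [5.0, 7.0])
print(near > far)  # True
```

This is the property the text relies on: documents that are close in the word-weight space produce a kernel value near 1, distant documents a value near 0.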
In addition, SVM accurately classifies documents when the classes are separable; however, its accuracy is reduced when classes overlap. We therefore propose to supplement the classification with the kNN method, which can accurately classify documents even when document classes overlap. Let us consider the kNN method. The essence of the method is that a document is assigned to the class to which most of the nearest documents belong; the nearest documents are determined using a distance metric (Figure 3). To show how the kNN method classifies documents, we took five long-term administrative documents and five organizational documents that contain the words «provision» and «order». The frequency of a word in a document is taken as its weight. The distance to the nearest documents is shown by arrows. To find the class of a document, the classifier calculates the distance from the document to be classified to the documents of the training dataset; then the k nearest neighbors are selected from the training dataset. The number of neighbors is determined in advance and set by the parameter k. The figure shows the case of classification with k = 3. The kNN method gives high classification rates for both separable and overlapping classes; however, it is sensitive to outliers in the data and classifies unpreprocessed documents worse. We combine the SVM and kNN methods into the SVM-kNN classification algorithm (Figure 4). Let us consider the stages of document classification with the proposed algorithm. Classification is divided into two steps. At the first step, the SVM method narrows the space of documents involved in classifying new objects to the support vectors, which lie closest to the separating plane. To do this, we reduce the multi-class document classification task to multiple two-class problems using the «one against all» strategy.
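The kNN step described above (majority vote among the k nearest training documents) can be sketched as follows; the word-weight vectors over «provision» and «order» are toy values, not the paper's data:

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(vec, training, k=3):
    # training: list of (vector, class_label) pairs.
    # Take the k nearest documents and vote on their classes.
    nearest = sorted(training, key=lambda item: euclidean(vec, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy vectors of word weights for «provision» and «order»:
training = [([5, 1], "organizational"), ([4, 2], "organizational"),
            ([1, 5], "administrative"), ([2, 4], "administrative"),
            ([1, 4], "administrative")]
print(knn_classify([2, 5], training, k=3))
```

With k = 3 the query vector [2, 5] falls among administrative documents, matching the voting behavior the figure illustrates.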
The «one against all» strategy consists of constructing one SVM per class, trained to distinguish the samples of that class from the samples of all remaining classes; the document class is then predicted by the classifier with the maximum value. We use a radial basis function to compute the support vectors. At the second step, we classify documents against the support vectors using the kNN method, with the Euclidean distance metric to determine the nearest neighbors of a document [2].
As can be seen from the experimental results, the classification quality obtained using the TF and TF-IDF methods deteriorates significantly over time. This result is explained by the fact that these methods are sensitive to noise data missed at the preprocessing step; they also assign weights to non-informative words such as pronouns and adverbs. The proposed TF-IDF-PS method showed the best result: due to the introduced dependence of word weight on part of speech, non-informative words missed at the preprocessing step are not taken into account when calculating weights. This improves classification quality because the classifier focuses only on important and significant words in documents. Over four years, the quality of document classification according to the metric gradually decreases on the collections of 2017, 2018, 2019 and 2020. The worst classification result was obtained on the 2020 collection using the TF and TF-IDF methods: the F_J value decreased by 22% compared to the result obtained on the 2017 collection. The decrease in classification quality is due to the fact that the style of writing a text depends on the person who wrote it; different people may write the same documentation in different years, and accordingly the set of keywords the classifier focuses on also changes. The best classification result on the document dataset of each year is obtained using the TF-IDF-PS method.
For individual document classes, the difference between the results obtained by the kNN and SVM methods and those of the SVM-kNN algorithm is 5% and 3%, respectively. Table 3 shows the overall result of document classification. As can be seen, the results of the SVM-kNN method are higher than the results of the other methods. Compared to the results obtained by the kNN and SVM methods, the number of documents incorrectly classified by the SVM-kNN method decreased by more than three times.
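Putting the two steps together, the SVM-kNN procedure can be sketched with off-the-shelf components: train one-vs-rest SVMs with an RBF kernel, keep only the support vectors, and run kNN over that reduced set. The synthetic two-class data below is purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two well-separated clusters standing in for document vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Step 1: one-vs-rest SVM with an RBF kernel; keep only the support
# vectors, i.e. the documents closest to the separating surface.
svm = SVC(kernel="rbf", decision_function_shape="ovr").fit(X, y)
X_sv, y_sv = X[svm.support_], y[svm.support_]  # reduced training set

# Step 2: kNN with Euclidean distance over the reduced set.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_sv, y_sv)
print(knn.predict([[0.0, 0.0], [4.0, 4.0]]))
```

The design point is that step 2 searches only among the support vectors rather than the full training collection, which is how the algorithm reduces the number of documents involved in classification.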

Discussion of the results
Automatic document classification is one of the main tasks in the development of university electronic document management. Automatic classification makes working with the electronic document management system more convenient, as it automates the routine process of classifying documents so that university employees can focus on core tasks and decision making. Based on the study, it was concluded that the proposed TF-IDF-PS feature extraction method and the SVM-kNN classification algorithm are suitable for solving the problem of automatic classification of university documents. Numerical experiments have shown that, due to the introduced dependence of word weight on part of speech, TF-IDF-PS assigns more weight to the most significant words in a document, thereby increasing classification quality. Use of the SVM-kNN algorithm significantly improves the classification result: by reducing the number of documents involved in the classification process, classification accuracy is improved and the number of erroneously classified documents is reduced. It seems promising to analyze the suitability of the TF-IDF-PS method for real-time classifier training.

Conclusion
As a result of the study, document preprocessing was carried out, which made it possible to extract all significant words found in documents. This led to a decrease in classification errors, which is important for university employees, who will spend less time on manual document classification and will be able to devote working time to more creative tasks. Classification was carried out by a hybrid method that reduces the set of documents involved in classification with the support vector machine and then classifies documents with the k-nearest neighbors method. This solution allows us to select the most informative documents and to classify objects accurately in the case of overlapping classes. It is proposed to calculate feature importance depending on the part of speech; this reduces the number of noise words missed at the preprocessing step and improves classification quality because the classifier focuses only on important and significant words in documents.