Document Searching of EPPS Test Result Using Indexing Method

This study focuses on document searching of EPPS test results, mainly administered in the education sector for selecting prospective students in universities' admission processes. Admission data has increased considerably with the admission tests administered annually at the beginning of each academic year. Some faculties administer the EPPS test as a second-stage test after prospective students pass the first stage, in addition to the general admission test. These several tests lead to a huge accumulation of data with different information, making it difficult for the organizers to locate the document containing the 15 required aspects of information in a comprehensive information display. Thus, it is necessary to apply an indexing method with a predetermined index to retrieve required information from a particular document or to find other related documents. This research used 74 pdf documents as the dataset. In particular, 60 documents were used as references from which phrases were retrieved, and 14 documents were used as test data. The results revealed an average precision of 0.89, recall of 0.93, and f-measure of 0.91.


Introduction
The psychological test administrator has several documents stored in a computer directory. One of the documents stored is the EPPS test result; the number of test result documents corresponds to the number of participants who completed the test. The EPPS (Edward Personal Preference Schedule) test is a psychological test that is taken by an individual test taker or administered by an organization. The EPPS test is widely used in the education sector, mainly to select prospective students in university admissions. Normally, the result of the EPPS test is stored as a pdf document containing the psychogram matrix and the concluding result of the test taker's performance.
As the new academic year begins, multitudes of prospective students apply for university admission based on their desired majors. The influx of applicants' data, along with the results of their EPPS tests, results in a huge pile of data that requires large data storage in every test period. Given the large number of new applicants' data and new information, it is difficult for the organizers to provide prospective students with accurate results of the selection process. The ineffective and incomprehensive data display requires the organizers to re-filter a number of documents and locate the particular search result according to the predetermined assessment. Moreover, this condition is worsened by the absence of recorded data that can display new information according to the existing data in a more efficient presentation that is easily understood by both experts and organizers, which remains an obstacle at this time. At certain times, it is also difficult to locate documents containing required words. The only way to find certain information in a set of documents is to look for the similar words contained in a document. This process is normally done manually, which is definitely time consuming, especially when documents must be located in large numbers.
Searching a certain document by way of an index method helps in finding indexes that contain the required aspect of EPPS test results and locating the index of the words searched for in the document. It is possible to classify the document based on its similarity through the method of word indexing from a set of documents. The indexing process aims to change an unstructured document into a structured document to ease the process of finding the index of the necessitated word and thus to display the document [1].
Given the above explanation, this research aims to propose a method to ease the organizers in locating required data and filtering the number of potential applicants by finding the important and required information in the EPPS psychological test conclusions using the predetermined aspects of need and by locating the position of the stored documents. This research targeted high school students majoring in Science who applied to the Medical Faculty of UII.
On this account, the researchers propose the indexing method as an efficient way to search for information and document similarities, to find suitable information regarding the required aspects contained in the EPPS test conclusions, and to compare recorded data with new data in order to get the highest results based on the aspects needed by the organizer. In this system, a dictionary of meanings related to the aspects of need is developed in the measuring instrument so that searches can be specific to the content of the documents in the referenced directory. This research is formulated to answer the following research question: (1) how can the information contained in the psychogram matrix results of the EPPS test be found and displayed to facilitate the search for the desired needs?
This paper is written based on the following structure of a research paper: the background, review of related works (literature), research methodology, the results and discussion, and the conclusions of this study.

Literature Review
The indexing method is the process of finding information from a large and unstructured document database in response to a required query [2]. The indexing process is carried out to display the desired information in the database by organizing an unorganized set of documents into more structured documents for easier retrieval and document finding based on the information contained in the required document [3].
This study addressed the EPPS (Edward Personal Preference Schedule) test, a form of personality test consisting of several answer choices to be selected by individuals according to the available questions. The test aims to find out personal needs that reflect the individual's personality [4]. It was developed by Allen L. Edwards and measures fifteen required aspects drawn from twenty compiled points that refer to Murray's personality concept. This personality test contains 225 forced-choice questions and is objective and non-projective. The EPPS test is administered to observe the specific needs of an individual through 15 required aspects [5].
Several previous studies on this topic serve as references for this study. The first research discussed the use of the indexing method in the search process for documents generated by manual search results. The advantage of index retrieval lies in structured data storage, which makes it possible to store large amounts of varied documents. The search process is done by creating a word index from the document [1].
The second study raised the topic of searching text documents of archived letters using indexing and query methods. This research discussed the process of restoring information with an index so that documents can be retrieved when needed and located easily. It also used the index method to retrieve an index that has been registered in the database to find the desired sentence. With more detailed processing in the query development process, word grouping can be summarized more simply [2].
The third research entitled "Single Document Keywords Extraction in Indonesian using Phrase Chunking" [6]. This research discussed the formation of keywords from documents automatically using phrase concatenation. Phrase chunking was done using the Part of Speech (POS) pattern to extract phrases and keywords. Keywords were selected from the candidates based on the number of words in the phrase and several types of phrase data forms. The shortest keywords with reduced strings were extracted from the abstract and sorted by their highest frequency values. This study conducted an experiment using types of phrases such as different sources of keywords, variations in phrase patterns and provided performance comparisons between types of phrases.
The fourth research discussed the use of a retrieval approach as a solution to existing problems by relying on concepts rather than keywords for indexing and retrieval. It aimed to retrieve documents that were relevant to a specific user request. Indexing and weighting of documents and queries were done using semantic entities, concepts, and keywords. In this approach, concepts were identified from the document and measured according to their frequency distribution and their similarity to other concepts in the document [7].
The technique widely used for text documents is the IR (Indexing and Retrieval) technique. The purpose of an IR system is to retrieve relevant material from a document database in response to a query [8]. The fifth study discussed the design of an information retrieval system employing the Vector Space Model for tracking, along with the Cosine Similarity measure as the method to rank the documents that match the keywords/query [9].

Research Methodology
We propose an automatic keyword extraction technique using phrase generation in Indonesian documents. Figure 1 shows the methodology of this research.

Document Conversion
At this stage, the prospective applicants' psychological test results are collected into one EPPS folder in pdf form. These documents are the input data for further processing. The EPPS test results contain two paragraphs of sentences relating to the applicant's most and least required aspects. The data on the EPPS test results are taken by indexing the "description" paragraph that appears in the document and are then processed further in the preprocessing stage.
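The extraction of the "description" paragraph can be sketched as follows. This is a minimal illustration, assuming the pdf text has already been extracted to plain text; the function name and sample text are hypothetical, not from the paper.

```python
# Hypothetical sketch: pull the "description" paragraph out of the plain text
# of a converted EPPS result. The sample text below is illustrative only.

def extract_description(text):
    """Return the paragraph that follows the 'description' heading."""
    paragraphs = text.split("\n\n")
    for i, para in enumerate(paragraphs):
        if para.strip().lower() == "description":
            return paragraphs[i + 1].strip()
    return ""

sample = (
    "Psychogram Matrix\n\n"
    "description\n\n"
    "The applicant shows a high need for achievement. "
    "However, the need for deference is low."
)
print(extract_description(sample))
```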

Preprocessing
The preprocessing is an early stage prepared to carry out further processing of the text in the data before the indexing process. This preprocessing stage only includes several steps, namely casefolding (changing letters to lowercase) and tokenizing (separating each word onto one line). The results of the preprocessing will later be saved and used as input for the subsequent indexing process.
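These two preprocessing steps can be sketched as below. The paper uses `text.lower()` and NLTK's `word_tokenize`; here a simple regular expression stands in for the tokenizer so the sketch is self-contained.

```python
import re

def preprocess(text):
    """Case-fold, then tokenize; re.findall stands in for nltk's word_tokenize."""
    lowered = text.lower()           # casefolding
    tokens = re.findall(r"\w+", lowered)  # tokenizing: one word per element
    return tokens

print(preprocess("The applicant shows a HIGH need for Achievement."))
```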

Split Data
At this stage, the dataset is divided randomly into training data and test data with a ratio of 80%: 20%. The training data will be used as a source of formation of a list of phrases and labels, which will later be matched with the document. Meanwhile, test data is useful for looking for occurrences with a value similar to the recorded data.
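The 80:20 random split can be sketched as follows; the file names and the seed are illustrative, but the held-out fraction matches the paper's 60 training and 14 test documents.

```python
import random

def split_data(filenames, test_ratio=0.2, seed=0):
    """Shuffle file names, then hold out roughly test_ratio of them as test data."""
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[:-n_test], shuffled[-n_test:]

files = [f"epps_{i:02d}.pdf" for i in range(74)]
train, test = split_data(files)
print(len(train), len(test))  # 60 14, matching the paper's split
```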

Manual Labeling
At this point, the converted csv file will contain a table of "phrases" and "labels". The phrase column is filled by the system from the conversion results, while the label column is left empty and requires manual labeling by the researcher. The label is entered manually based on the abbreviated name of the aspect of need. A phrase is labeled "other" ("O") if it is not related to any aspect, or with the initials of one of the 15 aspects of need if it is related to that aspect.
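The resulting phrase/label table can be sketched as a small csv; the rows and label abbreviations below are illustrative examples, not taken from the paper's data.

```python
import csv
import io

# Illustrative rows: extracted phrases paired with manually assigned labels
# ("O" for unrelated phrases, otherwise one of the 15 need abbreviations).
rows = [
    ("need for achievement", "ACH"),
    ("need for deference", "DEF"),
    ("the applicant shows", "O"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["phrase", "label"])  # table header
writer.writerows(rows)
print(buf.getvalue())
```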

Phrases List Formation
At this stage, the labeled training data will take the index of every word that contains a phrase with a label other than "other", to separate the phrases, which will be stored in a new csv file named "phrase_list.csv". Each pair of phrase and label has been manually inputted by the expert. The phrases in the phrase list are sorted based on the length of the labeled words: phrases of three or more words/tokens, labeled by the author, appear at the top of the list, while phrases with fewer words appear in the bottom lines of the phrase list.
In this study, the word indexing method was used to find information related to potential applicants; the information to find is the label on each file. At this stage, the system identifies phrases by matching the labels specified in the phrase_list. To carry out this process, test data are needed to determine the suitability of the created model in finding the information that appears along with the names of the aspects of need. The process is identified based on the word "but" in the sentence, which makes the system classify the part before it as the highest need (most prioritized need) and the part after it as the lowest need (least prioritized need). Afterwards, the system checks the index for any phrases that appear in the document to identify sentences that refer to a label in phrase_list. The output provides a description of the phrases and labels that appear in each file, along with a summary of the aspects of need found in the file. The txt results are stored in a folder separate from the document folder. Figure 2 shows the information identification process.
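The phrase list formation and the matching step above can be sketched as follows. The phrases and labels are illustrative; the sorting reproduces the longest-first ordering described in the text.

```python
def build_phrase_list(labeled_phrases):
    """Keep only phrases labeled with a need (drop 'O'), longest phrases first."""
    kept = [(p, l) for p, l in labeled_phrases if l != "O"]
    return sorted(kept, key=lambda pl: len(pl[0].split()), reverse=True)

def identify_labels(sentence, phrase_list):
    """Return the labels of all phrases that occur in the sentence."""
    lowered = sentence.lower()
    return [label for phrase, label in phrase_list if phrase in lowered]

phrase_list = build_phrase_list([
    ("need for achievement", "ACH"),  # three-word phrase: sorted to the top
    ("deference", "DEF"),             # one-word phrase: sorted to the bottom
    ("the applicant", "O"),           # unrelated phrase: dropped
])
print(identify_labels("A strong need for achievement appears.", phrase_list))
```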

Document Search
Once the search results of the desired aspects of needs are known at this stage, these results will become a reference to see which prospective applicants of FK UII have the highest need in that aspect. Document searching makes it easier for the organizers to find documents of prospective applicants of FK UII based on the 15 desired needs as a component of the next assessment. For example, the system can recognize the need for "ACH" as the highest need, and thus it will display which applicants have the highest need for that aspect. Figure 3 shows the document search process.

Evaluation
Furthermore, to conduct model testing, we need the original data and the new data to be tested in order to obtain the similarity degree values. After obtaining a document containing relevant information, the test data are evaluated against the original data by looking for the similarity degree values. Then, each type of phrase form is evaluated based on its precision, recall, and f-measure. The data used as test data amount to 20% of the entire dataset.
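The three metrics can be computed over the retrieved and relevant label sets as sketched below; the example sets are illustrative, not the paper's results.

```python
def evaluate(predicted, relevant):
    """Precision, recall, and f-measure over sets of retrieved labels."""
    tp = len(predicted & relevant)  # labels both retrieved and relevant
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f_measure

# Illustrative example: 3 retrieved labels, 4 relevant labels, 2 in common.
p, r, f = evaluate({"ACH", "DEF", "ORD"}, {"ACH", "DEF", "AFF", "INT"})
print(round(p, 2), round(r, 2), round(f, 2))
```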

Result
Previous research indicates that the indexing method is a suitable method to solve the problems in document searching. The modeling stage of the obtained data aims to produce a system that is able to identify every word that appears related to the 15 aspects of predetermined needs and provides information on which aspects of need arise in the conclusion of the EPPS test. To create this model, the researchers used the Python 3 programming language through the terminal and the Visual Studio Code IDE.
As explained in section 3.1, the EPPS test result documents, saved as pdf files, are converted into csv files by taking the paragraph at the "description" index. Before converting the data into a csv file, the text data is converted to txt format to get all the text in the pdf file. The document conversion process requires preprocessing: casefolding using the text.lower() function and tokenization using the word_tokenize function. The output of one of the 74 documents after conversion and preprocessing is presented in Figure 4. Once the documents were converted, the 74 documents were split into two parts, namely train data and test data, with a default test ratio of 20%. The converted train and test data were labeled manually by the researcher by looking at several criteria from the aspects of requirements on the EPPS test. Labeling is done directly from the csv file of each document. The criteria for the aspects of need, along with keywords, are presented in Table 1.
Furthermore, in the process described in section 3.4, the text in the "description" is separated and stored in the csv "phrase_list" with a table containing pairs of phrases and labels. The sorting process uses the "sorted" function to sort the phrases labeled by the authors. Phrases in phrase_list are sorted based on whether they are labeled with three, two, or only one word, so that they are stored from the longest to the shortest phrase entered by the author. The phrase list can be seen in Figure 5.

Information Identification
In the process of identifying information, the system will display the meaning contained in the EPPS test psychogram to make it easier to find out which aspects of need appear in the EPPS test conclusions of prospective applicants. For information identification, we need to look for an index to separate the paragraphs describing the highest and lowest needs. In this case, the word that distinguishes between the highest and lowest needs is the word "however" at the beginning of a sentence. Some characters that appear are also replaced using the regular expression library in Python. Figure 6 shows the output of the information identification based on phrase_list.csv. The phrase "Highest need" displays the phrase list containing the highest needs of the EPPS test result, while the phrase "Lowest need" displays the phrase list containing the applicant's lowest needs. The phrase list shows information about the phrases and labels written in the sentence that concludes the EPPS test.
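Splitting the description on the word "however" can be sketched as below; the function name and sample conclusion are illustrative, not from the paper.

```python
def split_needs(description):
    """Split the description on 'however' into highest- and lowest-need parts."""
    lowered = description.lower()
    idx = lowered.find("however")
    if idx == -1:
        return description.strip(), ""  # no marker: everything is highest need
    highest = description[:idx].strip()
    lowest = description[idx + len("however"):].strip(", .")
    return highest, lowest

desc = ("The applicant has a strong need for achievement. "
        "However, the need for deference is low.")
highest, lowest = split_needs(desc)
print(highest)
print(lowest)
```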

Document Search
Once the system finds and identifies the information, it becomes easier to search for documents through the aspects of need listed in the results of each document identified in the previous process. This document search can be done every time a user wants to find information from multiple documents based on which of the 15 aspects of need is required. If the user searches for "ACH" and "DEF", the system will search the documents for those words as the highest aspects of need according to the desired result. Figure 7 shows the results of the document search process using the desired labels "ACH" and "DEF", which can be stored in the form of a txt file as required by the experts.
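This search step can be sketched as a lookup over an index of highest-need labels per file; the file names and index contents are hypothetical examples.

```python
# Hypothetical index: file name -> labels identified as that applicant's
# highest needs in the previous identification step.
highest_needs = {
    "applicant_01.txt": ["ACH", "ORD"],
    "applicant_02.txt": ["DEF"],
    "applicant_03.txt": ["ACH", "DEF"],
}

def search_documents(wanted, index):
    """Return files whose highest needs include every requested label."""
    return sorted(name for name, labels in index.items()
                  if all(w in labels for w in wanted))

print(search_documents(["ACH", "DEF"], highest_needs))
```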

Evaluation
The next stage is to test and evaluate the test data against the original data on the system to look for the similarity degree values. It used the 14 test data that had been split previously in the process of section 3.2. In this evaluation, precision, recall, and f-measure are used to evaluate the system. The final results of the model testing result from the calculation of the average value for each document, shown in Table 2. Table 2 displays the average results of the evaluation using different types of phrase forms. This search obtained the best results of all reduced phrase form types, with an f-measure value of 0.97 for all data. The test results do not include accuracy, because accuracy is more appropriate for class distribution, while f-measure is a better metric for evaluating the model in imbalanced cases.

Discussion
In this study, the system was designed using an indexing method based on a list of phrases the researcher created to obtain information in the "description" text of the EPPS test results. Based on the literature review, the indexing method in this study is carried out by adding a list of phrases as a dictionary to help identify information and to find out which documents are needed by the organizer. Figure 3 displays the 60 lists of phrases that are formed. Phrase list formation is done by looking at examples of phrase chunking and named entity recognition, where the type of chunking is based on the phrase tag used in the parse tree and on the method of classifying text phrases with certain rules based on word classes (POS) [10], [11].
The indexing method obtained quite good results since it only used a small amount of data. These results are obtained from a list of identified phrases, which is used as a reference. As a result, the system can match the information that appears in the EPPS test "description" text by producing phrases and labels along with the appearing aspects of need, easing users in reading the information. However, this method is still constrained when dealing with a huge amount of data, because it would need a thorough update to check the accuracy of the manual labeling. Since this case used manual labeling, it is necessary to add an update mechanism to check system performance and avoid errors in the inputted data. Therefore, it is pivotal that an expert in this field does the manual labeling to avoid misinformation.
In this case, the system does not focus on stopword removal, but on the removal of phrases manually labeled "O" (other) when creating the phrase list (phrase_list). Whether stopword removal should be used is a considerable topic for future research. In the evaluation, accuracy is not included in the calculation because, in the search results, the "True Negative" count is very large, which pushes accuracy close to 100% even though in reality the results are quite the reverse. As shown by the TN, some examples contain words that cannot be identified and should not be identified, but an accuracy formula would still count them, producing results close to 100%.

Conclusion
Based on all stages of the research, we can conclude that the method of identifying information through the indexing process is an applicable method. This method can be used to find information related to the aspects of need required from the structured EPPS test for prospective applicants, since it can search for information by referring to the information identified in the phrase_list. The evaluation obtained an f-measure of 0.91, a precision of 0.89, and a recall of 0.93.
However, this method is not a benchmark for creating a good system, since it still needs further updates and development. Therefore, further research is expected to develop the current work with improvements such as increasing the number of datasets and continuously updating the list of phrases in order to get better results. It is also necessary to constantly check and correct keywords in the converted PDF files. Other possible developments include using other methods more suitable than the indexing method.