HOAX DETECTION AT SOCIAL MEDIA WITH TEXT MINING CLARIFICATION SYSTEM-BASED



ABSTRACT
Hoaxes are a current problem that troubles the public and causes unrest in many fields, from politics, culture, and security and order to economics. The problem cannot be separated from the rapid growth of social media use. Every day, thousands of pieces of information that are not necessarily valid spread on social media, so people are potentially exposed to hoaxes there. The hoax detection system in this study was designed with an unsupervised learning approach, so it does not require training data. The system uses the TextRank algorithm for keyword extraction and the cosine similarity algorithm to calculate the level of document similarity. The extracted keywords are used to search for content related to the user's input through a search engine, and the similarity between that content and the input is then calculated. If the related content tends to come from trusted media, the input is potentially factual. Likewise, if the related content tends to be published by unreliable media, the input is potentially a hoax. The system was tested with a confusion matrix on 20 news items, consisting of 10 true issues and 10 false issues. The system classified 13 issues as false and 7 as true, and 15 of these classifications matched the original labels, giving an accuracy of 75%.

I. INTRODUCTION
Every day, hundreds or even thousands of pieces of information are distributed through social media by its users [1]. Information can affect the emotions, feelings, thoughts, or even actions of an individual or group. It is unfortunate if that information is inaccurate or even false (a hoax), with provocative titles that lead readers and recipients to negative opinions [2]. A hoax (pronounced /hōks/) is a message or news item that tries to convince the reader that it is true and then tries to persuade the reader to take certain actions. The distribution of a hoax depends on readers who intentionally send the message or news to other potential victims, who might in turn do the same [3]. In Indonesian, the term hoax has been absorbed as hoaks, which is listed in the Indonesian Big Dictionary (KBBI) and defined as untrue news [4]. Muhammad Alwi Dahlan considers a hoax to be news that has been intentionally manipulated with the aim of giving a false impression or understanding. The communication expert from the University of Indonesia (UI) also explained that a hoax tends to be planned in advance when compared with ordinary hoaxes. A hoax contains fraudulent facts that attract public attention [5].
Danarka Sasongko, a lecturer in communication studies at Atmajaya University in Yogyakarta, argued that people still cannot distinguish what is true from what is not. According to him, this is due to the public's low literacy regarding messages on social media [6]. Budi Sutedjo explained that media literacy is the ability of readers to trace, critique, and rewrite the information they receive. The information technology (IT) expert from Duta Wacana Christian University in Yogyakarta also considered that media literacy could counteract the spread of hoax news [7].
Current technology should also be able to play a role in overcoming this problem. One such technology is text mining, a variation of data mining that extracts useful information by identifying and exploring interesting patterns in collections of unstructured textual data [8]. Given this capability, the authors see an opportunity to build machines that help humans perform media literacy tasks automatically.
Related research has been carried out by Dyson and Golab, who explored natural language processing methods to detect misleading news sources. Their findings indicate that TF-IDF computed over bi-grams works quite well for identifying unreliable sources, while features based on PCFG have no significant effect [9].
Rasywir and Purwarianti have also experimented with a machine-learning-based hoax news classification system. The experiment was conducted to select the best technique for each sub-process, using 220 Indonesian-language articles on 22 topics (89 hoax articles and 131 non-hoax articles). The Naive Bayes algorithm showed the best accuracy, 91.36%, compared with SVM and C4.5 [10].
Previous studies use English-language news sources and a supervised learning approach, which requires training data. Although Rasywir and Purwarianti used Indonesian-language news sources, their approach is still constrained by the lack of training data available in Indonesian. Because of these limitations, this study proposes an approach that needs no training data, known as unsupervised learning. The authors use the TextRank algorithm for keyword extraction and the cosine similarity algorithm to measure the level of document similarity. By combining these two algorithms, a system is built that can measure the likelihood that a news item is a hoax.

II. METHOD
A. TextRank
TextRank is an unsupervised, graph-based ranking method. It was developed from the PageRank method [11]. The graph ranking model proposed by Mihalcea & Tarau applies a "voting" stage to each word (vertex) in the graph. A vertex is considered important if it receives more votes than other vertices. The score of each vertex in the graph is determined by the following equation:

$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

where S(V_i) is the score of vertex V_i, In(V_i) is the set of vertices that point to V_i, Out(V_j) is the set of vertices that V_j points to, and d is the damping factor, which is set to 0.85.
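The voting recurrence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `textrank_scores`, the undirected co-occurrence graph, the `window` size, and the fixed number of iterations are all hypothetical choices that the paper does not specify.

```python
from collections import defaultdict


def textrank_scores(tokens, window=2, d=0.85, iterations=50):
    """Score words with the TextRank recurrence on a co-occurrence graph.

    Assumption: two words are linked (undirected) when they occur within
    `window` consecutive tokens of each other.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every vertex starts with the same score and is updated repeatedly.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            # In an undirected graph, In(V) and Out(V) are both the neighbor set.
            votes = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - d) + d * votes
        scores = new_scores
    return scores


tokens = ["hoax", "news", "spreads", "hoax", "news", "fast"]
ranked = sorted(textrank_scores(tokens).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # the word with the most co-occurrence links ranks first
```

In this toy input the word "news" is linked to three other words, so it collects the most votes and receives the highest score.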

B. Cosine Similarity
Cosine similarity is a similarity measure commonly used in information retrieval; in this study it is used to calculate the similarity of documents. The formula for cosine similarity is [12]:

$$\cos\theta = \frac{A \cdot B}{|A|\,|B|}$$

where A and B are the two vectors whose similarity is being compared, A · B is the dot product of vector A and vector B, |A| is the length of vector A, |B| is the length of vector B, and |A||B| is the product of |A| and |B|.

C. Confusion Matrix
A confusion matrix is a matrix that records, for the test data, how many items fall into each comparison between the actual class and the class assigned by the system [13], [14]. The accuracy derived from the confusion matrix is stated in the following equation:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is True Positive, the number of positive data items correctly classified by the system; TN is True Negative, the number of negative data items correctly classified by the system; FN is False Negative, the number of positive data items incorrectly classified as negative by the system; and FP is False Positive, the number of negative data items incorrectly classified as positive by the system.
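Returning to the cosine similarity measure of subsection B, the following is a minimal sketch of how it might be computed for two texts. It assumes whitespace tokenization and raw term-frequency vectors; the paper does not state how its document vectors are built or weighted, and the function name `cosine_similarity` is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    # Assumption: lowercase whitespace tokens with raw counts as vector entries.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


print(cosine_similarity("vaccine causes illness", "vaccine causes no illness"))
```

Identical texts score 1, texts with no shared terms score 0, and partial overlap falls in between.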

D. Mining Process Design
In general, the flow chart in Figure 1 illustrates how the hoax detection process works. It begins with the user entering news text into the system, followed by the keyphrase generation process, in which the system generates the key phrases used to search for related content through the Google search engine.

FIG. 1 GENERAL DIAGRAM OF THE HOAX DETECTION FLOW
After the system obtains a list of related content, it scrapes each item. The similarity of each scraped item to the user's input is then calculated to find which item is most similar. Content that meets the similarity tolerance limit of 40% proceeds to the next step, which calculates the percentage probability of hoax or fact. If all of the related content falls below the predetermined tolerance limit, the process cannot proceed.
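The decision step described above might look like the sketch below. Only the 40% tolerance limit comes from the text. The scoring rule (the share of sufficiently similar results that come from trusted media), the function name `assess`, the `>=` comparison, and the example domains are assumptions made for illustration, because the paper does not spell out how the percentage is computed.

```python
SIMILARITY_THRESHOLD = 0.40  # the 40% tolerance limit stated in the text


def assess(related, trusted_domains, threshold=SIMILARITY_THRESHOLD):
    """Estimate fact/hoax percentages from scraped search results.

    `related` is a list of (domain, similarity) pairs. Returns None when no
    result reaches the threshold, mirroring the case where the process
    cannot proceed.
    """
    relevant = [(domain, sim) for domain, sim in related if sim >= threshold]
    if not relevant:
        return None
    # Assumed rule: trusted sources count toward "fact", all others toward "hoax".
    trusted = sum(1 for domain, _ in relevant if domain in trusted_domains)
    fact_pct = 100.0 * trusted / len(relevant)
    return {"fact": fact_pct, "hoax": 100.0 - fact_pct}


results = [
    ("trusted.example", 0.80),
    ("blog.example", 0.50),
    ("trusted.example", 0.20),  # below the threshold, so it is ignored
]
print(assess(results, trusted_domains={"trusted.example"}))
```

Keeping the threshold as a parameter makes it easy to study how sensitive the classification is to the 40% limit.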

FIG. 2 THE VIEW OF THE INPUT FORM
Figure 2 shows the page for entering the content whose probability will be calculated. After the content is entered, the next step is tokenization. The results are displayed in the token table, which lists each token with its POS tag, as shown in Figure 3. The Keywords tab lists the keywords obtained with the TextRank method, each with its own score, as shown in Figure 4. In Figure 5, the Keyphrase tab shows the keyphrase formed by combining the keywords generated in the previous stage. The keyphrase is then used to search for related content through the Google search engine.

FIG. 5 THE VIEW OF KEYPHRASE RESULTED
Figure 6 shows that, after all the content has been scraped, cosine similarity is calculated to find which content is most similar or most relevant to the user's input. The test uses 20 randomly chosen content items (issues) that have been verified by CekFakta.com and labeled TRUE or FALSE. These items serve as testing data, and their labels are compared with the classifications produced by the system. Table 1 gives the confusion matrix. The accuracy testing data in Table 1 contain 20 issues, consisting of 10 true issues and 10 false issues. The system classified 13 issues as false and 7 as true, and 15 of these classifications matched the original labels. Based on these results, an accuracy of 75% was obtained.
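The individual cells of the confusion matrix are not reproduced here, but they follow from the totals reported above. The sketch below recovers them and recomputes the accuracy. Treating a false issue (hoax) as the positive class is an assumption, and the function names are illustrative.

```python
def cells_from_totals(total, actual_pos, predicted_pos, correct):
    """Recover (TP, FP, FN, TN) from the totals of a 2x2 confusion matrix."""
    # TP + FN = actual_pos, TP + FP = predicted_pos, TP + TN = correct and
    # TP + FP + FN + TN = total, so 2 * TP = actual_pos + predicted_pos + correct - total.
    twice_tp = actual_pos + predicted_pos + correct - total
    if twice_tp < 0 or twice_tp % 2:
        raise ValueError("totals are inconsistent")
    tp = twice_tp // 2
    fp = predicted_pos - tp
    fn = actual_pos - tp
    tn = correct - tp
    if min(fp, fn, tn) < 0:
        raise ValueError("totals are inconsistent")
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all test items that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Totals from the text: 20 issues, 10 false, 13 classified as false, 15 correct.
tp, fp, fn, tn = cells_from_totals(total=20, actual_pos=10, predicted_pos=13, correct=15)
print(tp, fp, fn, tn, accuracy(tp, fp, fn, tn))
```

With these totals the sketch gives TP = 9, FP = 4, FN = 1, and TN = 6, and an accuracy of 0.75, which agrees with the 75% reported.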

IV. CONCLUSION
Based on the implementation and testing carried out, it can be concluded that the TextRank algorithm and the cosine similarity algorithm can be combined to help classify news content as hoax or fact, with an accuracy of 75%.

FIG. 3 THE VIEW OF TOKEN TABLE IN THE FORM PIPE

FIG. 4 THE VIEW OF KEYWORD TABLE IN THE FORM PIPE

FIG. 6 THE VIEW OF COSINE SIMILARITY RESULT

FIG. 7 THE VIEW OF CALCULATING RESULT TABLE IN THE FORM PIPE