Sentiment Analysis on the FIFA U-20 World Cup in Argentina Using Support Vector Machine

Abstract


A. Introduction
Advances in information technology have laid a strong foundation for social change, particularly in communication and the global dissemination of information. In Indonesia, the 2019-2020 survey by the Indonesian Internet Service Providers Association (APJII) shows that Internet usage is highest among those aged 15-19 (91%), followed by those aged 20-24 (88.5%), with use centered on social media (51.5%) and communication (32.9%) [1]. The We Are Social report revealed that 213 million people in Indonesia were online in January 2023, comprising 77% of the population, marking a 5.44% increase from the previous year and a total of 142.5 million new users since 2013. These advances have led to the emergence of social media platforms such as Facebook, Twitter, and Instagram, where users engage, share content, and make social connections [2]. The impact of social media on communication, information sharing, and social relationships is significant. Social media users worldwide currently number 4.76 billion, or 59.4% of the world's population. In Indonesia, social media users make up 60.4% of the country's total population for the same period [3].
Twitter has emerged as a popular mode of communication in Indonesian society due to its growing user base and rapid dissemination of trending information and news [4]. Twitter users can post and receive real-time status updates, known as "tweets." Tweets capture individual preferences, dislikes, and feedback on various subjects [5]. According to We Are Social, there were 564.1 million Twitter users globally in July 2023. Global Twitter users increased by 16.1% year-on-year and 51.3% quarter-on-quarter. Indonesia ranked fourth globally, up from sixth in May 2023, with 25.25 million users. The United States remained the top country with 98.5 million Twitter users in July 2023. Japan ranked second with 67.9 million users, followed by India with 26.45 million users. Brazil and the United Kingdom had 22.95 million and 22.7 million users respectively.
The FIFA World Cup is an international football competition that brings together national teams from around the world. Teams that qualify compete in the tournament for the title of champion. The FIFA U-17 and U-20 World Cups are international youth football tournaments classified by age. National team players who are under 20 years old are referred to as "U-20" [6]. The FIFA World Cup itself is contested by the senior men's national teams of every eligible country. FIFA, the highest football organization in the world, organizes these championships [7]. As one of the biggest and most anticipated sports events in the world, the FIFA World Cup has become a prestigious platform for host countries [8]. A traditional highlight of each FIFA World Cup is its official soundtrack or theme song, typically created by the host country. FIFA chose to use "Glorious," a soundtrack produced by the Indonesian music group Weird Genius, sparking comments from the general public, especially football fans. The decision has triggered varied public responses to the soundtrack and has also raised concerns about other aspects of the tournament, such as the economic crisis in Argentina, the host country, and its impact on facilities and infrastructure.
To help FIFA and the host of the FIFA U-20 World Cup assess whether their actions are advancing their objectives, Twitter discourse provides a wealth of comments for analysis. These remarks can shed light on how FIFA's choices affect the tournament's image, the music scene, and the host nation. Sentiment analysis is therefore needed to understand more deeply the public's response to FIFA's decisions and the host's role in organizing the U-20 World Cup. By examining comments on Twitter, patterns of positive, negative, or neutral sentiment in online discussions can be identified. This can provide valuable insights for FIFA and the host to adjust their strategies to better support their shared goals in running this global football event.
This study analyzes sentiment in Twitter comments using the Support Vector Machine (SVM) algorithm. SVM was chosen because it represents the different classes in a multidimensional space [9]. The generated hyperplane functions as a boundary that separates the classes [9]. SVM can delineate classes effectively by creating a linear boundary with a large margin, and it can handle feature vectors of arbitrarily high dimension [10]. Furthermore, lexicon-based labeling, in which words are assigned weights beforehand, is used to determine the magnitude of the expressed sentiment [11]. This study aims to ascertain the sentiment that prevailed during the FIFA U-20 World Cup in Argentina and to introduce audiences unfamiliar with the SVM algorithm to its use as a tool that can help FIFA and the host nation accomplish their objectives. The expected outcome of this research is further information on Twitter users' responses and reactions to various topics, such as the selection of the song for the FIFA U-20 World Cup soundtrack and how it affects the country's image, the local music industry, and the organization of the tournament.

B. Research Methodology
The research methodology consists of a series of stages for sentiment analysis, as illustrated in Figure 1. In the first stage, links from Twitter (X) are supplied as input and crawling is used to extract the raw dataset. The raw dataset then enters the Pre-Processing stage, which turns it into a clean dataset. Pre-Processing comprises Data Cleaning, Case Folding, Tokenizing, Language Normalization, Stopword Removal, Stemming, and Labeling. After Pre-Processing, the dataset enters the Weighting stage, which uses the TF-IDF method to extract features that identify the most important keywords in a document or dataset. The weighted dataset is then classified automatically using the Support Vector Machine method. Once classification is complete, the SVM method and the sentiment classification results on the dataset are evaluated.
The data originate from Twitter (X), https://twitter.com/, which serves as the primary data source. The main data for this research were collected through the Twitter API and comprise approximately 2,400 tweets related to the FIFA U-20 World Cup held in Argentina. The dataset consists of Indonesian-language tweets.

Pre-Processing
The pre-processing phase in text analysis involves several steps: Data Cleaning, Case Folding, Tokenizing, Language Normalization, Stopword Removal, Stemming, and Labeling. This stage is crucial in preparing text data for further analysis because it removes noise, standardizes the format, identifies units of information, reduces lexical variation, eliminates irrelevant words, normalizes words, reduces the feature dimension, and provides appropriate labels for the intended analysis.
Step 1: Data Cleaning prepares data for analysis by removing invalid, incomplete, or irrelevant data and by addressing issues such as duplicates and outliers that can affect the integrity and interpretation of the data. Proper data cleaning yields a clean and consistent dataset ready for processing, which improves the quality of the analysis and reduces errors, as seen in the results in Table 1.
Step 2: Case Folding is a text standardization process that converts all alphabetic characters to a single case, here lowercase; non-alphabetic characters are not affected by the conversion itself. URLs, numbers, and punctuation marks are also removed so that the text is consistent, as seen in the results in Table 2.
Step 3: Tokenization is the process of dividing text into smaller pieces known as tokens. Depending on the requirements of the text processing task and the complexity of the analysis, tokens may be words, phrases, or characters. As the results in Table 3 show, tokenization also makes it possible to normalize words and eliminate superfluous characters, which facilitates text processing and increases analysis accuracy.

Case Folding: logonya pildun ini justru lebih bagusan yang argentina ketimbang di indo
Tokenizing: ['logonya', 'pildun', 'ini', 'justru', 'lebih', 'bagusan', 'yang', 'argentina', 'ketimbang', 'di', 'indo']
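The first three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the regular expressions, and the raw input string (including its capitalization and URL) are hypothetical, chosen so that the output matches the tokenized example in the table above.

```python
import re
from typing import List

def clean_text(text: str) -> str:
    """Data cleaning: drop URLs, mentions/hashtags, digits, and punctuation."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # anything that is not a letter or whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

def case_fold(text: str) -> str:
    """Case folding: convert every letter to lowercase."""
    return text.lower()

def tokenize(text: str) -> List[str]:
    """Tokenizing: split on whitespace into word tokens."""
    return text.split()

# Hypothetical raw tweet; only the cleaned wording comes from the paper's example.
raw = "Logonya PILDUN ini justru lebih bagusan yang Argentina ketimbang di Indo!! https://t.co/xyz"
tokens = tokenize(case_fold(clean_text(raw)))
print(tokens)
```

Whitespace splitting is the simplest possible tokenizer; a study working with emoji or multi-word expressions would likely need something more careful.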
Step 4: Language Normalization converts the words in the text to their standard forms, for example by replacing slang and abbreviations with standard words, so that the text is consistent for computational systems, as seen in the results in Table 4.
Step 5: Stopword Removal improves computational efficiency by eliminating unimportant words. It removes common words such as "and" and "or" so that the analysis can focus on the words that carry meaning. This aids in understanding the content and extracting accurate information, and it improves text processing performance by reducing the feature space, as evidenced by the results in Table 5.

Language Normalization: ['logonya', 'piala dunia', 'ini', 'justru', 'lebih', 'bagusan', 'yang', 'argentina', 'daripada', 'di', 'indonesia']
Stopword Removal: ['logonya', 'piala dunia', 'bagusan', 'argentina', 'indonesia']
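Normalization and stopword removal can both be expressed as dictionary and set lookups over the token list. The sketch below is illustrative only: the slang dictionary and the stopword list are hypothetical and contain just the entries needed to reproduce the example above, whereas a real study would use a much larger curated resource.

```python
from typing import Dict, List, Set

# Hypothetical slang-to-standard mapping (assumption, not the authors' dictionary).
SLANG: Dict[str, str] = {
    "pildun": "piala dunia",
    "ketimbang": "daripada",
    "indo": "indonesia",
}

# Hypothetical stopword list covering only the words dropped in the example.
STOPWORDS: Set[str] = {"ini", "justru", "lebih", "yang", "daripada", "di"}

def normalize(tokens: List[str]) -> List[str]:
    """Replace each token with its standard form when one is known."""
    return [SLANG.get(token, token) for token in tokens]

def remove_stopwords(tokens: List[str]) -> List[str]:
    """Keep only tokens that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]

tokens = ['logonya', 'pildun', 'ini', 'justru', 'lebih', 'bagusan',
          'yang', 'argentina', 'ketimbang', 'di', 'indo']
normalized = normalize(tokens)
filtered = remove_stopwords(normalized)
print(normalized)
print(filtered)
```

Running normalization before stopword removal matters here: "ketimbang" only becomes removable after it has been mapped to "daripada".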
Step 6: Stemming reduces words to their base form by removing inflections and affixes. It helps identify words that share the same root, which simplifies text comparison and makes text analysis more efficient. For example, stemming reduces "running" to "run" without losing the underlying meaning. It also improves consistency and reduces complexity in language models, as evident in the results in Table 6.
Step 7: Labeling is the process of assigning labels to sentences or words based on the emotions or attitudes they express. The main goal is to identify whether the text conveys positive, negative, or neutral sentiment, here using a lexicon-based method, as seen in the results in Table 7.
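Lexicon-based labeling can be illustrated as a weighted word lookup followed by a sign test on the total score. The sketch below assumes the tokens have already been stemmed; the lexicon entries, their weights, and the zero threshold are hypothetical and are not taken from the lexicon used in the study.

```python
from typing import Dict, List

# Hypothetical sentiment lexicon: word -> weight (assumption for illustration).
LEXICON: Dict[str, int] = {
    "bagus": 1,    # "good"
    "keren": 1,    # "cool"
    "jelek": -1,   # "bad"
    "kecewa": -1,  # "disappointed"
}

def sentiment_score(tokens: List[str]) -> int:
    """Sum the lexicon weights of all tokens; unknown words contribute 0."""
    return sum(LEXICON.get(token, 0) for token in tokens)

def label(tokens: List[str]) -> str:
    """Map the total score to a positive, negative, or neutral label."""
    score = sentiment_score(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label(["logo", "bagus"]))             # score +1
print(label(["kecewa", "jelek", "bagus"]))  # score -1
print(label(["piala", "dunia"]))            # score 0
```

A text with no lexicon hits and a text whose positive and negative words cancel out both end up neutral under this rule, which is one reason lexicon coverage strongly affects the label distribution.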

TF-IDF Weighting
Term Frequency-Inverse Document Frequency (TF-IDF) transforms textual data into numerical vectors, which improves the efficiency and accuracy of text processing tasks [11]. Valued for its effectiveness, ease of use, and precision, the TF-IDF method calculates the Term Frequency (TF) and Inverse Document Frequency (IDF) of each word across documents, so that a word's weight reflects both how often it occurs in a document and how rare it is across the dataset, as seen in the results in Table 8.
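To make the weighting concrete, the sketch below computes one common TF-IDF variant by hand: tf is the term count divided by the document length, and idf is the natural logarithm of N / df. The paper does not state which variant or library was used, so both the formula choice and the three toy documents are assumptions; library implementations often add smoothing and normalization and will give different numbers.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical pre-processed tweets (illustrative only).
docs = [
    ["logo", "piala dunia", "bagus", "argentina"],
    ["piala dunia", "argentina", "krisis"],
    ["lagu", "piala dunia", "bagus"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus "piala dunia" appears in every document, so its idf is log(1) = 0 and it receives no weight, while "logo", which appears in only one document, receives the largest weight in the first document.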

Support Vector Machine
Support Vector Machine (SVM) is a machine learning technique that identifies the optimal hyperplane dividing two classes in the input space; its classification algorithm builds a model from training data to predict the class of new, unseen test samples [12]. The margin is the distance between the hyperplane and the nearest data points of each class, and these nearest points are referred to as support vectors [13]. The optimal hyperplane is equidistant from the two classes [13]. SVM works especially well in high-dimensional spaces and performs well even when the number of dimensions is greater than the number of samples. Additionally, because the decision function uses only a subset of the training points, SVM is memory efficient [14].
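For reference, the standard textbook formulation of a linear SVM for two classes is given below. The symbols are assumptions introduced here, not notation from the paper: $\mathbf{x}_i$ is a TF-IDF feature vector, $y_i \in \{-1, +1\}$ its class, $\mathbf{w}$ and $b$ the hyperplane parameters, $\xi_i$ the slack variables, and $C$ the regularization constant.

```latex
\begin{align}
  f(\mathbf{x}) &= \mathbf{w}^{\top}\mathbf{x} + b,
  \qquad \hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr) \\
  \text{margin} &= \frac{2}{\lVert \mathbf{w} \rVert} \\
  \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
  \quad \text{subject to} \quad
  y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1 - \xi_i,
  \quad \xi_i \ge 0
\end{align}
```

Maximizing the margin is equivalent to minimizing $\lVert \mathbf{w} \rVert$, and the support vectors are the training points for which the constraint is active. Because this formulation is binary, a three-class problem (positive, negative, neutral) is typically handled by combining several binary classifiers, for example one-vs-rest or one-vs-one; the paper does not state which scheme was used.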
The classification outcomes of the Support Vector Machine (SVM) model serve as the starting point for a deeper analysis of the underlying patterns in the dataset. This evaluation requires careful examination of both the positive and negative sentiments expressed in the text, as well as the identification of the key features that most influence sentiment determination. It also involves examining the reasons behind the misclassifications observed in the results, to improve the overall understanding of the classification process.

Evaluation Model
After the Support Vector Machine (SVM) modeling process, the results are evaluated using the Confusion Matrix. The classification outcomes are represented by four terms: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Equation (1) is used to assess the accuracy, the ratio of correctly predicted data to the total data, which indicates how often the classification model predicts the data class correctly. Equation (2) is used to assess the precision, the proportion of data predicted as positive that is actually positive. Equation (3) is used to calculate the recall, the proportion of actual positive data that the model correctly predicts as positive. Equation (4) is used to calculate the F1-score, a statistic that uses the harmonic mean to balance precision and recall.
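The equations themselves are not reproduced in this text. Assuming they follow the standard definitions, with the numbering matching the references above, they would read as follows, where $P$ and $R$ abbreviate precision and recall:

```latex
\begin{align}
  \text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
  \text{Precision} &= \frac{TP}{TP + FP} \tag{2} \\
  \text{Recall}    &= \frac{TP}{TP + FN} \tag{3} \\
  \text{F1-score}  &= \frac{2 \cdot P \cdot R}{P + R} \tag{4}
\end{align}
```

These definitions are stated for a single positive class. With three sentiment classes, precision, recall, and F1-score are usually computed per class and then averaged.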

C. Result and Discussion
The sentiment analysis in this study uses a Support Vector Machine with lexicon-based labeling. In the data collection stage, the authors reviewed the information available on the Twitter (X) social media platform. Using the crawling method, the study gathered a dataset of 2,400 user comments related to the FIFA U-20 World Cup in Argentina from various accounts on the platform, as seen in Table 9. The dataset consists of Indonesian-language texts and covers a variety of user comments and perspectives on the event. All 2,400 comments were used in the study. They were divided evenly, with 50% allocated to training data and the remaining 50% to test data. To maximize the results of the sentiment analysis, several processes were applied. The first is pre-processing, which cleans and organizes the text data so that it is easier to analyze and improves the quality of the sentiment analysis. Its stages are data cleaning, lowercase conversion, tokenization, language normalization, removal of common words, stemming, and labeling.
The second process is TF-IDF weighting of the cleaned dataset to obtain numeric representations of the words in the text. TF-IDF identifies the most important or distinctive words in each document, which assists sentiment analysis by giving those words appropriate weights. After TF-IDF weighting, feature extraction is performed, followed by classification using the Support Vector Machine (SVM) method with a linear kernel. Classification is combined with oversampling, a technique commonly used to address class imbalance, here RandomOverSampler(random_state=50). The accuracy obtained before oversampling is 50%; after oversampling, it increases to 85%.
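The idea behind random oversampling is to duplicate randomly chosen minority-class samples until every class is as large as the largest one. The sketch below is a dependency-free stand-in written for illustration; it is not the RandomOverSampler implementation and will not reproduce its draws, and the toy labels are hypothetical. The seed of 50 simply mirrors the random_state mentioned above.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

def random_oversample(samples: List[str], labels: List[str],
                      seed: int = 50) -> Tuple[List[str], List[str]]:
    """Duplicate random minority samples until all classes match the majority size."""
    rng = random.Random(seed)
    by_class: Dict[str, List[str]] = {}
    for sample, lab in zip(samples, labels):
        by_class.setdefault(lab, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples: List[str] = []
    out_labels: List[str] = []
    for lab, items in by_class.items():
        # Draw with replacement from the class's own samples to fill the gap.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(lab)
    return out_samples, out_labels

# Hypothetical imbalanced training set: 6 negative, 2 neutral, 1 positive.
tweets = [f"tweet_{i}" for i in range(9)]
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 1
new_tweets, new_labels = random_oversample(tweets, labels)
print(Counter(new_labels))
```

Oversampling is normally applied to the training split only, so that duplicated samples cannot leak into the test data and inflate the reported accuracy.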
Based on the sentiment analysis results, the SVM method with a linear kernel has been implemented successfully on Twitter (X) user comments related to the U-20 World Cup in Argentina. In this investigation, 2,400 data points are pre-processed, cleaned, and classified using the Support Vector Machine (SVM) approach, with a 50:50 ratio of training data to test data. The training data handling procedure generates synthetic samples from the minority classes using the SMOTE approach to address overfitting in imbalanced data settings. The classification outcomes are presented in the confusion matrix in Figure 2, and Table 10 shows that negative sentiment accounts for 63.2%, neutral sentiment for 20.8%, and positive sentiment for the smallest share at 15.9%.
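For a three-class problem, accuracy, precision, recall, and F1-score can be read directly off the confusion matrix. The sketch below uses support-weighted averaging across classes, which is an assumption, since the paper does not state its averaging scheme. The matrix values are hypothetical and are not the values in Figure 2.

```python
from typing import List, Tuple

def weighted_metrics(cm: List[List[int]]) -> Tuple[float, float, float, float]:
    """cm[i][j] = number of samples of true class i predicted as class j.

    Returns (accuracy, precision, recall, f1), with the last three averaged
    over classes using each class's share of the samples as its weight.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        tp = cm[i][i]
        support = sum(cm[i])                         # true members of class i
        predicted = sum(cm[r][i] for r in range(n))  # samples predicted as class i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix; rows and columns ordered negative, neutral, positive.
cm = [
    [50, 5, 5],
    [4, 14, 2],
    [3, 2, 15],
]
accuracy, precision, recall, f1 = weighted_metrics(cm)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under support-weighted averaging, recall is always equal to accuracy, because the class weights cancel the per-class denominators. This is consistent with the conclusion's identical accuracy and recall figures of 85.71%.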

D. Conclusion
This research conducted sentiment analysis on data from the Twitter social media platform, using a dataset of 2,400 comments. The analysis followed a sequence of steps: Pre-Processing, TF-IDF weighting, and implementation of the Support Vector Machine (SVM) classification model with a 50:50 split between training and test data. The SVM algorithm categorized sentiment with an accuracy of 85.71%, precision of 85.98%, recall of 85.71%, and an F1-Score of 85.58%. The sentiment analysis using the SVM algorithm with a linear kernel found that negative sentiment was 63.2%, neutral sentiment was 20.8%, and positive sentiment had the smallest proportion at 15.9%. The results of this research are expected to provide insights for the world of football, especially for football organizations and countries hosting football competitions that aim to improve satisfaction among fans and sports enthusiasts. The SVM algorithm therefore proves to be an effective instrument for examining sentiment in Twitter comments and shows potential for further use in designing solutions that improve football fan satisfaction.

E. Acknowledgment
This research was supported by the Informatics Department, Faculty of Engineering, Universitas Dr. Soetom. We would like to thank GitHub for helping us out and for allowing us to use their dataset as the foundation for our study.

Figure 1. Research Methodology

Table 2. Case Folding