Financial Banking Dataset for Supervised Machine Learning Classification

Social media has opened new avenues and opportunities for financial banking institutions to improve the quality of their products and services and to understand and adapt to their customers' needs. By directly analyzing customer feedback, financial banking institutions can provide personalized products and services tailored to those needs. This paper presents a research framework for creating a financial banking dataset to be used for Sentiment Classification with various Machine Learning methods and techniques. The dataset contains 2234 financial banking comments collected from Romanian financial banking social media via web scraping.


Introduction
With the explosive growth of social media (i.e. reviews, forum discussions, blogs, microblogs and social networks), customers are encouraged to express their thoughts online and to exchange opinions about financial banking products and services. As a result, a large amount of opinion-bearing financial banking data is generated by a variety of social media sources. Customers and potential customers are significantly influenced in their decisions about a product or service by reading online reviews to find out what others think of it [1]. The Romanian financial banking marketplace is steadily evolving into a digital marketplace. Financial banking customers are able to share their thoughts online and to read others' opinions on different portals. Therefore, there is a need to extract insights from the opinions expressed in Romanian financial banking comments and to find out what people discuss on Romanian financial banking forums.
The information available on the Web consists predominantly of unstructured text. A significant challenge is collecting the needed information in a structured way from web pages with very heterogeneous formats. Gathering data is the most important step in solving any machine learning problem. In the past few years, researchers and practitioners have investigated different methods for collecting data from online sources. One of the most popular methods for retrieving Web content at scale is web scraping, i.e. the automated and targeted extraction of data. Various frameworks and Application Programming Interfaces for developing customized scrapers exist, as well as configurable ready-to-use scraping tools. Comprehensive overviews of frameworks and tools for different extraction tasks are presented by Glez-Peña et al. [2] and Haddaway [3].
Using the Scrapy framework, 2234 financial banking comments in Romanian, posted between June 2009 and April 2018, have been extracted. The financial banking dataset contains data from the Conso portal, which is the most popular forum in Romanian financial banking social media. This paper presents the framework used to create the financial banking dataset from the Romanian marketplace for use in Sentiment Classification and social media analytics in a Machine Learning context. To the best of our knowledge, there is no available dataset for sentiment classification of Romanian financial banking customer reviews. The financial banking dataset is available in Romanian and English on Kaggle, one of the largest and most diverse data communities for Machine Learning researchers and practitioners. The Romanian version is available at https://www.kaggle.com/iryna13raicu/financialbankingcommentsro and the English version at https://www.kaggle.com/iryna13raicu/financialbankingcommentsen

Research Framework for Financial Banking Dataset Creation
The main research goal is to identify relevant financial banking data from Romanian online sources and to obtain consistent financial banking customer data in a structured format that can be analyzed and classified in a Machine Learning context. To this end, a Financial-Banking Dataset Creation framework is proposed. The framework is illustrated in Figure 1.

Fig. 1. Financial-Banking Dataset Creation Framework
The main stages of the framework are:
- Data Sources Identification
- Data Extraction
- Data Storage
- Data Exploration
In the following subsections, the stages for dataset creation are detailed.

Data Sources Identification
The first step of the research framework is the identification of relevant financial banking data sources, i.e. sources containing discussion topics in which users share their experiences with various institutions, the conditions they offer and their customer service, with a focus on the Romanian marketplace. The main data sources containing information about products and services from the Romanian financial-banking marketplace are listed in Table 1. The most relevant financial reviews were found on the Conso portal (www.conso.ro). Conso is a very popular and reliable website for finding financial banking information, with a reasonable number of users. The portal provides a general overview of financial and banking products and services from the Romanian marketplace, such as:
- Different types of loans (personal loan, mortgage, car loan, leasing, real estate loan)
- Savings (deposit, savings accounts, SME deposit)
- Payments (debit card, internet banking, free cash withdrawals)
- Investment (investment funds, pensions, insurance)
Other information provided by Conso relates to utilities (electricity), the exchange rate and its evolution, and guides for various types of loans, cards and savings. The Conso portal is structured as a standard website with a forum-like section dedicated to users' reviews (Vocea Clientului in Romanian, i.e. "Voice of the Customer"), where users can express their opinions about a bank or financial institution and its financial banking products or services. An overview of the financial banking forum on the Conso portal is shown in Figure 2. Customers interested in different financial banking products or services save time by visiting this website and comparing offers. They can compare offers for real estate loans, mortgages, personal loans, refinancing loans, private pensions, deposits, cards, personal investments, auto loans, SME loans and electricity.

Data Extraction
Recently, web scraping methods have emerged as a promising way of collecting data online. Scraping is considered one solution for automatically collecting data from various Internet resources. Web scraping, also known as data extraction or web crawling, is the process of finding data on web pages and extracting it into a structured format such as files or a database. Another definition describes web scraping as the process of retrieving semi-structured documents from web pages, typically written in a markup language such as HTML or XHTML, and analyzing them in order to obtain the needed information [4]. A web crawler, also known as an indexer, web spider or web robot, is an automated program, application or script that scans web pages to collect information. The web crawler searches and extracts data from web pages, navigating from URL to URL according to predefined algorithms. Its main role is to simplify and automate the entire crawling process and to make data crawling easy and accessible to everyone. According to [5], crawlers are classified into two main categories: classical (traditional) crawlers and focused crawlers. Classical web crawlers navigate web pages and gather both relevant and irrelevant information, which wastes a great deal of crawling time and of storage for the downloaded information [6]. Focused crawlers, as opposed to traditional crawlers, do not crawl the whole web; they crawl in depth only the specific part of the web that is related to the given topic.
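The difference between a classical and a focused crawler can be illustrated with a minimal sketch. It assumes a hypothetical in-memory "web" (TOY_WEB) instead of real HTTP requests, and a simple keyword test as the relevance criterion; neither is taken from the crawlers cited above.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> (page text, outgoing links).
TOY_WEB = {
    "/home": ("Welcome to the portal", ["/loans", "/sports"]),
    "/loans": ("Personal loan reviews", ["/loans/page2", "/home"]),
    "/loans/page2": ("More loan reviews", []),
    "/sports": ("Football results", ["/sports/page2"]),
    "/sports/page2": ("More football results", []),
}


def crawl(start, is_relevant=None):
    """Breadth-first crawl from `start`.

    With is_relevant=None every link is followed (classical crawler).
    With a relevance test, links are followed only from on-topic pages
    (focused crawler); the start page is always expanded.
    """
    seen, queue, visited = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        visited.append(url)  # the page has been downloaded
        if is_relevant is not None and url != start and not is_relevant(text):
            continue  # off-topic page: do not follow its links
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


classical = crawl("/home")
focused = crawl("/home", is_relevant=lambda text: "loan" in text.lower())
```

In this toy graph the focused crawl still downloads "/sports" once in order to judge it, but never reaches "/sports/page2", which is where the saving in crawling time and storage comes from.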
The web scraping process consists of two main steps: fetching (downloading the web pages) and parsing the obtained data into the desired information. Web scrapers are software programs with functionality similar to that of web crawlers, with some differences illustrated in Table 2. A significant challenge is to automatically collect the financial banking information from the Conso portal using a web scraper. Various frameworks and Application Programming Interfaces for developing customized scrapers exist, as well as configurable ready-to-use scraping tools. The platforms for scraper implementation differ from one another in terms of scalability, flexibility and performance in different scenarios. Robustness and politeness are properties that every web scraper must provide [7]. Other properties that should be considered when implementing a web scraper are performance and efficiency, distribution, scalability, quality and extensibility.
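The two steps, fetching and parsing, can be sketched with the Python standard library alone. The markup in SAMPLE_PAGE and the class names "review", "text" and "stars" are hypothetical (the actual Conso markup is not described here), and fetch() returns stored HTML rather than performing a real download.

```python
from html.parser import HTMLParser

# Hypothetical page markup, used in place of a real download.
SAMPLE_PAGE = """
<html><body>
  <div class="review"><p class="text">Good service</p><span class="stars">5</span></div>
  <div class="review"><p class="text">Hidden fees</p><span class="stars">1</span></div>
</body></html>
"""


def fetch(url):
    """Step 1 (fetching): stands in for an HTTP request by returning stored HTML."""
    return SAMPLE_PAGE


class ReviewParser(HTMLParser):
    """Step 2 (parsing): collects the text and star rating of each review."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._field = None  # name of the field whose text is expected next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "review":
            self.reviews.append({})  # a new review block starts
        elif css_class in ("text", "stars"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.reviews:
            self.reviews[-1][self._field] = data.strip()
            self._field = None


def scrape(url):
    parser = ReviewParser()
    parser.feed(fetch(url))
    return parser.reviews


reviews = scrape("https://example.org/forum")
```

A polite scraper would additionally respect robots.txt and pause between requests; both are left out of this sketch.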
Scrapy is an open-source web crawling framework written in Python for scraping massive amounts of data from various sources in a robust and efficient manner [8]. Scrapy is an integrated system that includes an engine for controlling the data flow between all components, a scheduler for receiving requests, a downloader for fetching web pages, and custom classes (called spiders) written by users to parse responses and extract data [9].
In the literature, there are many approaches that use web scraping based on the Scrapy framework for data collection. Landers et al. [10] proposed an approach called theory-driven web scraping, in which data collection is guided by substantive theory, for use in psychological research. Another domain in which web scraping has gained attention is criminal justice [11]. Web scraping is also widely used in eCommerce applications [12] [13]. Data scraping has been successfully applied in the real-estate domain [14].
Extracting information in a structured format from the Conso portal involves automatically retrieving the links that lead to posts and obtaining the actual data objects of those posts. For this purpose, a web scraper has been implemented using the Scrapy framework, with the Anaconda distribution as the development environment. The web scraper implementation (https://github.com/irina-raicu/ATLAS/scraping) is available online. The web scraper is based on a focused crawler, which collects all web pages of the Conso portal where financial banking posts are published, and on a parser, which extracts the needed text from those pages. The task of capturing and structuring data extracted from the Web is thus divided into two parts: crawling and data scraping. Scrapy provides the required functionality: crawling multiple URLs of the Conso portal while avoiding non-informative data and duplicate pages; automatically scraping a large amount of data, using CSS selectors to select relevant data on banking services and products such as loans, deposits and cards; and storing the data in a structured format such as JSON, an exploitable structure that facilitates data processing and analysis. Other relevant requirements for the implementation of a web spider for our application relate to CPU and memory usage and to speed. The memory and CPU requirements of Scrapy follow the amount of data needed by a multithreaded application, and the speed requirement for automatic navigation of dynamic URLs and data extraction is also satisfied.
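Two of the requirements listed above, avoiding duplicate pages and storing the result as JSON, can be illustrated independently of Scrapy. The sketch below uses only the standard library; it is not Scrapy's own duplicate filter or exporter, and the example.org URLs are hypothetical.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def canonical(url):
    """Normalise a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Lower-case scheme and host, drop the fragment, keep the query string.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def unique_urls(urls):
    """Return canonical URLs in first-seen order, without duplicates."""
    seen, result = set(), []
    for url in urls:
        key = canonical(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


def to_json_lines(items):
    """Serialise scraped items as one JSON object per line, keeping diacritics."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)


urls = unique_urls([
    "https://Example.org/forum/",
    "https://example.org/forum#top",
    "https://example.org/forum?page=2",
])
lines = to_json_lines([{"banca": "Banca Transilvania", "review_total": "1"}])
```

Keeping the query string matters for paginated forums: pages that differ only in a page parameter are distinct pages, whereas a fragment such as "#top" points into the same page.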

Data Storage
As described in the previous stage, 2234 financial banking posts in Romanian, published on the Conso portal between June 2009 and April 2018, have been extracted using the Scrapy framework. For each review, the following information has been collected in JSON format:
(1) Entire text of the review
(2) Date of the review
(3) Financial banking institution
(4) Financial banking product
(5) Characteristics that have been evaluated by the users
(6) Rating of each characteristic
(7) Star rating of the review
The following example is a review in Romanian from the Conso portal in JSON format:
{
  "text": "Neserioasa banca! Prima banca care comisioneaza incasarea salariului, desi isi fac publicitate ca nu au nici un cost.",
  "autor_opinie": "de Florin Stefan (Braila)",
  "data_opinie": "20 Aprilie 2018",
  "banca": "Banca Transilvania",
  "produs_bancar": "incasarea salariului",
  "review_total": "1",
  "caracteristica1": "Transparenta costurilor",
  "nota_caracteristica1": "1",
  "caracteristica2": "Timpul de asteptare",
  "nota_caracteristica2": "1",
  "caracteristica3": "Functionarii institutiei financiare",
  "nota_caracteristica3": "1 ",
  "caracteristica4": "Procedura de lucru",
  "nota_caracteristica4": "1"
}
In English, the review text reads: "Unreliable bank! The first bank to charge a fee for receiving a salary, although they advertise that there are no costs."
As mentioned before, the extracted data is intended for Sentiment Classification using supervised machine learning methods and techniques. A supervised learning mechanism identifies specific relationships or structure in the input data in order to predict the correct output. Access to labeled data is therefore paramount for supervised learning [15]. On the Conso portal, each post has an associated star value provided by the user: comments use a star system on a scale of one to five. It is also important to note that not all posts represent opinions about a financial banking service or product. Some users post updates regarding legal ordinances or other relevant legislative changes, petitions or suggestions. In addition, financial banking representatives interact actively with users through Conso and respond to them regarding financial banking products or services or other relevant information.
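A record of the kind shown above can be turned into typed values before analysis. The following is a minimal sketch: the field names come from the example record, while the output keys, the helper names and the assumption that dates always follow the "day month year" pattern with a Romanian month name are illustrative choices.

```python
import json
from datetime import date

# Romanian month names, lower-cased, mapped to month numbers.
RO_MONTHS = {
    "ianuarie": 1, "februarie": 2, "martie": 3, "aprilie": 4,
    "mai": 5, "iunie": 6, "iulie": 7, "august": 8,
    "septembrie": 9, "octombrie": 10, "noiembrie": 11, "decembrie": 12,
}


def parse_ro_date(text):
    """Parse a date such as '20 Aprilie 2018' (assumed format: day month year)."""
    day, month, year = text.split()
    return date(int(year), RO_MONTHS[month.lower()], int(day))


def normalise(record):
    """Convert the string fields of a scraped review into typed values."""
    ratings = {}
    i = 1
    # Characteristics are stored as numbered key pairs: caracteristica1, ...
    while f"caracteristica{i}" in record:
        name = record[f"caracteristica{i}"]
        # Ratings may carry stray whitespace, e.g. "1 ".
        ratings[name] = int(record[f"nota_caracteristica{i}"].strip())
        i += 1
    return {
        "text": record["text"],
        "date": parse_ro_date(record["data_opinie"]),
        "bank": record["banca"],
        "product": record["produs_bancar"],
        "stars": int(record["review_total"]),
        "ratings": ratings,
    }


raw = """{
  "text": "Neserioasa banca!",
  "data_opinie": "20 Aprilie 2018",
  "banca": "Banca Transilvania",
  "produs_bancar": "incasarea salariului",
  "review_total": "1",
  "caracteristica1": "Transparenta costurilor",
  "nota_caracteristica1": "1",
  "caracteristica3": "Functionarii institutiei financiare",
  "nota_caracteristica3": "1 "
}"""
review = normalise(json.loads(raw))
```

Because the loop stops at the first missing index, the abbreviated record above yields only the first characteristic; a complete record with consecutive indices yields all of them.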
The collected data serves as input for supervised classification, so each post from the Conso portal needs to be labelled (a concept also known as "annotation"). The most common annotation scheme for sentiment classification uses polarity classes: either positive and negative, or positive, negative and neutral. Thus, in the dataset, each post is annotated with an opinion label (positive, negative or neutral) in order to capture the polarity of subjective texts. To deal with objective texts, each post is also annotated with a factuality label (opinion, fact or experience). The annotation with opinion labels is performed automatically using the SentiWordNet lexicon. SentiWordNet is a lexical resource that assigns to each sense of a term scores for positivity, negativity and objectivity [16]. The application used for labelling the customer comments is described in Figure 3.
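The general idea of lexicon-based polarity labelling can be sketched as follows. The lexicon below is a hypothetical four-word stand-in, its scores are invented for illustration and are not SentiWordNet values, and the aggregation rule (sum the scores, compare against a margin) is an assumption rather than the procedure of Figure 3.

```python
# Hypothetical mini-lexicon: word -> (positive score, negative score).
# These are NOT SentiWordNet scores; they only illustrate the mechanism.
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "helpful": (0.5, 0.0),
    "bad": (0.0, 0.625),
    "hidden": (0.0, 0.25),
}


def label_polarity(text, lexicon=TOY_LEXICON, margin=0.1):
    """Sum word scores and return 'positive', 'negative' or 'neutral'."""
    pos = neg = 0.0
    for token in text.lower().split():
        # Strip common punctuation; unknown words contribute nothing.
        p, n = lexicon.get(token.strip(".,!?"), (0.0, 0.0))
        pos += p
        neg += n
    if pos - neg > margin:
        return "positive"
    if neg - pos > margin:
        return "negative"
    return "neutral"


labels = [
    label_polarity("Good and helpful staff!"),
    label_polarity("Bad bank, hidden fees."),
    label_polarity("The contract has ten pages."),
]
```

A real pipeline would also need word-sense selection, since SentiWordNet scores senses rather than surface words, and a way to map Romanian text onto the lexicon.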

Fig. 3. Automatic Opinion Labels Annotation of Romanian posts
Factuality has been successfully applied in sentiment classification in order to capture the polarity of objective texts [17]. Even though the concepts of fact, opinion and experience are very similar, there is a distinction among these categories.
In certain application domains, facts may also have a polar orientation, since they may have negative or positive implications for users (financial banking customers, in this case). For instance, the adoption of a new legal ordinance that increases the interest charged to customers has a negative impact from the customer's point of view, because costs will increase. Facts are considered objective information, whilst opinions and experiences are subjective.
A fact represents information used as evidence. A fact can be proved.
Examples of facts from the extracted data are "You have to read the contract carefully" (translation of the Romanian original post "Trebuie sa citesti atent contractul") and "Regarding the message of Mr. Narciz Bejinariu, we communicate the following: the loan interest rate is retained. Depending on the supporting documents submitted by the client with the restructuring request, a total grace period (no principal or interest) or partial interest (up to 12 months) can be granted."
Other posts describe customer experiences, for instance "At the branch, the adviser was very kind and welcomed me with a smile, even in more tense moments. I particularly appreciate the professional and polite way in which I received explanations and answers to my questions. I recommend the Orizont Brasov agency. At first I gave 1 star, now I give 5 stars! Thank you!" (translation of the Romanian original post "In sucursala doamna consilier a fost foarte draguta intampinandu-ma cu zambet chiar si in momente mai tensionate. Apreciez in mod deosebit modul profesionist si politicos in care am primit explicatii si raspunsuri la intrebarile mele. Recomand Agentia Orizont Brasov. La inceput am dat 1 steluta acum dau 5 stelute! Multumesc!") and "Hello. I have the same unpleasant experience with BCR. Internal rate of 8 9% plus margin. I join those who want to do something against BCR. I propose to publicize this case of BCR and, through media channels, possibly to send a joint complaint to television stations and newspapers. We should set up a joint meeting of all those affected, together with a lawyer, and file complaints with ANPC, the bank and possibly television stations and newspapers." (translation of the Romanian original post "Buna ziua. Si eu am aceeasi experienta trista cu BCR-ul. Dobanda interna de 8 9% plus marja. Ma alatur si eu celor care vor sa faca ceva impotriva BCR-ului. Propun sa mediatizam acest caz al BCR-ului si prin canalele mass-media eventual sa trimitem o sesizare comuna posturilor de televiziune ziarelor. Ar trebui sa stabilim o intalnire comuna toti cei patiti impreuna cu un avocat si sa depunem sesizari la anpc banca si eventual televiziuni ziare.").
As mentioned, experiences are widely shared among the users of the financial banking community (even more than opinions). Therefore, a new category, "experience", is added to the traditional categorization of facts vs. opinions.
When describing an experience, the customer may also express an opinion, so experiences and opinions are not always mutually exclusive. If an experience includes an opinion, the annotator is asked to label the sentence as "experience". On the other hand, "facts", "opinions" and "experiences" may be positive, negative or neutral, depending on whether they express or arouse positive, negative or neutral sentiments and feelings, respectively. The annotation of the extracted data is performed manually by a specialist in the financial banking domain. Table 3 shows examples of positive, negative and neutral facts, opinions and experiences.

Data Exploration
The aim of this stage is to ensure that the dataset is reliable for training a machine learning model and obtaining useful predictions. Thus, several metrics, such as duplicated text, completeness of text values, text format and ambiguous text detection, have been applied to the financial banking dataset.
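Two of these checks, duplicated text and completeness of values, can be sketched in a few lines. The choice of required fields and the normalisation used to detect duplicates (case and whitespace folding) are assumptions made for the example, not the exact checks applied to the dataset.

```python
def quality_report(records, required=("text", "banca", "review_total")):
    """Return the indices of duplicated and of incomplete records."""
    seen, duplicates, incomplete = set(), [], []
    for index, record in enumerate(records):
        # Completeness: every required field must be present and non-blank.
        if any(not str(record.get(field, "")).strip() for field in required):
            incomplete.append(index)
        # Duplicates: compare texts after folding case and whitespace.
        key = " ".join(str(record.get("text", "")).lower().split())
        if key and key in seen:
            duplicates.append(index)
        seen.add(key)
    return {"duplicates": duplicates, "incomplete": incomplete}


sample = [
    {"text": "Banca buna", "banca": "X", "review_total": "5"},
    {"text": "  banca  BUNA ", "banca": "X", "review_total": "5"},
    {"text": "", "banca": "Y", "review_total": "3"},
]
report = quality_report(sample)
```

Empty texts are reported as incomplete rather than as duplicates of one another, so each problem is counted once.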
The chart in Figure 4 shows the distribution of the "fact", "opinion" and "experience" classes across comments. As expected, customers post comments mostly about their experiences (1324 reviews), compared to facts (421 reviews) and opinions (488 reviews). The most commented financial banking products are:
- Savings and loan (translation from Romanian "Economisire-Creditare")
- First House loan (translation from Romanian "Credit Prima Casa")
- Current account (translation from Romanian "Cont curent")
- Real estate loan (translation from Romanian "Credit imobiliar")
- Personal loan (translation from Romanian "Credit de nevoi personale")
Figure 6 shows the distribution of financial banking institutions commented on by customers on the Conso portal.
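The class counts reported for Figure 4 can be turned into shares and an imbalance ratio with a short calculation. The counts are taken from the text; the helper names are illustrative. The three counts sum to 2233.

```python
# Class counts as reported in the text for Figure 4.
counts = {"experience": 1324, "opinion": 488, "fact": 421}


def class_shares(counts):
    """Fraction of the labelled reviews that falls into each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest one."""
    return max(counts.values()) / min(counts.values())


shares = class_shares(counts)
ratio = imbalance_ratio(counts)
```

Here the "experience" class holds roughly 59% of the labelled reviews and is about three times the size of the "fact" class, which is the kind of imbalance a classifier trained on this data has to cope with.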

Conclusions and Further Research
Nowadays, with the proliferation of financial banking reviews, microblogs, forum discussions, blogs, social networks and other forms of expression, more and more customers are becoming aware of the growth of the web's popularity. For instance, before making a decision about a financial banking product or service, customers explore others' opinions about it [17]. Sentiment Classification is successfully used to identify customers' feelings about a product or a service [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. Analyzing customers' feedback and expectations is a major aid in measuring overall performance and sales and in improving the marketing strategies of financial banking institutions, especially regarding their online presence. This paper introduces a research framework for dataset creation in the financial banking domain, for further use in a Supervised Machine Learning context. To the best of our knowledge, there is no available dataset for sentiment classification of Romanian financial banking customer reviews. The dataset contains 2234 financial banking posts in Romanian, published on the Conso portal between June 2009 and April 2018. To build the dataset, a web scraping technique based on the Scrapy framework is used. The extracted information is vast and contains not only customers' opinions but also experiences and facts (e.g. discussions about petitions, legislative changes, etc.). Therefore, some posts are subjective texts (i.e. "opinionated information") whilst others are objective texts (i.e. "factual information"). Each post in the dataset is annotated as "Positive", "Negative" or "Neutral" and as "Opinion", "Experience" or "Fact" in order to capture both the subjectivity and the objectivity of customers' reviews. As further research, the main objective is to build supervised machine learning classification models for Sentiment Analysis in order to extract insights from the opinions of financial banking customers. This is particularly challenging because the models have to deal with the classification of reviews in Romanian and with an imbalanced class distribution.