WEB PLACE NAME DICTIONARY IMPLEMENTATION USING TWITTER AS SOURCE TO DEVELOP TRAFFIC INFORMATION SYSTEM

Congestion is one of the most severe problems in big cities such as Jakarta, the center of the Indonesian government. One alternative solution is a Traffic Information System that is easily accessible to the public. Twitter, as a social media platform, can be used as one of the information sources for such a system. The most common problem in processing information from Twitter, especially in Indonesia, is the use of abbreviations and typographical errors (typos). We propose applying a Web Place Name Dictionary to handle abbreviations and typos in location names found on Twitter in order to develop a Traffic Information System. The proposed system handles three cases of finding a coordinate point on a street derived from the location names in a Tweet. The system achieves a 90% accuracy rate on a data set of 100 items obtained via @TMCPoldaMetro. This study focuses only on the implementation of the Web Place Name Dictionary method, so we do not compare our system with other methods.


Introduction
Congestion results from the imbalance between road capacity and the number of passing vehicles. Currently the total length of roads in Jakarta is only 7,208 kilometers [1], while the number of motor vehicles as of April 2012 was 13,346,802 [2]. Based on these facts, the ratio is about 1,851 vehicles per kilometer. This number is very far from the ideal ratio, which is less than 100 vehicles per kilometer [3].
The rapid development of technology has made mobile devices more affordable. Hence, the use of the social media platform Twitter is increasing. Various kinds of information are distributed to the public through this medium, including information on traffic conditions shared by @TMCPoldaMetro, @LewatMana and @SonoraFM92. This phenomenon can be used as an opportunity to develop a Traffic Information System that meets the need for information about traffic conditions on the roads that people are going to travel.
The most common problem in processing information from Twitter, especially in Indonesia, is the use of abbreviations and typographical errors (typos). Several methods can be used to solve this problem, such as utilizing geotags, utilizing tagging by a tagger, and constructing a web place-name dictionary [4]. The web place-name dictionary is chosen as the method because Indonesian people seldom use the geotagging feature and usually write non-standard sentences, which makes a tagger-based process difficult to implement.
A similar study has been carried out by Endarnoto [5], who used Natural Language Processing to identify the name of the location mentioned in a Tweet. However, that research does not cover how to obtain the precise coordinates of a location name that has been obtained through Twitter.

Methodology

OpenStreetMap
OpenStreetMap (OSM) is a free and editable map that allows users to access the entire dataset behind the map (latitude and longitude coordinates) around the world [6]. This open dataset can be used as a source of data and information about the path coordinates and names of streets in Jakarta. The structure of a node in OSM can be seen in Figure 1.
A street is represented as a collection of nodes that have latitude and longitude coordinates. The nodes are ordered from the starting point to the end point of the street. The structure of an OSM street can be seen in Figure 2.
MapQuest is one of the leading providers of online mapping services that support OSM [7]. MapQuest provides several features to help developers build desktop, web, and mobile applications. The JavaScript 6.0 SDK is available for developing client-side applications, while on the server side developers can use the C++, .NET and Java SDKs [8]. To determine the coordinate of a point on the street that represents information from Twitter, a mechanism is needed to calculate the distance between two coordinate points along existing roads. One of the services that can be used is the web service provided by yournavigation.org.
Yournavigation.org uses the Gosmore routing engine, so the service is lightweight, scalable, and fast [9]. The web service takes the following parameters:
• flat = latitude of the start location
• flon = longitude of the start location
• tlat = latitude of the destination location
• tlon = longitude of the destination location
• v = type of vehicle
• fast = 1 for the fastest route, 0 for the shortest route
• layer = the Gosmore instance used to calculate the route
The yournavigation.org web service returns a KML file, which has an XML-like structure and is therefore easy to parse. One of the values used in developing this system is the distance between two coordinate points. A sample yournavigation.org KML file can be seen in Figure 4.
In fact, @TMCPoldaMetro publishes three kinds of Tweets: Tweets with one, two, or three location names. Tweets that contain two or three location names are processed using the web service.
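The request parameters listed above can be assembled into a query URL, and the distance can be read from the returned KML, roughly as in the following minimal sketch. The endpoint path, the default vehicle and layer values, the `<distance>` element name, and the sample KML are assumptions made for illustration; the paper states only the parameter names and that the response is a KML file containing a distance value.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint path; the paper names only the yournavigation.org service.
BASE_URL = "http://www.yournavigation.org/api/1.0/gosmore.php"


def build_route_url(flat, flon, tlat, tlon, vehicle="motorcar", fast=0, layer="mapnik"):
    """Build a routing request URL from the parameters described in the text.

    The default values for vehicle, fast and layer are illustrative assumptions.
    """
    params = {
        "flat": flat,   # latitude of the start location
        "flon": flon,   # longitude of the start location
        "tlat": tlat,   # latitude of the destination location
        "tlon": tlon,   # longitude of the destination location
        "v": vehicle,   # type of vehicle
        "fast": fast,   # 1 = fastest route, 0 = shortest route
        "layer": layer, # Gosmore instance used for routing
    }
    return BASE_URL + "?" + urlencode(params)


def extract_distance(kml_text):
    """Return the first <distance> value in the KML, ignoring XML namespaces."""
    root = ET.fromstring(kml_text)
    for elem in root.iter():
        if elem.tag.split("}")[-1] == "distance":
            return float(elem.text)
    return None


# Hypothetical response body, shaped only to exercise extract_distance.
SAMPLE_KML = """<kml xmlns="http://earth.google.com/kml/2.0">
  <Document>
    <name>Route</name>
    <distance>1.25</distance>
  </Document>
</kml>"""

url = build_route_url(-6.17, 106.82, -6.18, 106.83)
distance = extract_distance(SAMPLE_KML)
```

No network request is made here; in a real system the URL would be fetched and the response body passed to `extract_distance`.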

Web Place-name Dictionary
One solution to the problem of matching a raw location name obtained from Twitter to a known location name is the Web Place-name Dictionary [4]. A web place-name dictionary is built by collecting location names from available sources and storing them in a database, which can then be searched for location names that are similar to a given word or query.
In developing this system, street names and coordinates are obtained from the OpenStreetMap data set for the Jakarta area. The data can be obtained from http://downloads.cloudmade.com/asia/southeastern_asia/indonesia/jakarta_raya/jakarta_raya.highway.osm.bz2. Only primary and secondary roads are used in this study; they are identified by the tags <tag k="highway" v="primary"/> and <tag k="highway" v="secondary"/>.
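As an illustration of how such a dictionary could be populated, the sketch below reads an OSM XML document and keeps only the ways tagged as primary or secondary roads, together with their ordered node coordinates. The sample data, the function name `load_streets`, and the choice to key the result by the `name` tag are assumptions for this example, not the authors' implementation.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written OSM extract used only for illustration.
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="-6.1670" lon="106.8300"/>
  <node id="2" lat="-6.1675" lon="106.8320"/>
  <node id="3" lat="-6.2000" lon="106.8000"/>
  <node id="4" lat="-6.2010" lon="106.8010"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="primary"/>
    <tag k="name" v="Jalan Ir. Haji Juanda"/>
  </way>
  <way id="11">
    <nd ref="3"/>
    <nd ref="4"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Jalan Contoh"/>
  </way>
</osm>"""


def load_streets(osm_text, kinds=("primary", "secondary")):
    """Map street name -> ordered list of (lat, lon) for the selected road kinds."""
    root = ET.fromstring(osm_text)
    # Index every node by id so that way references can be resolved.
    coords = {
        node.get("id"): (float(node.get("lat")), float(node.get("lon")))
        for node in root.findall("node")
    }
    streets = {}
    for way in root.findall("way"):
        tags = {tag.get("k"): tag.get("v") for tag in way.findall("tag")}
        if tags.get("highway") in kinds and "name" in tags:
            path = [coords[nd.get("ref")] for nd in way.findall("nd")
                    if nd.get("ref") in coords]
            # A street may be split over several ways; append them in file order.
            streets.setdefault(tags["name"], []).extend(path)
    return streets


streets = load_streets(SAMPLE_OSM)
```

In this sample only the primary road is kept; the residential way is filtered out.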

String Matching Algorithm
The String Matching Algorithm searches for the strings that have the highest similarity to a given query string. It is used to solve the typo and abbreviation problems that often occur in Indonesian Tweets. The String Matching Algorithm consists of two main processes: a string search algorithm that finds candidate results, and the calculation of the similarity between each candidate string and the query string. The details of the string matching flow can be seen in Figure 6.
The Q-grams method is often used to search for candidate strings. It chops a string into several substrings, each consisting of q characters. The Q-grams method is divided into two types, overlapping and consecutive. In overlapping Q-grams, a q-gram begins at every position in the string, while in consecutive Q-grams, a q-gram begins at every position that is a multiple of q [10]. For the string "juanda", overlapping 3-grams yield {(jua), (uan), (and), (nda)}, whereas consecutive 3-grams yield {(jua), (nda)}. The value of Q can be changed as needed, but in this study we use Q = 3. This choice is based on the characteristics of Indonesian words and street naming in Indonesia.
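The two variants can be sketched in a few lines. The function names are illustrative, and dropping a trailing fragment shorter than q in the consecutive variant is an assumption, since the text does not say how leftover characters are handled.

```python
def overlapping_qgrams(s, q=3):
    """One q-gram starting at every position of the string."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def consecutive_qgrams(s, q=3):
    """One q-gram starting at every position that is a multiple of q.

    A trailing fragment shorter than q is dropped (assumption).
    """
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


overlapping = overlapping_qgrams("juanda")   # ['jua', 'uan', 'and', 'nda']
consecutive = consecutive_qgrams("juanda")   # ['jua', 'nda']
```

Both results match the "juanda" example given in the text.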
Three methods are commonly used to calculate the similarity between two strings: Levenshtein Distance, Jaccard, and Cosine. The Levenshtein Distance algorithm, often called edit distance, describes the similarity between two strings based on the minimal number of insert, delete, and substitute operations needed to transform one string into the other [11] [12].
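The edit distance $d(s, t)$ is defined between strings $s = s_1, \ldots, s_n$ and $t = t_1, \ldots, t_m$, where $s, t \in \Sigma^*$ and $\Sigma$ is the alphabet of characters. We can construct a matrix $D$ with indices $0 \ldots n$, $0 \ldots m$ to calculate $d(s, t)$, where the value of the edit distance is at $D_{n,m}$. Each cell at position $(i, j)$ is calculated based on equation 1, with $D_{i,0} = i$ and $D_{0,j} = j$, where the value of the substitution cost $c_{i,j}$ is 0 if $s_i = t_j$ and 1 otherwise:

$$D_{i,j} = \min \begin{cases} D_{i-1,j} + 1 \\ D_{i,j-1} + 1 \\ D_{i-1,j-1} + c_{i,j} \end{cases} \tag{1}$$

Even with the value of the edit distance, we cannot yet assess whether the two strings are similar, because the edit distance is relative to the length of the strings. We need an equation that expresses how similar the two strings are. One of the equations that can be used normalizes the distance by the length of the longer string:

$$sim(s, t) = 1 - \frac{d(s, t)}{\max(n, m)} \tag{2}$$

Unlike the Levenshtein Distance algorithm, the Jaccard and Cosine algorithms do not compare letter by letter but compare the grams that are formed. Set $X$ is the result of applying the Q-grams method to the first string, and set $Y$ is the result of applying it to the second string. The Jaccard algorithm calculates the similarity value by the equation:

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{3}$$

The Cosine algorithm calculates the similarity value by the equation:

$$C(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}} \tag{4}$$

In this study, we use only the Levenshtein Distance algorithm because the order of the q-grams produced by the Q-grams method matters, and the set-based measures do not take that order into account.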
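The three measures described above can be sketched as follows, using their conventional definitions. Normalizing the edit distance by the length of the longer string is an assumption about which normalization the study uses, and all function names are illustrative.

```python
from math import sqrt


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # row for the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # cost of deleting i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Edit distance normalized by the longer string (assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def qgram_set(s, q=3):
    """Set of overlapping q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(x, y):
    """Shared grams divided by the size of the union."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def cosine(x, y):
    """Shared grams divided by the geometric mean of the set sizes."""
    return len(x & y) / sqrt(len(x) * len(y)) if x and y else 0.0


d = levenshtein("juanda", "djuanda")                        # one inserted letter
sim = edit_similarity("juanda", "djuanda")
jac = jaccard(qgram_set("juanda"), qgram_set("djuanda"))
cos_sim = cosine(qgram_set("juanda"), qgram_set("djuanda"))
```

For the pair "juanda" and "djuanda" the edit distance is 1, while the gram sets share four of five distinct 3-grams.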

Efficient Approximate Candidate String
In developing this kind of system, efficiency is an important factor. Methods that can be applied are Scan Count and Divide Skip, algorithms proposed by Li in 2008 [13].
The Scan Count algorithm counts, for each string in the database, the number of grams it shares with the grams of a given query. The Scan Count algorithm was then developed into Divide Skip, an algorithm in which filters are added to Scan Count to speed up the calculation. We applied some modifications to suit the environment of the experiments conducted. The pseudocode of the Divide Skip algorithm can be seen in Figure 6.
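The counting idea behind Scan Count can be sketched with an inverted index from grams to string identifiers. This is a simplified illustration, not the authors' modified algorithm: grams are treated as sets rather than multisets, names are lowercased, and the function names and sample street names are assumptions.

```python
from collections import defaultdict


def qgrams(s, q=3):
    """Overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(names, q=3):
    """Inverted index: gram -> list of ids of the names containing that gram."""
    index = defaultdict(list)
    for sid, name in enumerate(names):
        for gram in set(qgrams(name.lower(), q)):
            index[gram].append(sid)
    return index


def scan_count(query, index, n_strings, threshold, q=3):
    """Return ids of strings sharing at least `threshold` grams with the query."""
    counts = [0] * n_strings
    for gram in set(qgrams(query.lower(), q)):
        for sid in index.get(gram, ()):   # scan the inverted list of this gram
            counts[sid] += 1
    return [sid for sid, count in enumerate(counts) if count >= threshold]


names = ["juanda", "sudirman", "gajah mada"]   # illustrative dictionary entries
index = build_index(names)
candidates = scan_count("djuanda", index, len(names), threshold=2)
```

Only the candidates returned here would need the more expensive edit distance calculation.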
In general, the Divide Skip algorithm discards unpromising data by dividing the inverted lists into two groups: the lists with few members ($L_{short}$) and the lists with many members ($L_{long}$). The main problem in this algorithm is selecting the value of $L$, the number of long lists, so that the algorithm runs correctly and efficiently.
Li stated that the value of $L$ can be found using equation 5:

$$L = \frac{T}{\mu \log M + 1} \tag{5}$$

where $T$ is the occurrence threshold, $M$ is the length of the longest list, and $\mu$ is an independent coefficient. Based on the results of the experiments carried out by Li, the value of $\mu$ is 0.0085.

Problem Identification
Problem identification is done by analyzing the pattern in which information is provided via Tweets from the @TMCPoldaMetro account, and the methods that have been applied to map the location of a street based on information from social media. The method that has been applied in Indonesia utilizes the Google API with a location or street name as input.
The use of Google's API does not guarantee that the precise location can be found, because of typos and abbreviations in the location or street name.
There are three cases that must be solved in this system: Tweets with one location name, two location names, and three location names.
A Tweet with one location name usually describes the overall road conditions (Figure 8). A Tweet with two locations usually describes a certain part of the road (Figure 9). Tweet with three locations

System Requirement
In developing a system that is able to provide accurate coordinate information, a fairly in-depth analysis of the pattern of information provided through Twitter is required. In general, the system is built to meet the following specifications:
• The system is able to find the street name accurately even when there are typos or abbreviations.
• The system is able to locate specific coordinates on the road given one input location name.
• The system is able to locate specific coordinates on the road given two input location names.
• The system is able to locate specific coordinates on the road given three input location names.
• The system is able to visualize the coordinates of the intended path on a map.

Systems Design
In developing this system, the primary focus is the process of finding the location or street name that corresponds to what the information provider means. The flow diagram is shown in Figure 9.
The value of T used in the flow is calculated as (query.length - 2) / 2. Since a query of length n yields n - 2 overlapping 3-grams, this means that a candidate string must contain at least half of the grams formed from the given query. The system has four main flows: input processing, finding coordinates from one location name, finding coordinates from two location names, and finding coordinates from three location names. The system returns coordinates, consisting of latitude and longitude, that are considered to be the location referred to by a Tweet.
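As a small worked example of the threshold, the sketch below computes T from the query length. Rounding down for odd gram counts is an assumption, since the text gives only the expression (query.length - 2) / 2, and the function name is illustrative.

```python
def occurrence_threshold(query, q=3):
    """T = half the number of overlapping q-grams of the query.

    With q = 3 this equals (len(query) - 2) / 2; flooring is an assumption.
    """
    gram_count = max(len(query) - q + 1, 0)
    return gram_count // 2


t_juanda = occurrence_threshold("juanda")    # 4 grams, so T = 2
t_djuanda = occurrence_threshold("djuanda")  # 5 grams, so T = 2 after flooring
```

For very short queries T becomes 0, in which case every dictionary entry would pass the filter; how the original system treats that case is not stated.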
In the flow for one input query, the coordinates taken are the start point and end point of a path. This decision is made because the location that is meant usually covers the entire road. If the road in question is a two-way street, then the system will do the search process
In contrast to one-input query processing, two-input query processing utilizes the yournavigation.org web service to determine the most appropriate location for the information in a Tweet. There are two possibilities in this case: the location runs from the middle point to the starting point of the road, or from the middle point to the end point of the road.
The case is determined by the reference point given by the second query. The first query is the street whose coordinates are searched for. If the road in the first query is a two-way road, we must find which lane is closest to the reference point in the second query.
The case of three input queries is similar to the case of two input queries. We use the yournavigation.org service to determine the specific coordinates on the road. In the case of three input queries, the system searches for the node that has the shortest distance to the first reference point (query 2) and the node with the shortest distance to the second reference point (query 3).
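The node selection described above can be sketched as a minimum search over the nodes of the street. The paper measures distance along roads through the yournavigation.org web service; to keep this example self-contained, a great-circle (haversine) distance is swapped in instead, and the distance function is a parameter so that a routing-based distance could replace it. All names and coordinates are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km; a stand-in for the routing-service distance."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_node(street_nodes, reference, distance=haversine_km):
    """Node of the street with the shortest distance to the reference point."""
    return min(street_nodes, key=lambda node: distance(node, reference))


def nodes_for_three_queries(street_nodes, ref_a, ref_b, distance=haversine_km):
    """Nearest street node to each of the two reference points (queries 2 and 3)."""
    return (nearest_node(street_nodes, ref_a, distance),
            nearest_node(street_nodes, ref_b, distance))


# Hypothetical street running north to south, with two reference locations.
street = [(-6.10, 106.80), (-6.15, 106.80), (-6.20, 106.80)]
first, second = nodes_for_three_queries(street, (-6.09, 106.79), (-6.21, 106.81))
```

With two input queries only `nearest_node` is needed, using the second query as the single reference point.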

Implementation
Several applications are used in building this system, such as the Tomcat server, XAMPP and MySQL. The code is implemented in the Java language. Figure 13 shows an overall view of the system.
The prototype consists of two components: the input menu for one, two, or three inputs (Figure 14) and the map that visualizes the coordinates (Figure 15).
The input given in the field is sent to the server and processed to produce the corresponding location, as shown in Figure 15.

Testing
The prototype system is tested by conducting black box testing with street names or location names as input. There are 100 street name /

Conclusion
A traffic information system that uses a web place-name dictionary to determine the location mentioned in a Tweet is able to handle the typos and abbreviations that are often made by Indonesian users. The system is able to reduce the inaccuracy that often occurs with existing APIs such as Google Maps, which cannot resolve these typos and abbreviations. In black box testing, the system achieves an accuracy of 90% on a data set of 100 items taken at random. The web place-name dictionary can also be applied in a variety of other systems that process location names mentioned on Twitter. In this study, the system still has a weakness: the OpenStreetMap data contains no location names other than street names, and this problem remains to be resolved.

Figure 1. Data structure of node in OSM

Figure 5. String Matching Algorithm Process
Figure 6. String Matching Algorithm Flow Details

Figure 11. Flow diagram for one input query

Figure 13. Flow diagram for two input query
Figure 14. Flow diagram for three input query