Twitter location (sometimes) matters: Exploring the relationship between georeferenced tweet content and nearby feature classes

Abstract: In this paper, we investigate whether microblogging texts (tweets) produced on mobile devices are related to the geographical locations where they were posted. For this purpose, we correlate tweet topics to areas. In doing so, classified points of interest from OpenStreetMap serve as validation points. We adopted the classification and geolocation of these points to correlate with tweet content by means of manual, supervised, and unsupervised machine learning approaches. Evaluation showed the manual classification approach to be of the highest quality, followed by the supervised method, and that the unsupervised classification was of low quality. We found that the degree to which tweet content is related to nearby points of interest depends upon topic (that is, upon the OpenStreetMap category). A more general synthesis with prior research leads to the conclusion that the strength of the relationship of tweets and their geographic origin also depends upon geographic scale (where smaller scale correlations are more significant than those of larger scale).


Introduction
The so-called big data era, of which volunteered geographic information (VGI, cf. [25]) and more broadly user generated content (UGC, cf. [65]) can be seen as precursors, has brought with it a wealth of potential opportunities for data-driven research. A common observation with respect to such data is that they often come with an implicit or explicit georeference [35], and thus can be used to explore research questions of a geographical nature, such as the spread of flu [38,59], the locations and magnitudes of earthquakes [54], the nature of landmarks photographed and preferred by individuals [1,16,31], or even the prediction of elections [13,62]. However, a healthy skepticism has also developed with respect to the true properties and meaning of this data (e.g., [26,35]) as it has become clear that the data does not, and indeed cannot, speak for itself.
In this paper we seek to contribute to this debate by exploring one commonly made assumption. Microblogging services, such as Twitter, are seen by many researchers as an excellent opportunity to link text to locations (e.g., [2,12,29,41,56,67]), especially where the information is sent from a mobile device and directly associated with coordinates. However, it is equally obvious that one doesn't need to be at the location of an earthquake to discuss or react to it, and in this paper we aim to explore the extent to which the content of microblogging texts relates to the locations from which they are sent. Our underlying hypothesis is that if it is possible to extract meaningful geographical patterns from microblogging texts, then it should also be possible to relate such texts to existing geographic context. If the latter is not the case, this would in turn suggest that patterns emerging from such data may be more strongly influenced by the underlying distribution of the data source than by any process controlling variation in the locations of the content itself.
The above hypothesis leads us to the overarching research question addressed in this paper, which can be simply stated as follows: To what degree are the contents of individual microblogging texts related to their location?
In order to explore this question in more detail, we identify the following set of detailed research questions which we will explore in the remainder of the paper: (1) How can we represent spatial context in order to investigate the relationship between the information content and its surroundings? (2) How can individual texts be classified such that content can be related to surroundings? (3) Can we automate this classification process by means of machine learning? (4) Which learning algorithms would be best suited for this kind of automation? (5) Is there a corpus that allows us to appropriately substitute manual training data for the classification task? (6) Does the proportion of texts related to location-specific information show a decay over distance, in other words, are the locations of the texts which relate to specific locations non-randomly distributed in space?
2 State of the art

Twitter as a research corpus
In recent years, microblogging has evolved into a key means of communicating both within the World Wide Web (WWW) and in the broader sphere of social media [32,71] (for example, microblog contents are often displayed or reported in traditional print and audiovisual media and used as ways to elicit audience responses). Microblogs are so-called because of their terse, but publicly available content which can usually be associated with an individual, and are published on the WWW. Twitter is perhaps the most famous example of such a service, with its 140 character limit. Characteristic features of this type of text are often the expression of subjective impressions, trivia, opinions, and information [32]. The style of the texts is dominated by abbreviations, the use of hashtags indicating particular themes, (internet) slang, and spelling mistakes [28,37].
As well as Twitter, other microblogging platforms exist such as Tumblr and Weibo, with similar services being provided by Facebook, LinkedIn, and GooglePlus. As Twitter is currently the microblogging platform with the largest number of active users, especially in our study area, we use texts that are published via this service as a research corpus in this paper.
The sum of all published microblogging texts (tweets) may collectively be considered a source of information about opinions and sentiments on products, politics, society, and events. There are many approaches, typically focused on some form of natural language processing (NLP), to the automated analysis of such texts. Metaxas and Mustafaraj [46] review applications for which the potential of an automated analysis of microblogging texts has been investigated. Examples include prediction of box-office revenues for movies [4], stock market fluctuations [8], flu outbreaks [38,59], and even (political) elections [13,62], though the debate over the latter application demonstrates that some claims should be treated with care [23,33]. Moreover, the so-called "Twittersphere" may be used in order to analyze opinions and sentiments by means of sentiment analyses [61]. In this context, the correlation between the results of such analyses and events has also been investigated [40] and it has been argued that, for instance, for the purpose of disaster management, microblogging texts can support decision making. Examples of research in this domain include work addressing fire [19,34,64], floods [73], and earthquakes [54].
Besides such potential applications, the socio-economic characteristics of the Twitter user community have been explored. It has been shown that the intensity of the usage of the service within a specific region correlates with the average income and education of the population within this region [43]. With regard to the main intentions of usage it has been found that daily chatter, conversations, sharing information, and reporting news dominate [32]. This taxonomy has been confirmed in a second study which found user status updates, private conversations, weblinks to blogs and news, politics, sports, events, and advertising to be the main Twitter message types [17].

Relationship of geographic location and content of microblog-texts
About 1-3% of all tweets are reported as being tagged with geographical coordinates as meta-information [41]. The creation of this information must be explicitly confirmed by the user, i.e., this is an opt-in feature. Positioning is then performed either via the current IP address of the user, the current cell site location of the user, or via the GPS module of the mobile device. We call tweets for which this meta-information is available georeferenced tweets.
Some of the tweets that do not have position meta-information may be automatically georeferenced based on the actual text. However, due to the short length of the texts there is usually a lack of context, which makes text-based georeferencing challenging. Nevertheless, in previous work it has been shown that by using the correlation between text content and the locations where the texts have been created, it is possible to approximate position [12,29,56,67]. The methods that are applied for this purpose use toponyms contained in the texts as well as words that are characteristic for specific regions, such as the word "beach" for a coastal region. However, the average positioning accuracy that has been reached by these methods (480km in [67], 800km in [12], 1400km in [56], and federal state accuracy in [29]) is limited and only appropriate for low resolution applications. Likewise, this proves a correlation between location and contents only for small scales.
These findings suggest the assumption that the correlation between microblog contents and the locations where they have been created depends on the resolution of the analysis. This assumption is further supported by the visualization of term frequencies derived from mobile generated texts in the area of Dresden that we show in Figure 1. It can be seen that the toponym "Dresden" is very frequent, suggesting that many tweets in this area do indeed refer to the city of Dresden.
For non-georeferenced tweets the suitability of the location information of the user profiles has been evaluated for the purpose of automated georeferencing [29]. The results showed that only 66% of users specify geographically meaningful information as their "home location." In most cases the granularity is city or municipality. Leetaru et al. [41] report that for 34% of all tweets with an explicit georeference self-published home locations correspond to the location where the tweets were published. Finally, Xu et al. [68] found that Twitter users more often refer to toponyms near to the home locations that they have specified in their user profiles.
www.josis.org

Classification of microblogging texts by natural language processing
The classification of information simplifies the retrieval of information that is relevant to specific tasks. Methods of computational linguistics may support the classification of information in natural language texts. These methods often apply supervised machine classification, whose approaches are based on machine learning. They are well-established in the field and have already been implemented for a number of application areas including spam detection [3] and sentiment analyses [51,52].
All supervised machine classification methods need training data to derive decision criteria supported by statistical procedures. In the field of computational linguistics documents classified by humans are typically used as training data. Figure 2 illustrates schematically the process of supervised text classification with the help of manually classified documents. There are multiple alternatives for the actual classification algorithm. The most common algorithms are based on statistical procedures whose aim is to approximate class probability distributions from the training data. These distributions are used to classify new texts by deriving the probabilities for all possible classes, and subsequently selecting the class with the highest probability [10].

Figure 2: Schematic process of supervised text classification (classified documents serve as training data and input for machine learning).

For each class, feature vectors are derived that contain the words and the corresponding probabilities that these words occur in the respective class. For small training datasets a so-called unigram model is best suited. In this case only single words are used within the feature vectors. The probabilities for the occurrences result from the word frequencies, i.e., the number of occurrences of a specific word in the whole corpus, and the inverse document frequency, which is derived from the number of documents within the whole corpus that contain a specific word. Higher n-gram statistics, such as bigrams or trigrams, are also possible, but require larger training sets, since otherwise too many unique features which are never repeated prevent documents from being automatically classified.
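As an illustration of unigram features weighted by word frequency and inverse document frequency, the following minimal sketch may help. It assumes whitespace tokenization, a term frequency normalized per document, and the common weighting log(N / df); none of these choices is specified in the text above, and the three example documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document, using unigram features."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            # Relative term frequency times log-scaled inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


# Invented example documents (assumption, not taken from the study corpus)
docs = [
    "train is late again",
    "late dinner at the restaurant",
    "the train is on time",
]
weights = tf_idf(docs)
```

Words that occur in few documents (here "again") receive a higher weight than words shared by several documents (here "train"), which is the property that makes them useful for class disambiguation.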
Hence, the result of the learning process of each algorithm is a model, which contains the features, i.e., the words, and the corresponding weights, i.e., probabilities, for each possible class. Three of the most common text classification algorithms in the field of machine learning are naive Bayes (NB), maximum entropy (ME), and support vector machines (SVM), all of which apply a "bag of words" approach. While this sort of approach to text classification does not consider grammar and qualifying information, such as negation and comparison, it has a correspondingly low implementation complexity. A comparison of all three algorithms in the field of text classification is presented by Pang and Lee [52], who explore the example of sentiment classification. As we will use NB and ME in our empirical study, we will introduce these two algorithms in further detail in the following.

Naive Bayes
Naive Bayes classification is based on Bayes' theorem of conditional probability. With regard to the automatic classification of documents the concrete question is: what is the probability that a document containing a specific word belongs to a specific class? The probability function is derived by analyzing relative word frequencies in the training data. For details on the computation of naive Bayes classification, readers are advised to consult, e.g., Scharkow [55]. An overview of applications of naive Bayes in the field of computational linguistics is compiled by Lewis [42]. The Bayes classification is called naive because it assumes a statistical independence of the single features, i.e., the words in the concrete scenario of text classification. However, this criterion is usually not met in natural language texts, as due to collocations words often co-occur with specific other words.
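The principle can be sketched in a few lines. The following is a minimal multinomial naive Bayes classifier, assuming whitespace tokenization and add-one (Laplace) smoothing; both are illustrative assumptions rather than details given in the text, and the toy training texts are invented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (text, label) pairs. Returns the model parts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest posterior log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior of the class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothed word likelihood; words are treated as independent
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training data (assumption, not from the study corpus)
training = [
    ("the train is late", "related"),
    ("missed my train again", "related"),
    ("nice dinner tonight", "not_related"),
    ("watching a movie tonight", "not_related"),
]
model = train_naive_bayes(training)
prediction = classify("my train is late", model)
```

Summing the per-word log-probabilities is exactly where the independence assumption enters: each word contributes to the score without regard to the words around it.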

Maximum entropy
The assumption of statistical independence with respect to features does not need to be fulfilled for maximum entropy classification. The starting point of this algorithm is the concept of entropy, which was introduced into information science by Shannon and Weaver [57,58]. In their theory information entropy denotes the average degree of "surprise" that a certain event evokes, which is higher the less predictable the result of a random process is. Thus, the more improbable a certain event is, the more surprising its occurrence is. In turn, events with a high probability are not surprising and thus are not considered to be informative.
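The notion of entropy as average surprise can be made concrete with a short sketch. It computes the Shannon entropy in bits of a discrete probability distribution; the two example distributions are chosen purely for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    # Outcomes with zero probability contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = entropy([0.5, 0.5])    # two equally likely outcomes: least predictable
skewed = entropy([0.95, 0.05])   # one dominant outcome: highly predictable
```

The uniform distribution yields the larger value, in line with the statement that less predictable processes carry more information on average.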
In the maximum entropy classification model the average entropy of all possible classifications using the training data is computed. Maximum entropy is given for the most uniform model that is consistent with the constraints given by the classifications derived from the training data. These constraints are represented by word-to-class assignments. Simple word counts serve as the initial weights for the word-class pairs. Higher word counts result in a higher probability that a certain word belongs to a specific class. Finally, the optimal model is determined by an iterative procedure. For further detail on the computation of the maximum entropy classification readers should consult, e.g., Berger et al. [5] and Nigam et al. [49].

3 Data

Data acquisition using Twitter streaming API
As mentioned in Section 2.1, the platform Twitter was selected as a research corpus. The microblogging texts may be accessed via an application programming interface (API) provided by Twitter Inc. For our research work, we have used the Twitter streaming API with the basic access level "Spritzer." The Twitter streaming API enables clients to continuously record texts on publication. As it is not possible to automatically access tweets that are older than a week, it is important to record them in order to make a research corpus of texts available.
The basic access level has the limitation that only about 1% of all tweets may be accessed via the streaming API on publication. The selection of the accessible tweets is a random process. Clients of the API are informed when a so-called rate limitation has occurred. The access to the streaming API may be parameterized. Amongst others, a spatial parameter in the form of a bounding box may be specified. It has already been found that only about 1-3% of all tweets are georeferenced [2,41], which coincides with our observations. The parameterization of the streaming API with a bounding box (5.8°E, 45.8°N; 15.1°E, 55.1°N) means the requested proportion of tweets is drastically reduced. Thus, although the Twitter streaming API is a black box, we assume that through the parameterization we are able to request the vast majority of the georeferenced tweets in our study region of Germany. This assumption is supported by the fact that there is no more rate limitation feedback within the process of requesting the tweets and also by the coincidence of 1% accessible tweets and the 1-3% of georeferenced tweets. We only stored tweets whose position information was created via a GPS module, which account for about 80% of all georeferenced tweets. The period of data collection was September 2012 to April 2013. In the following sections we will also term the collected tweets "documents."

Filtering of raw data
Data acquisition was followed by several post-processing procedures: (1) All tweets that were within the parameterized bounding box, but not within the study region of Germany, were removed. (2) For each tweet we performed language detection based on n-grams. We use the implementation of the Apache Tika library [44]. As this method of language detection is not highly reliable for short texts, we additionally use the language setting of the user profile. Only if both methods yielded German as the language of the tweet was this tweet considered for further analyses. (3) We removed all tweets that were not created on one of the following Twitter clients: iPhone/iPad, Android, and BlackBerry. These were the most common operating systems on mobile devices during the data collection period. By this step, we aim to remove tweets that have been created by clients using automatic procedures, such as Foursquare, Instagram, and other services that implement the Twitter API. As this kind of tweet is not user-generated content in the narrow sense, it would bias the study results. We assume that the mentioned clients indicate actual human usage. Moreover, we assume that mobile devices are more often used in mobile contexts than clients that run in a classic web browser. Figure 3 illustrates the relative proportions of each attribute used for filtering the raw data.
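The three filtering steps can be sketched as a single predicate. This is only an illustration under stated assumptions: the dictionary keys (`lon`, `lat`, `detected_lang`, `profile_lang`, `client`) are hypothetical names and not fields of the Twitter API, the language codes are assumed to be ISO 639-1, and the study-region test is passed in as a function because the actual test against the outline of Germany is not described here. In the usage example the bounding box stands in for that test.

```python
MOBILE_CLIENTS = {"iPhone", "iPad", "Android", "BlackBerry"}
# Bounding box used for the stream request: lon_min, lat_min, lon_max, lat_max
BBOX = (5.8, 45.8, 15.1, 55.1)


def in_bbox(lon, lat, bbox=BBOX):
    """True if the coordinate lies inside the bounding box."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


def keep_tweet(tweet, in_study_region):
    """Apply the three filters; `tweet` is a dict with hypothetical keys."""
    # (1) spatial filter: tweet must lie inside the study region
    if not in_study_region(tweet["lon"], tweet["lat"]):
        return False
    # (2) language filter: detected language AND profile language must be German
    if tweet["detected_lang"] != "de" or tweet["profile_lang"] != "de":
        return False
    # (3) client filter: only the listed mobile clients are kept
    return tweet["client"] in MOBILE_CLIENTS


# Invented example records (assumption)
tweets = [
    {"lon": 13.74, "lat": 51.05, "detected_lang": "de", "profile_lang": "de", "client": "Android"},
    {"lon": 13.74, "lat": 51.05, "detected_lang": "de", "profile_lang": "en", "client": "iPhone"},
    {"lon": 13.74, "lat": 51.05, "detected_lang": "de", "profile_lang": "de", "client": "Foursquare"},
    {"lon": 2.35, "lat": 48.86, "detected_lang": "de", "profile_lang": "de", "client": "Android"},
]
kept = [t for t in tweets if keep_tweet(t, in_bbox)]
```

Requiring agreement between both language signals trades recall for precision, which matches the stated concern that n-gram detection alone is unreliable for short texts.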
Figure 4 shows a comparison between population density and tweet density for the filtered tweets in the study region of Germany. The patterns are broadly similar for both variables. This is confirmed by the correlation coefficient (Pearson) of 0.94. The highest densities in both estimations are in Berlin, in the Rhine area, in Hamburg, and in Munich. This also confirms an earlier statement that "where there is electricity [people], there are tweets" [41]. For the tweet density estimation there is a considerably higher concentration in metropolitan areas, which may be explained by the fact that there is a higher proportion of young people in these regions. Hence, there is also a higher proportion of people who have a high affinity for using mobile devices and services such as Twitter.

Points of interest
Points of interest are point-shaped objects that have a distinct meaning in the use of maps and navigation systems depending on the scope of the map or the navigation system [66]. These may be objects such as shops, restaurants, hospitals, or tourist attractions. Objects that are particularly prominent are also termed landmarks [60]. Within our model, we use classified points of interest (POI) to simulate the context of the proximity of each tweet, extracting POI data from OpenStreetMap (OSM).
In earlier studies, it has been found that the OSM POI data is rather heterogeneous [48] and less complete compared to data of the private navigation data provider TeleAtlas. However, later investigations [47] have also shown a rapid growth of the OSM dataset, which supports the assumption that the situation has much improved since 2009. For our investigations, we have used OSM data dating from October 2013. They were directly extracted from the OSM dump using the tool osm2pgsql. Table 1 shows a summary of all feature classes used for our analyses. Polygon features have been abstracted geometrically using their centroid. Data inconsistencies are part of the nature of the OSM project, as the classification of POIs in OpenStreetMap is a collaborative process. Although there are mapping guidelines documented in the Wiki of the project [50], their interpretation remains a subjective process. On some occasions there is even a lack of consistent mapping conventions. However, we assume that these inconsistencies are not of high relevance for our work, since we focus on relatively unambiguous POI feature classes.

Figure 4: Comparison of population density and tweet density in Germany. Left: population density (~80.5 million people), right: tweet density on mobile devices with German as detected language (period 9/2012-4/2013, ~500k tweets, cf. Figure 3). Both visualizations are based on kernel density estimation (KDE) using a Gaussian kernel with radius of 25km. Pearson's correlation coefficient between both KDEs is r = 0.94.

Methods and results
As we explained in the introduction to this paper, our aim is to explore the extent to which georeferenced microblogging content can be related to its location. In order to analyze the potential link between tweets and location, we need a model for spatial context. One solution is using a set of classified points of interest as representative of some form of spatial context. Hence, the question of correlation between location and content can be reformulated as: are the contents of the tweets related to nearby POI feature classes?
Appropriate methods that allow the classification of the relationship between tweet contents and POI feature classes are required.In the following we introduce three different methods with varying degree of automation for classifying tweet contents: fully manual

Manual classification
Methods
The manual classification is the simplest, but most time consuming and least scalable, approach from a methodological point of view. Each text is classified by one or more so-called human annotators. The only classification rule that they are given is to evaluate whether a text is related to a specific POI feature class. A classification by multiple annotators reduces the effect of subjective ratings. An odd number of annotators is advantageous, as then for each text classification a majority judgment is possible. The set of all classified texts may subsequently be used as input for supervised machine learning. In information science this is also called a gold standard [70].
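Deriving a gold-standard label from several annotators by majority judgment can be sketched as follows. The boolean labels (related / not related) and the example votes are assumptions made for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by most annotators.

    With an odd number of annotators and two possible labels
    a tie cannot occur, so the result is always well defined.
    """
    label, _count = Counter(labels).most_common(1)[0]
    return label


# Three hypothetical annotators judging two texts
text_a = majority_label([True, False, True])
text_b = majority_label([False, False, True])
```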
Here, we use the following geographically specific feature classes: "railway station," "cinema," "restaurant," and "supermarket," whose relatedness to 5,000 randomly selected tweets was evaluated by three human annotators. 2,500 of the 5,000 tweets were selected considering the constraint that they should not be more than 250m away from the nearest instance of the respective POI feature class. This allows us to directly analyze whether the proportion of related tweets is higher in the set of "near tweets" than in the set of "random tweets," where distance to POIs was not used as a selection criterion. Furthermore, as we initially assume a location-content correlation, a high proportion of related tweets in the set of "near tweets" would generate more training samples for the supervised machine learning. In order to illustrate the approach, Table 2 contains five examples of tweets that have consistently been evaluated as being (not) related to the POI feature class "railway station."

Table 2: Five example texts that have consistently been evaluated as (not) being related to the POI feature class "railway station" (translated from German).
Column headings: Text related to POI feature class "railway station"; Text NOT related to POI feature class "railway station"
Then I will taken an earlier train and ignore my reservation
165 Euro for 2 fillets of beef from Paraguay, incl salad and 1 beer each. Nice shop ;)
The InterCity from Hamburg is on time. . . I can't believe it!
#weltbild 2 of 6 boxers look totally ugly.
Nothing bad about Taxi drivers ...
A mother from the southern part Germany reads English books to her tired and bored children at 8am in the train. Why?
Second @Memo for myself: Holding one's hand out of the window is not a reliable temperature measurement
It is only 9:42am and it is damn hot in the local train. I die. AHHHH.
Ran against a wall. Loud laughing started.

The degree of agreement may be expressed by the inter-annotator agreement (IAA). IAA allows inferring how independent the results are from the annotators. Thus, it is a measurement of annotation objectivity. Likewise, it is a measurement for the appropriateness of a method for measuring a specific variable. For its computation, we use the generalization of Fleiss' Kappa [22] for the case of multiple annotators proposed by Conger [14].
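To give an impression of how such an agreement coefficient is obtained, the sketch below computes the standard Fleiss' Kappa from a table of rating counts. Note that this is the textbook formulation and not Conger's generalization, which differs in how chance agreement is estimated; the rating table is invented.

```python
def fleiss_kappa(ratings):
    """Standard Fleiss' Kappa.

    ratings: one row per text; each row lists how many annotators
    assigned the text to each category (rows must sum to the same value).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Share of all assignments that fall into each category
    p_category = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Agreement among annotator pairs for each individual text
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: 4 texts, 3 annotators, categories (related, not related)
example = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(example)
```

A value of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance.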
The feature class "railway station" serves as the first study example. The results showed that annotators predominantly judged microblogging texts to be relevant to this feature class if they are, in the widest sense, about the topic "public transport," e.g., texts about delayed, crowded, or messy trains; the departure or arrival in a city; unusual events in trains or at railway stations; as well as texts about having just caught or missed a train. Table 3 shows a detailed IAA analysis for this classification. According to the interpretation scheme of Fleiss' Kappa proposed by Landis et al. [39], the IAA of the set "near tweets" corresponds to "almost perfect" agreement, while the IAA of the set "random tweets" corresponds to "substantial" agreement. The slightly better agreement for the set "near tweets" may be a result of the higher number of relevant tweets in this set, which enabled the annotators to evolve their decision criteria more precisely. For the feature classes "cinema," "restaurant," and "supermarket," 5,000 tweets were also annotated with regard to their relevance to their respective feature classes. [...] for the feature class "railway station," with the least agreement being found for "restaurant" and "supermarket." Table 5 shows the proportions of the relevant tweets with respect to the distance class of the nearest object of the corresponding feature class. It can be seen that the proportion of relevant tweets depends strongly on the distance to the nearest object of the feature class "railway station." For tweets near a railway station, i.e., tweets that are closer than 250m to the nearest centroid of a railway station, the proportion of relevant tweets is about 10%, whereas for tweets that are further than 250m from the closest railway station object the proportion of relevant tweets is only 2%. For the other investigated feature classes the dependency of the proportion of relevant tweets on the distance is less pronounced. For the feature classes "cinema" and "supermarket" there is a decrease of 50% in the proportion of relevant tweets between near and distant tweets. For the feature class "restaurant" no meaningful difference in the proportion of relevant tweets may be observed.
A possible reason for that may be the selected distance threshold. As cinemas, restaurants, and supermarkets are usually smaller than railway stations, their zone of influence may also be smaller. Thus, in Section 4.2 we describe a method which aims to derive a continuous analysis of distance dependency between content and POIs.
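The comparison of near and distant tweets can be illustrated with a short sketch. It assumes great-circle (haversine) distances on a spherical Earth and a brute-force nearest-neighbor search, neither of which is stated in the text; the coordinates and relevance flags are invented, and the threshold parameter makes it easy to try values other than 250m.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_poi_distance(tweet_lonlat, poi_lonlats):
    """Distance in meters from a tweet to the closest POI centroid."""
    return min(haversine_m(*tweet_lonlat, *poi) for poi in poi_lonlats)


def relevant_share_by_distance(tweets, pois, threshold_m=250.0):
    """tweets: list of ((lon, lat), is_relevant). Returns (share_near, share_far)."""
    near, far = [], []
    for lonlat, is_relevant in tweets:
        bucket = near if nearest_poi_distance(lonlat, pois) <= threshold_m else far
        bucket.append(is_relevant)

    def share(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return share(near), share(far)


# Invented example: one POI centroid and three tweets
pois = [(13.7320, 51.0400)]
tweets = [
    ((13.7320, 51.0400), True),
    ((13.7321, 51.0401), False),
    ((13.7320, 51.0500), False),
]
share_near, share_far = relevant_share_by_distance(tweets, pois)
```

For corpora of realistic size a spatial index would replace the brute-force search, but the logic of the comparison stays the same.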

Supervised machine classification using manually classified training data
Manual classification is a time- and resource-intensive process and thus automation is desirable. This would support investigating larger sets of tweets with respect to the location-content correlation. As introduced in Section 2.3, such automation may be achieved through supervised machine classification. In previous research work, a method that is similar to ours has been used for the classification of situational awareness [64] during mass emergencies, where training data was manually annotated with respect to the relevance of individual tweets to events.
Here, our training data was created by annotating tweets that are (or are not) related to a specific feature class. We use the results of the manual classification as gold standard training data and the natural language processing software Mallet [45] for implementation. It has already been reported that maximum entropy outperforms naive Bayes in many cases (but not in all cases) of text classification [49]. As the performance of a classification algorithm depends on the classification scenario, a comparison of both algorithms is recommended and was undertaken in our work.

Pre-processing of microblogging texts
The frequent occurrences of (internet) slang, abbreviations, and misspellings make the automatic text classification more challenging. In order to overcome these problems, we pre-processed the texts. First, we removed URLs, punctuation, special characters, and emoticons. Subsequently, we standardized terms by lemmatization [27] and stemming [11,18] and removed very common so-called stop words which do not contribute to class disambiguation.
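A reduced version of such a pre-processing chain is sketched below. It covers the removal of URLs, punctuation, special characters, emoticons, and stop words; lemmatization and stemming are omitted because they require language-specific resources. The regular expressions and the very short German stop word list are illustrative assumptions, not the ones used in the study.

```python
import re

# Tiny illustrative stop word list (assumption); real lists are much longer
STOP_WORDS = {"der", "die", "das", "und", "ist", "in", "im", "ich"}


def preprocess(text, stop_words=STOP_WORDS):
    """Return a list of cleaned, lower-cased tokens."""
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    # Replace everything that is neither a word character nor whitespace,
    # which covers punctuation, special characters, and emoticons
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [token for token in text.split() if token not in stop_words]


tokens = preprocess("Der Zug ist zu spät :) http://t.co/abc #bahn")
```

Note that the hash sign is stripped while the hashtag word itself is kept, so thematic markers remain available as features.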

Methods for the evaluation of the results
In order to evaluate the results of the supervised machine classification, we used the confusion matrix shown in Figure 5. From this scheme four classic evaluation criteria from information retrieval can be derived: precision, recall, F-measure, and accuracy. The precision (P) of the machine classification denotes the ratio of documents (tweets in our case) correctly assigned to a class to all documents assigned to that class. High precision implies that many documents that have been found are actually relevant for a specific class.
Recall (R) describes the ratio of documents correctly assigned to a class to all documents that are actually relevant to that class. High recall implies that most of the relevant documents were found by the machine classification, while a low recall indicates that many relevant documents were not identified.
The F-measure (F) is the harmonic mean of precision and recall. As machine classification may either be optimized for good precision or for good recall, this measure may be used to find an optimal solution for both criteria.
Accuracy (A) describes the ratio of all correct classifications to all classifications made and, by contrast to the previous measures, is computed across all possible classes.
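The four criteria follow directly from the cells of a two-class confusion matrix. The sketch below uses the standard definitions, with accuracy computed as the share of correct classifications among all classifications; the counts in the example are invented and do not correspond to any result reported here.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, f_measure, accuracy) from confusion-matrix counts.

    tp: relevant documents classified as relevant
    fp: non-relevant documents classified as relevant
    fn: relevant documents classified as non-relevant
    tn: non-relevant documents classified as non-relevant
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy


# Invented counts for illustration
precision, recall, f_measure, accuracy = evaluate(tp=8, fp=2, fn=8, tn=82)
```

In this example accuracy is high although half of the relevant documents are missed, which shows why accuracy alone is a weak criterion when the relevant class is rare.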
Tuning of the classifier
The result of the supervised machine classification for each document is a probability value for each possible class, where the sum of all probabilities is 1. In the case of two possible classes, the default threshold that distinguishes both classes is 0.5. By using manually classified test data the optimum threshold with regard to the F-measure can be identified.
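One simple way to identify such a threshold is a grid search over candidate values, sketched below. The candidate grid (0.01 to 0.99 in steps of 0.01), the rule that a document counts as relevant when its probability is at least the threshold, and the example probabilities are all assumptions made for illustration.

```python
def f_measure(predicted, actual):
    """F-measure of boolean predictions against boolean reference labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probabilities, actual, candidates=None):
    """Return the candidate threshold with the highest F-measure.

    If several thresholds tie, the smallest one is returned.
    """
    candidates = candidates or [i / 100 for i in range(1, 100)]
    return max(
        candidates,
        key=lambda t: f_measure([p >= t for p in probabilities], actual),
    )


# Invented classifier outputs (probability of "relevant") and reference labels
probabilities = [0.9, 0.4, 0.35, 0.2, 0.1]
actual = [True, True, True, False, False]
threshold = best_threshold(probabilities, actual)
default_f = f_measure([p >= 0.5 for p in probabilities], actual)
tuned_f = f_measure([p >= threshold for p in probabilities], actual)
```

In this toy case the default threshold of 0.5 misses two relevant documents, whereas a lower threshold recovers them without introducing false positives. The threshold should be tuned on data that is not reused for the final evaluation, otherwise the reported F-measure is optimistic.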

Results
In order to compare the performance of NB and ME, we used the set of tweets that were manually classified regarding their relatedness to the feature class "railway station." The set of "near tweets" serves as training data and the set of "random tweets" serves as test data. Using different sets for training and testing a classifier may lead to slight underestimation of the classification performance. However, using both sets for training and testing ensures that all manually classified tweets are employed. Moreover, we do not expect related topics to be significantly different in both sets.

It may be seen that ME reaches both higher precision (0.95 versus 0.84) and also higher recall (0.53 versus 0.32). A comparison of the F-measures using classified tweets of all four manually annotated feature classes as training and test data confirms the better performance of ME for this classification task (Table 7). In what follows, we therefore restrict ourselves to the use of ME in the classification task. The example of the feature class "railway station" allows us to investigate the relationship between the number of relevant training documents and the classification performance expressed by precision, recall, and F-measure. For this purpose, some of the manually classified documents were removed from the training process. Figure 6 shows that at least 100 training texts are needed to reach a performance of F-measure > 0.5. It may be seen that a further increase of the number of training documents also results in a better classification performance. However, above approximately 200 training texts the rate of performance improvement gained by using more documents for training decreases.
Table 8 shows a part of the ME classification model resulting from the training with the 246 (cf.tweets"). The high weights of the toponyms "Braunschweig," "Berlin," and "Dresden" are due to the coincidence that each of them occurs in three texts that have been classified as being relevant to the POI feature class "railway station," usually in the context of arriving in or departing from the corresponding city. As these toponyms did not occur in texts that were not relevant, they seem to be significant for the feature class railway station from the perspective of the ME algorithm. Furthermore, the weights of both feature vectors show that there are indeed highly significant words that indicate relevance to railway station objects, whereas there is no word that equally significantly indicates that a certain text is not relevant to railway station objects.
In the next step, ME classification is used to automatically classify a random set of 100,000 tweets. Similar to Table 5, Table 9 shows the proportions of tweets that are relevant to the tested feature classes using supervised classification. For the feature class "railway station" the results of the automatic classification are similar to those of the manual classification. The lower proportion of relevant tweets (9.8% versus 6.4%, 2.2% versus 1.4%, 3.3% versus 2.0%, cf. Table 5, Table 9) may be explained by the low recall of the ME classification (0.53, cf. Table 6). The results of the three other tested feature classes show a higher deviation from the corresponding results of the manual classification. The main reason for this is the low quantity of training documents for these feature classes (51 for "cinema," 60 for "restaurant," and 30 for "supermarket," in contrast to 246 for "railway station"; Table 5), which results in a lower performance of the supervised machine classification (Figure 6). However, for the POI feature classes "railway station" and "cinema," the dependency of the proportion of relevant tweets on the distance to the nearest instances that has been shown with manual classification (Table 5) is also confirmed by the results of the machine classification.
In order to test for statistical significance in the difference of average distances between relevant and non-relevant tweets, the distributions of the distances of both classes may be compared with each other. Figure 7 shows such an analysis using kernel density estimation for the example of the feature class "railway station." It may be seen that the distribution of the distances of the relevant tweets shows a clear shift towards short distances. A test for statistical significance demonstrated a statistically significant distance dependency of the proportion of relevant tweets for the feature class "railway station" for both manually and ME classified tweets (p < 0.01). The pattern of the curve representing the tweets not related to railway stations is mainly a function of the overall distribution of tweets and objects of the class "railway station." Thus, its peak shows the average distance of tweets to railway stations.
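The specific significance test is not named in the text above. Purely as an illustration of how such a comparison of two distance distributions could be carried out, the sketch below uses a one-sided permutation test on the difference of mean distances. The choice of test, the function name, and the distance values are assumptions made for this example only.

```python
import random
from statistics import mean


def permutation_p_value(related, unrelated, n_permutations=2000, seed=0):
    """One-sided p-value for 'related tweets are closer on average'."""
    rng = random.Random(seed)
    observed = mean(unrelated) - mean(related)
    pooled = list(related) + list(unrelated)
    n_related = len(related)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign the class labels and recompute the difference
        rng.shuffle(pooled)
        diff = mean(pooled[n_related:]) - mean(pooled[:n_related])
        if diff >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Hypothetical distances (m) to the nearest railway station
related_m = [40, 60, 70, 80, 90, 120, 150, 200]
unrelated_m = [300, 450, 600, 700, 800, 950, 1100, 1200, 1500, 2000]

p_value = permutation_p_value(related_m, unrelated_m)
print(f"p = {p_value:.4f}")
```

A permutation test makes no assumption about the shape of the distance distributions, which seems appropriate for the strongly skewed distances shown on a logarithmic scale in Figure 7.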

Unsupervised machine classification using lexical training data
As the creation of manual training data is a time-consuming process, we sought an unsupervised method that does not depend on manual training data. This would allow us to easily investigate further POI feature classes with regard to their location-content correlation.
Methods A possible approach is to derive words that are relevant to a certain POI feature class using an existing corpus.In this approach, the significance of sentence co-occurrence could be used to identify relevant words.A similar approach has shown that it is possible to derive overall movie ratings, by analyzing the significance of the co-occurrence of all uniand co-occurring in a movie review with the words "excellent" and "poor" [63].For the work presented in this paper, the names of the POI feature classes are used as entry point to derive co-occurring words.We assume that terms related to our POIs (in German), such as "railway station," "restaurant," "cinema," or "supermarket," may be interpreted as category nouns.
The Wortschatz project [24,53] is a candidate corpus containing pre-processed significance scores for co-occurring words.The news corpus was compiled by automatically crawling news websites and since Wortschatz corpora are available in many languages, the approach may be implemented in languages other than German.We used the corpus "2010-news-10M" which contains 10M sentences [28].It has already been shown that there is a significant overlap between topics discussed in Twitter and topics discussed in news media [72], though the bias in Twitter towards personal life, pop culture, and celebrities needs to be acknowledged.
The significance scores were computed by Biemann et al.'s [6] adaptation of the log-likelihood measure described by Dunning [21]. We used these significance scores to derive pseudo-texts specific to each POI feature class that we want to investigate. Each of these pseudo-texts consists of 40,000 sentences, of which 20,000 belong to the class "relevant to this POI feature class." Each sentence consists of 10 words. The probability that a specific word is selected for such a sentence is shown in (5), where s_{w_i} denotes the significance score of the co-occurrence of the word w_i with the name of the POI feature class. The name of the POI feature class itself is selected with the same probability p_max(w) as the most significant co-occurring word.

$$p(w_i) = \frac{s_{w_i}}{\sum_{j} s_{w_j}} \qquad (5)$$

The remaining 20,000 sentences belong to the class "not relevant to this POI feature class." Likewise, each of these sentences consists of 10 words. For these words, the probability of a specific word (6) is defined by the frequency f_{w_i} of this word in the whole Wortschatz corpus.

$$p(w_i) = \frac{f_{w_i}}{\sum_{j} f_{w_j}} \qquad (6)$$
The sentences created by this procedure subsequently serve as training data for the machine classification (cf. Section 5.2) and thus replace manually classified training data. One limitation of this approach is the ambiguity of words, such as "bank," for which the co-occurrence analysis contains words significant for all possible word meanings, leading to potential misclassifications.
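The generation of pseudo-texts can be sketched as follows. The sketch assumes that words are drawn independently, with probabilities proportional to the co-occurrence significance scores for the relevant class and proportional to corpus frequencies for the non-relevant class. The function names, the tiny vocabularies, and all scores and frequencies are hypothetical and are not taken from the Wortschatz corpus.

```python
import random


def sample_sentences(weights, n_sentences, words_per_sentence=10, seed=0):
    """Draw sentences whose words are sampled proportionally to `weights`."""
    rng = random.Random(seed)
    vocabulary = list(weights)
    w = [weights[word] for word in vocabulary]
    return [
        rng.choices(vocabulary, weights=w, k=words_per_sentence)
        for _ in range(n_sentences)
    ]


def build_pseudo_text(class_name, significance, corpus_frequency, n_per_class=20000):
    """Return (relevant, not_relevant) pseudo-sentences for one POI feature class."""
    relevant_weights = dict(significance)
    # The class name gets the same weight as the most significant co-occurring word
    relevant_weights[class_name] = max(significance.values())
    relevant = sample_sentences(relevant_weights, n_per_class, seed=1)
    not_relevant = sample_sentences(corpus_frequency, n_per_class, seed=2)
    return relevant, not_relevant


# Hypothetical significance scores and corpus frequencies
significance = {"Zug": 900.0, "Gleis": 400.0, "Bahn": 300.0}
corpus_frequency = {"der": 500000, "und": 300000, "Haus": 6000, "Zug": 4000}

relevant, not_relevant = build_pseudo_text(
    "Bahnhof", significance, corpus_frequency, n_per_class=50
)
print(" ".join(relevant[0]))
```

The two lists of sentences would then be passed to the classifier as labeled training documents in place of manually annotated tweets.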

Results
Table 10 shows the top 20 co-occurring words for the word "Bahnhof" (English: "railway station"), their frequencies in the whole corpus, their co-occurrence frequency and the co-occurrence significance scores. It can be seen that there is a significant overlap with the words contained in the model generated from the manually classified texts, e.g., train/trains, railway, track/tracks, main station, urban railway, Inter City etc. (cf. Table 8). This can be interpreted as an indication of the suitability of this corpus as a substitute for manually classified training data. Finally, Table 11 confirms a partial overlap between the feature vectors of the classification model derived from the manually classified documents and the classification model derived from lexical data (cf. Table 8), e.g., railway station, train, track, railway, inter city express, urban railway and main station.

Table 10: Co-occurring words and corresponding frequencies and significance scores for the word "Bahnhof" (English: "railway station," word frequency = 5,925) taken from the "2010 news 10M sentences" corpus of the Wortschatz project.
In order to evaluate the performance of this fully automatic procedure, the classification was tested using the manually classified documents as test data. Table 12(a) shows the confusion matrix. The classification threshold between the two classes has been tuned from 0.5 to 0.1 in order to maximize the F-measure. The result shows that the precision of this approach is good (0.82). However, the recall of 0.33 is significantly lower than for the supervised classification using manually classified training data (0.53, cf. Table 6). Thus, the unsupervised machine classification using lexical training data does not reach the same classification performance as the classification using manually classified training data. However, it does outperform a random baseline model and an inverse distance weighted baseline model. Thus, we can conclude that a significant signal may be detected by this approach, as is confirmed by a comparison of the F-measures shown in Table 13.
Besides the F-measures, a precision-recall graph may be used to analyze the different classification performances for the different POI feature classes (Figure 8). For this purpose, the classification threshold is continuously tuned to different values, which leads either to high precision or to high recall. This type of analysis is possible for all 4 feature classes for which manually classified data, suitable for testing, exist. Both Figure 8 and the F-measures (Table 13) show that the reasonable results achieved for the feature class "railway station" using lexical training data were, with some decline in performance, also achieved for the other three POI classes. However, as shown in Figure 8, in the case of supermarket, higher precision is achieved only at the cost of very low recall.

Computation of the distance dependency

Method
Assuming that all tweets have been classified regarding their relatedness to the respective feature classes, the dependency of the proportion of related tweets on the distance to the closest POI instance of that feature class may be computed as follows. All POI features are modeled as points (Figure 9) and each tweet is attributed to the closest POI instance of the respective class. The original constellation is shown in Figure 9.1. In the next steps, continuously growing distance buffers are computed around all the POI features of a particular feature class. In Figure 9.2 the buffers contain two related (red) and two non-related (grey) tweets, which means that a proportion of 50% of the tweets are related at this distance. Assuming that the related tweets are non-randomly distributed over space, we may expect that the proportion of related tweets decreases as the radii of the buffers around the POIs are increased.
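The buffer procedure described above can be sketched in a few lines. The sketch assumes geographic coordinates in degrees, a haversine distance on a spherical Earth, and cumulative buffers (all tweets within a given radius of their nearest POI). The function names and all coordinates are hypothetical; the example is constructed so that the 500m buffer contains two related and two non-related tweets, as in Figure 9.2.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def nearest_poi_distance(tweet, pois):
    """Distance from a tweet (lat, lon, related) to its closest POI (lat, lon)."""
    return min(haversine_m(tweet[0], tweet[1], lat, lon) for lat, lon in pois)


def related_share_by_radius(tweets, pois, radii):
    """Share of related tweets among all tweets within each buffer radius."""
    distances = [(nearest_poi_distance(t, pois), t[2]) for t in tweets]
    shares = {}
    for r in radii:
        inside = [related for d, related in distances if d <= r]
        shares[r] = sum(inside) / len(inside) if inside else None
    return shares


# Hypothetical POI and tweets: (lat, lon, related?)
pois = [(51.0400, 13.7320)]
tweets = [
    (51.0405, 13.7320, True),   # roughly 56 m away
    (51.0410, 13.7320, True),   # roughly 111 m
    (51.0420, 13.7320, False),  # roughly 222 m
    (51.0440, 13.7320, False),  # roughly 445 m
    (51.0480, 13.7320, False),  # roughly 890 m
    (51.0560, 13.7320, False),  # roughly 1.8 km
]

shares = related_share_by_radius(tweets, pois, [150, 500, 2000])
print(shares)
```

For large numbers of tweets and POIs, the brute-force nearest-neighbor search used here would normally be replaced by a spatial index.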

Results
In a first analysis, we use this method to analyze whether the proportion of related tweets is above average within the proximity of the corresponding POI features. Figure 10 shows the analysis for the 4 feature classes ("railway station," "cinema," "restaurant," and "supermarket") for which the relevance of tweets has been classified manually. It may be seen that there is a significant distance dependency of the proportion of related tweets for the feature class "railway station." Nearest to railway station objects, e.g., within a distance of less than 100m, the proportion of related tweets is about 20%, while the average in the whole corpus is only 3.3%. The proportion of related tweets is approximately inversely proportional to the distance to the closest railway station object. This relationship is also observed for the feature classes "restaurant" and "supermarket." However, the distance dependency of the proportion of related tweets is significantly lower for these feature classes. For example, for the feature class "railway station" the share of related tweets within a distance of 50m is about 7 times higher than the average (~25% versus 3.3%). For the feature classes "restaurant" (~5% versus 2.2%) and "supermarket" (~1.8% versus 0.6%), the proportion within a distance of 50m is only about 2 to 3 times the average. For the feature class "cinema," no distance dependency was observed. This contradicts the results found by manual classification (Table 5) and may be explained by an anomaly in the comparably small test data set. Moreover, Figure 10 confirms the assumption that the different POI feature classes, which differ in their extent and their importance, have varying zones of influence. The threshold of 250m selected in Table 5 is too high to observe a distance dependency in the proportion of related tweets for the feature classes "cinema," "restaurant," and "supermarket."

The goal of the unsupervised machine classification using lexical training data was to enable us to easily investigate further POI feature classes. In order to compare the results with those achieved using manually classified data, we again chose the feature classes "railway station," "cinema," "restaurant," and "supermarket." Furthermore, we selected as additional feature classes "airport," "theatre," "museum," "bakery," "bar/pub," "zoo," "school," and "hospital." The results are shown in Figure 11. Again, it can be seen that for some of the feature classes, e.g., "airport" and "railway station," there is a clear distance dependency in the proportion of related tweets. However, for some other feature classes no such dependency is visible.

Figure 11: Dependency of the proportion of texts that are related to a specific POI-feature class on the distance between the position of text creation and the nearest POI-feature. The results are derived from the unsupervised classified documents using lexical training data. The horizontal lines denote the proportion of tweets related to this POI-feature class in the whole corpus. Panels: railway station (mean=1.78%), airport (mean=0.29%), cinema (mean=0.87%), theatre (mean=0.39%), restaurant (mean=1.14%), bar/pub (mean=0.14%), supermarket (mean=0.66%), bakery (mean=0.02%), zoo (mean=0.09%), museum (mean=0.24%), school (mean=0.32%), hospital (mean=0.21%).
A comparison of the results with those reached using manually classified data shows that the patterns for "railway station," "restaurant," and "supermarket" are qualitatively similar. However, there are significant quantitative differences, which may be explained by the low recall of the automatic classification approach (cf. Section 5.3). Further feature classes that show a clear distance dependency in their proportion of related tweets are "zoo," "hospital," and "theatre." For "bakery," "bar/pub," "museum," and "school" there is no clear distance-dependent trend.
Furthermore, Figure 11 illustrates that the patterns also differ with respect to their zone of influence.For example, the feature class "airport" maintains a relatively high portion of related tweets (5% versus 0.29% in global average) at a distance of about 2km from the nearest POI.This effect may be explained by the size of airports with respect to the positioning of a related POI and the typically peripheral position of airports.

Finally, Figure 12 shows the probability density function of related tweets for multiple POI-feature classes based on lexical training data. From this analysis, it may be concluded that 50% of all tweets detected as being related to railway stations are within 500m of the nearest railway station. This finding could help to improve the automatic georeferencing of tweets, as it may in turn be inferred that a tweet that has been automatically classified as being related to a railway station has a 50% probability of being within 500m of the nearest railway station object (in Germany). However, it should be noted that 30% of the tweets that are explicitly not related to a railway station are also within 500m of the nearest railway station. This simply illustrates that many tweets are located in city centers, where both railway stations and "tweeting people" are typically found. A similar trend is found for airport features, with the key difference that a much smaller proportion of all tweets not related to airports are found nearby, once again illustrating the peripheral positioning of airports. For the other classes graphed (cinema, supermarket, and hospital), the two curves are almost identical, confirming the lack of specificity of distance in explaining the locations where individuals tweeted on these themes.
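The two percentages above describe the probability of being near a railway station given the class of a tweet. The reverse question, how likely a tweet within 500m is to be related, additionally depends on the overall share of related tweets. The sketch below applies Bayes' rule to the rounded figures from the text (50% and 30%) together with the corpus-wide share of 1.78% reported for "railway station" in Figure 11. Combining these three rounded values is an assumption made for illustration, and the function name is hypothetical.

```python
def posterior_related(p_near_given_related, p_near_given_unrelated, prior_related):
    """P(related | near) from the two class-conditional shares and the base rate."""
    numerator = p_near_given_related * prior_related
    denominator = numerator + p_near_given_unrelated * (1 - prior_related)
    return numerator / denominator


# Rounded values from the text; the combination is illustrative only
p_related_within_500m = posterior_related(0.50, 0.30, 0.0178)
print(f"{p_related_within_500m:.1%}")
```

Under these assumptions, only about 3% of the tweets within 500m of a railway station would be related to it, which illustrates why the two directions of the conditional probability should not be confused.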

Manual classification
The two main problems with the manual classification are costs and subjectivity of the classification. While the first problem results from the overall low proportion of relevant texts and the resulting large number of texts that must be classified in order to create an adequate sample, the latter problem is intrinsic to this classification method. Subjectivity may partially be overcome by having multiple annotators classify the texts. The subjectivity of the classification is potentially increased by the necessarily general classification guidelines, where annotators were only asked whether a specific text was related to a specific POI-feature class. The lack of context due to the character limit of Twitter microblogging texts (140 characters) made the task even more challenging and the classification prone to disagreement.
Nevertheless, the inter-annotator agreement (Table 4) lies between 0.48 ("moderate agreement") and 0.75 ("substantial agreement"). This demonstrates that the classification was strongly dependent on the POI feature class, but that reasonable results were nonetheless achievable. One potential reason for these differences between feature classes may be that the names of the feature classes do not serve as category nouns in all cases. For example, many German speakers would prefer to say "going to/being at ALDI" over "going to/being at the supermarket," which has implications for associations with the POI-feature class name.
Furthermore, some of the POI-feature classes may be more ambiguous than others.For example, "food" may be a topic that is related to the feature class "restaurant."However, for an annotator it may not be clear, whether a text about food relates to a restaurant or some other location.
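The agreement statistic is not named in this section; the labels "moderate" and "substantial" correspond to the scale commonly used for kappa statistics. Assuming Cohen's kappa for two annotators, a minimal sketch of the computation looks as follows. The function name and the annotations are hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: 1 = related to the POI feature class, 0 = not related
annotator_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
annotator_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

In this example the annotators agree on 8 of 10 texts, yet kappa is only about 0.52, because with a low share of related texts much of the raw agreement is expected by chance.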

Supervised machine classification
The main benefit of this approach is that an arbitrary number of texts may be rapidly classified with some known classification quality so long as training and testing data is available.
For the POI-feature class "railway station" a good classification performance was achieved (precision=0.95, recall=0.53, F-measure=0.68, cf. Table 6). However, for the other tested feature classes the results were less convincing (cf. Table 7). One key reason for this poor performance is likely to be the lower number of training samples for these feature classes. While for the class "railway station" we worked with 246 manually classified texts, for "cinema" (51), "restaurant" (60), and "supermarket" (30) significantly fewer samples were identified during the manual classification task, which also suggests a lower proportion of tweets related to these feature classes in the whole corpus. Our sensitivity tests indicated that at least 100 samples are needed in order to reach a good classification performance (i.e., F-measure > 0.5, cf. Figure 6). One approach to increasing the sample size would be simply to build a larger training data set, for example by using crowdsourcing in the classification task (cf. [9,20,36]).
Thus, a key drawback of the machine learning approach is a loss of classification quality. While reasonable precision is maintained, the sample size is too small to give comparable recall values. This is probably a result of the often very short and very specific texts. Furthermore, the complexity of NLP caused by the use of slang may only partially be resolved by the applied techniques of lemmatization and stemming.
With regard to the analysis of the location-content correlation, the results of this classification method lead to similar conclusions if enough training data is available (cf.Table 5, Table 9).

Unsupervised machine classification
The main benefit of this classification approach is that the costly training data generation process is not necessary.The analysis of the significance of the co-occurring words in the applied corpus "Wortschatz news" showed a partial overlap with the classification model derived from manual classification (Table 8, Table 10, Table 11).However, the substitution of manual training data by lexical training data leads to a loss of 33% of classification performance.Likewise, in comparison with the human classification the poor performance is obvious (average F-measure of only 0.32, Table 13).
However, it could also be shown that unsupervised machine classification outperformed random and inverse distance weighted baselines, suggesting that this approach does indeed have potential. Its advantage is that it is independent of the subjectivity of individual annotators and, of course, reduces the need for costly training data.

Distance dependency of the proportion of related tweets
The patterns of the distance dependency of the proportion of tweets related to nearby POI instances are, in the main, qualitatively similar when we compare manual and machine classification (Figures 10 and 11). However, the patterns differ quantitatively, which can be explained by the low recall of the machine classification approaches. Differences in the distance patterns between the investigated feature classes may be explained, on the one hand, by the different arrangement of the feature instances in space and, on the other hand, by the extent to which users are stimulated to write about these feature classes when in their vicinity.
In interpreting the results it is important to note that we did not consider temporal usage patterns.Potentially, if data was filtered for peak usage times of individual feature classes, then distance dependencies might increase.

Points of interest as a model for geospatial context
The simplifications inherent in the chosen model of POIs to represent the geospatial context have an impact on the results of the location-content correlation. For example, in the case of the feature class "railway station," not only the railway stations belong to the context, but also the network of railway lines. Thus, the observed location-content correlation is presumably lower than it would be if tweets created near railway lines were also considered as near tweets by the chosen model. Besides the geometric simplification (points only), the model also contains a semantic simplification of reality. At many locations, not only the classified POIs contained in the applied model serve as stimuli for writing spatially influenced tweets, but also other objects not contained in the model. Thus, other models for the geospatial context might lead to different results for the location-content correlation of tweets. For instance, a higher location-content correlation might be expected for topographic objects with proper names, such as the Eiffel Tower, the Brandenburg Gate, or individual outlets of McDonald's.

Recalling the research questions

(1) How can we represent spatial context in order to investigate the relationship between the information content and its surroundings?
For our approach we used a set of classified points of interest to represent spatial context.The advantage of this approach is that such a set is easily available and the analysis is simple.The disadvantage is however that geometry and semantics are highly abstracted.In contrast to previous research that used toponyms [12,29,67] as well as highly specific regions of interest [2], POIs allow us to model spatial context at high resolutions.
(2) How can individual texts be classified such that content can be related to surroundings?
We explored three options for this classification task: manual classification by human annotators, supervised machine classification using training data generated by human annotation and unsupervised machine classification using training data that is derived from an existing corpus.The main challenge for all three approaches is that microblogging texts are very short and thus contextual information is sparse.However, all three methods tested have strengths and weaknesses, with the important caveat that a larger testing dataset is essential given the low sample sizes of relevant texts in some classes.

(3) Can we automate this classification process by means of machine learning?
Text classification may be automated by applying supervised and unsupervised machine learning. The advantage is that a large set of texts may be classified at a constant classification quality. However, the accuracy of this classification does not match that of human annotations, which are also used as gold standards. For example, for the classification of texts related to the feature class "railway station" a precision of 0.95 and a recall of 0.53 (F-measure=0.68, cf. Table 6) were achieved. For supervised machine learning, at least 100 manually classified texts are necessary in order to reach a good classification performance.

(4) Which learning algorithms would be best suited for this kind of automation?
Previous research has suggested the algorithms naive Bayes, maximum entropy, and support vector machines [51,64].We tested NB and ME as they are implemented in the applied software Mallet [45] finding that ME outperforms NB by 50% on average (cf.Table 7) for the tested feature classes.
(5) Is there a corpus that allows us to appropriately substitute manual training data for the classification task?
Previous work pointed to a thematic overlap between topics discussed on Twitter and topics presented on news websites [72]. This suggests using a corpus generated from news websites to derive training data that may substitute for manual training data. We therefore used the corpus "Wortschatz news" and selected words that significantly co-occur with the titles of POI-features as models to find texts related to these feature classes. The results show that the machine classification using this substitution performs some 30% worse than the machine classification using manually classified training samples (cf. Table 13). However, the results regarding the location-content correlation of the tweets using this substitution are qualitatively similar for 3 of the 4 tested feature classes (Figures 10 and 11). This indicates that if precision can be maintained, even at the cost of low recall, it is still possible to extract meaningful relationships between POIs and the locations of individual tweets.

Our analysis of location-content correlation using the model of the relevance of the texts for selected POI-feature classes does not yield homogeneous results. For some feature classes a significant distance dependency of the proportion of related tweets may be determined. For these feature classes it may be concluded that related tweets are not completely randomly distributed in space. For instance, for the feature class "railway station," at a distance of 100m, about 8-20% of tweets are related to the feature class, a significantly higher proportion than the 2-3% of tweets in the corpus as a whole that are related to the class.
More generally, our results suggest that the impact of nearby POIs on mobile microblogging content is moderate and that its intensity depends on the specific POI-feature class. In the set of tested feature classes, we found a location-content correlation in the proximity of the feature classes "railway station," "airport," "restaurant," "supermarket," "theatre," "zoo," and "hospital." By contrast, for the feature classes "bakery," "bar/pub," "museum," and "school," no location-content correlation was found. Thus, we may infer that some feature classes attract more location-specific mobile microblogging activity than others. From the perspective of the "Twittersphere," these feature classes are prominent. Predicting which feature classes particularly attract Twitter activity within their proximity, however, does not seem to be easily possible, as Twitter activity seems to depend on heterogeneous factors. The differences observed, e.g., between the feature class "railway station" and the feature classes "cinema," "restaurant," and "supermarket" (cf. Figures 10 and 11) suggest that the topic "railway" is much more site and time dependent than the topics "cinema," "restaurant," and "supermarket." Thus, in order to predict which topics attract Twitter activity, one would need to assess which topics are site and time dependent.
Nevertheless, the general conclusion that the current location of the users does not dominate their microblogging-activities is consistent with their intentions [17,32] of using the service, which are in the main daily chatter and personal communication.Both activities do not necessarily need to be influenced by the spatial context.
In order to maintain consistency with prior research it is important to mention that the intensity of the location-content correlation also seems to depend on the scale of the analysis.Such a correlation would be consistent with previous findings [2,12,29,30,56,67], which demonstrated a location-content correlation at small and medium cartographic scales.
Last but not least, location-content correlation is also likely to depend on the temporal dimension, as topics which are relevant in social networks seem to depend on time [69].We assume that location-content correlation would be higher, if we consider temporal patterns.However, we have not tested this in the current research work.

Implications of the findings
The findings regarding the low correlation between location and content of mobile generated microblogging texts at high resolution have implications for the automatic georeferencing of microblogging texts at large cartographic scales. The potential increase in the accuracy of georeferencing gained by using the spatial correlation of texts to classified points of interest is limited and depends on the specific POI-feature class. One possible approach to partially overcoming this limitation might be the analysis of the social network of a user, including her/his conversations as well as her/his history of tweets. It seems reasonable to assume that such information might improve the accuracy of georeferencing through tweet content, though we would emphasize that such a correlation remains speculation.
A more general conclusion is that the application of tweet analyses for high resolution applications should be approached with care, as the correlation between the contents and the locations of tweets that is required for these applications is probably often too low. Thus, applications that rely on the existence of a strong location-content correlation (such as spatial opinion, emotion, and sentiment research, decision support systems for natural hazards, or place descriptions) need to demonstrate whether georeferenced tweets are suitable for the corresponding application. Crampton et al. [15] conclude in this context that the scale of analyses of tweets needs to be adapted to their spatial resolution. Furthermore, they emphasize that culture, religion, and language have an impact on spatial patterns of tweets. On the other hand, the ties in what is, after all, also a social network may span long distances. Hence, the structure of these networks also needs to be understood in order to understand the spatial patterns of tweet distribution.
Furthermore, our findings suggest that automatic filtering and validation methods are essential for high resolution spatial analyses of tweets. Given that work with Flickr and other photographically based social media seems to have demonstrated a more direct location-content correlation [16], we suggest that tweets containing photos should be preferred in spatial analysis, particularly at large scales.
In summary, our work implies that treating tweets as being relevant to a set of coordinates with a precision of the order of tens of meters is unlikely to be a sensible approach to exploring such data. There is a pressing need to consider more critically the extent to which the coordinates of a piece of information can be related to location, by considering issues such as scale, abstraction, and more cognitively adequate tessellations of space.

Figure 1 :
Figure 1: Visualization of term frequencies, after stop word filtering, in mobile generated tweets in the center of Dresden.

Figure 2 :
Figure 2: Schematic representation of supervised text classification using manually classified training data (figure is an adaptation of Scharkows' illustration [55]).

Figure 3: Representation of the relative proportions of the different attributes used for the filtering of raw data.

Figure 6: Correlation of the performance of supervised machine classification and the number of training documents. Classification method: ME.

Figure 7: Density plots of the distributions of tweets with regard to their distance to the nearest railway station on a logarithmic scale. Left: results of manual classification; right: results of machine classification.

Figure 8: Relationship of precision and recall for the automatic classification using lexical training data while tuning the classification threshold.

Figure 10: Dependency of the proportion of texts that are related to a specific POI feature class on the distance between the position of text creation and the nearest POI feature. The results are derived from the manually classified documents. The horizontal lines denote the proportion of tweets related to this POI feature class in the whole corpus.
[Figure residue. x-axis: distance to nearest POI object (m); y-axes: share of tweets related to the respective POI feature class (%) and share of related / non-related tweets (%); legend: related and non-related to railway station, cinema, supermarket, airport, and hospital.]

Figure 12: Probability density function of the proportion of related tweets in relation to the distance between the position of text creation and the closest POI-feature.

(6) Does the proportion of texts related to location-specific information show a decay over distance? In other words, are the locations of the texts that relate to specific locations non-randomly distributed in space?
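One way to probe this question is to bin tweets by their distance to the nearest POI feature and compute the share of related tweets per bin. The following is only a minimal sketch under stated assumptions: the function name, the bin edges, and the toy data are hypothetical and do not reproduce the procedure or the data used in this paper.

```python
from bisect import bisect_right


def related_share_by_distance(distances_m, related, bin_edges_m):
    """Return the share (%) of related tweets per distance bin.

    distances_m: distance of each tweet to its nearest POI feature (metres)
    related:     parallel booleans (True = tweet relates to the POI class)
    bin_edges_m: ascending bin edges; bin i covers [edge[i], edge[i + 1])
    Bins without any tweets yield None; tweets outside the edges are skipped.
    """
    n_bins = len(bin_edges_m) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for distance, is_related in zip(distances_m, related):
        i = bisect_right(bin_edges_m, distance) - 1
        if 0 <= i < n_bins:
            totals[i] += 1
            hits[i] += int(is_related)
    return [100.0 * h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical toy data, not taken from the corpus.
shares = related_share_by_distance(
    distances_m=[10, 20, 150, 180, 600],
    related=[True, True, True, False, False],
    bin_edges_m=[0, 100, 500, 1000],
)
```

A share that falls with increasing distance and approaches the corpus-wide proportion would be consistent with a distance decay; a flat profile would not.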

Table 1: POI feature class, corresponding OSM tag, and number of features in the study region of Germany extracted from the OSM dump (date of dump: 23/10/2013).

[...] classification, supervised machine classification using manual training data and unsupervised machine classification using lexical training data.

Table 3: Detailed IAA analysis of the annotation of tweets according to their relevance to the POI feature class "railway station." The column "gold standard" contains the majority votes of all annotators.
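A gold standard built from majority votes can be illustrated as follows. This is a minimal sketch: the function name and the tie handling (returning `None`) are assumptions, since the caption does not state how ties were treated or how many annotators took part.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None if the top count is tied."""
    counts = Counter(labels).most_common()
    (top_label, top_count), *rest = counts
    if rest and rest[0][1] == top_count:
        return None  # assumption: ties are left undecided
    return top_label


# Hypothetical judgments of three annotators for one tweet
# (True = related to "railway station").
gold = majority_vote([True, True, False])
```

With an odd number of annotators and a binary judgment, ties cannot occur, so the `None` branch only matters for even-sized annotator groups.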

Table 4 contains the IAA for these annotations, illustrating that the highest agreement was found

Table 4: IAA of the relevance judgment of tweets to different POI feature classes.

Table 5: Portion of tweets that are related to different POI feature classes.

Table 6: Confusion matrices for ME and NB supervised classification. Training data: gold standard of 2,500 manually classified tweets (related to "railway station": yes/no) within a distance of less than 250 m of the nearest railway station. Test data: 2,500 tweets at a random distance from the nearest railway station.

Table 7: Comparison of the performance of ME and NB classification using the F-measures for feature class identification.
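For reference, an F-measure can be derived from the counts of a binary confusion matrix as sketched below. The sketch assumes the balanced F1 variant (the caption does not state which F-measure is used), and the counts are hypothetical rather than taken from the tables.

```python
def prf_from_confusion(tp, fp, fn):
    """Return (precision, recall, F1) from binary confusion-matrix counts.

    tp: related tweets classified as related
    fp: non-related tweets classified as related
    fn: related tweets classified as non-related
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = prf_from_confusion(tp=8, fp=2, fn=8)
```

Because true negatives do not enter the computation, the measure stays informative when related tweets are a small minority of the corpus.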

Table 5) manually classified texts of the feature class "railway station" (set "near

Table 8: Top 20 features for the class "related to railway station" and the class "not related to railway station." The weights are determined by ME using manually classified training data. German words are stems; English words are translations (not stemmed).

Table 9: Portion of tweets that are related to different POI feature classes.

Table 11: Top 20 features for the class "related to railway station" and the class "not related to railway station." The weights are determined by ME using lexical training data. German words are stems; English words are translations (not stemmed).

Table 12: Comparison of the classification performance (F-measure) for railway station based on (a) lexical training data, (b) a random model, and (c) an inverse distance weighted model.

Table 13: Comparison of the F-measures for the POI feature classes for which manually classified test data exist.

Table 14: Schematic comparison of the three classification methods.