Prediction of employment and unemployment rates from Twitter daily rhythms in the US

By modeling macroeconomic indicators using digital traces of human activities on mobile or social networks, we can provide important insights into processes previously assessed via paper-based surveys or polls only. We collected aggregated workday activity timelines of US counties from the normalized number of messages sent in each hour on the online social network Twitter. In this paper, we show how county employment and unemployment statistics are encoded in the daily rhythm of people by decomposing the activity timelines into a linear combination of two dominant patterns. The mixing ratio of these patterns defines a measure for each county that correlates significantly with employment ($0.46\pm0.02$) and unemployment rates ($-0.34\pm0.02$). Thus, the two dominant activity patterns can be linked to rhythms signaling the presence or lack of regular working hours of individuals. The analysis could provide policy makers with better insight into the processes governing employment, where problems could be identified not only based on the number of officially registered unemployed, but also on the basis of the digital footprints people leave on different platforms.


Introduction
Until recently, collecting and analyzing data about individual humans at a large scale was time-consuming, costly and arduous work. With the advent of the digital era, there is a growing amount of data accessible online that enables the analysis and modeling of human behavior.
However, our understanding of these digital data sources and the methods that connect the data to real-world outcomes is still limited.
Several aspects of the possible use of mobile phone records and social media status updates in the estimation of official data, such as census, demographic or land use records, have been discussed in recent papers. A promising approach is the analysis of the diurnal rhythm of humans. Due to the 24-hour periodicity of the Earth's rotation, we are biologically bound to show daily periodic behavior both at the individual and at the aggregate level. This periodic cycle is governed mainly by internal biochemical processes [1,2,3,4], but the impact of external factors and the environment also leaves its imprint on these daily patterns [5,6].
As Saramäki and Moro point out in their paper [7], an interesting application is to consider the geospatial aspects of the aggregate level of daily rhythms, as it can provide insight into several different phenomena ranging from the actual land use patterns in a city [8,9,10,11,12,13,14,15,16,17,18] and on a campus [10], to the tracking of anomalous events [19,18], or the estimation of population size [20], mobility patterns [21], poverty [22] or crime rates [23] in a certain area.
Because these aggregate patterns always consist of the superposition of the daily rhythms of individuals, it is worth investigating how the main features of the aggregate level form from superposition. If we can cluster individuals into more or less homogeneously behaving groups based on their daily patterns [24], then the aggregate pattern can be understood as the combination of the group patterns, and the group that has more individuals dominates the aggregate daily rhythm. The groups of individuals can form along many demographic and/or socioeconomic factors, of which being employed and going to and from work at regular hours is the most determining one with respect to the daily activity patterns. Thus, decomposing the groups from the aggregate patterns in different geographical regions may give insight into the estimation of employment statistics in that region.
Nowcasting or estimating unemployment rates using the digital traces of search engines has already been the focus of several papers [25,26,27]. It has also been shown that daily activity patterns of individuals can be linked to the regularity of their working hours [28]. Because the loss of a job has severe psychological consequences [29], the effects of a mass layoff can be detected in the unemployment rates and provide a possibility of forecasting macroeconomic effects based on the observation of several individuals [30]. In [31], there is strong evidence that the aggregated daily activities of geographical regions in certain time intervals can be indicative of unemployment rates.
In this paper we obtain 63 million geolocated messages from the publicly available stream of the social network Twitter from the area of the United States sent between January and October 2014. We aggregate Monday to Friday relative tweeting activity for each hour in each US county to form an average workday activity pattern. We then assume that these activity patterns form a roughly linear subspace of the 24-hour "timespace". By finding this linear subspace, that is, by finding the line on which the county patterns lie, we are able to give a measure that is linked to the ratio of two groups of people tweeting in a county. We then show that this measure correlates significantly with county employment and unemployment rates, and that the average patterns corresponding to the two groups can be linked to lifestyles connected to regular working hours or the lack of them. We thus give a possible framework for decomposing the digital activity patterns of geographical regions and linking the decomposition to employment and unemployment rates.

Twitter dataset
We use the data stream freely provided by Twitter through their Application Program Interface, which amounts to approximately 1% of all sent messages. In this study, we focus on the part of the data stream with geolocation information. These geolocated tweets originate from users who chose to allow their mobile phones to post the GPS coordinates along with a Twitter message.
The total geolocated content was found to comprise only a small percentage of all tweets; therefore, with data collection focusing only on these, a large fraction of all geotagged tweets can be collected [32]. Our dataset includes a total of 63 million tweets from the contiguous United States collected between January 2014 and October 2014. These are all geotagged, that is, they have GPS coordinates associated with them. We construct a geographically indexed database of these tweets, permitting the efficient analysis of regional features [33]. Using the Hierarchical Triangular Mesh scheme for practical geographic indexing [34,35], we assigned a US county to each tweet. County borders are obtained from the GADM database [36].

Demographic datasets
For the population-weighted linear model of the next section, we obtain county-level population statistics from the US 2010 Census [37]. We download the unemployment and labor force data for the time window of the Twitter dataset from the Local Area Unemployment Statistics page of the Bureau of Labor Statistics [38]. We take an average of the months ranging from January 2014 to October 2014 for each county.
Though unemployment levels are defined as the number of unemployed per total labor force in a county, we define the share of employed as the number of employed divided by the whole population of a county. This measure fits the model for the daily rhythm better, as discussed in the Results section.
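The difference between the two denominators can be illustrated with a short sketch. The numbers below are hypothetical, and averaging the monthly rates (rather than the monthly counts) is an assumption, since the text only states that the months are averaged.

```python
def unemployment_rate(unemployed, labor_force):
    """Unemployed persons per total labor force (the usual definition)."""
    return unemployed / labor_force


def employment_share(employed, population):
    """Employed persons per whole county population (the measure used here)."""
    return employed / population


def mean_over_months(monthly_values):
    """Plain average over the monthly figures of one county."""
    return sum(monthly_values) / len(monthly_values)


# Hypothetical county with 100,000 inhabitants and two example months.
population = 100_000
labor_force = [50_000, 50_000]
unemployed = [4_000, 3_000]
# The labor force consists of the employed plus the unemployed.
employed = [lf - u for lf, u in zip(labor_force, unemployed)]

rate = mean_over_months(
    [unemployment_rate(u, lf) for u, lf in zip(unemployed, labor_force)]
)
share = employment_share(mean_over_months(employed), population)
print(f"unemployment rate: {rate:.3f}, employment share: {share:.3f}")
```

Because the share of employed is normalized by the whole population, it also reflects people outside the labor force, which the unemployment rate ignores.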

Daily activity patterns
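We define a daily activity pattern with hourly resolution for each county; the counties are enumerated by $k = 1 \dots M$. We take all tweets originating from a given county from the period between January and October 2014 and aggregate the Monday-to-Friday activity by hour of day, so that each county is represented by a 24-dimensional vector $y^{(k)}$ of normalized hourly tweeting activities. To improve the quality of our dataset, we consider only those counties in which the overall tweet count during the ten months exceeded the threshold of 1800. Thus, we are left with 1884 counties for our analysis.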
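A minimal sketch of how such workday patterns could be built is given below. The input format (pairs of a county identifier and a local timestamp) and the function name `workday_patterns` are hypothetical. The sketch also assumes that the threshold applies to all tweets of a county and that timestamps are already in local time, neither of which is specified in the text.

```python
from collections import defaultdict
from datetime import datetime

MIN_TWEETS = 1800  # threshold stated in the text


def workday_patterns(tweets, min_tweets=MIN_TWEETS):
    """Return {county: 24 hourly shares summing to 1} for workday tweets.

    tweets: iterable of (county_id, local_datetime) pairs (assumed format).
    """
    hourly = defaultdict(lambda: [0] * 24)
    totals = defaultdict(int)
    for county, timestamp in tweets:
        totals[county] += 1                 # overall count, used for the threshold
        if timestamp.weekday() < 5:         # Monday=0 ... Friday=4
            hourly[county][timestamp.hour] += 1

    patterns = {}
    for county, counts in hourly.items():
        workday_total = sum(counts)
        if totals[county] > min_tweets and workday_total > 0:
            patterns[county] = [c / workday_total for c in counts]
    return patterns


# Tiny usage example with a lowered threshold (6 January 2014 was a Monday).
example = [
    ("A", datetime(2014, 1, 6, 9, 0)),
    ("A", datetime(2014, 1, 6, 9, 30)),
    ("A", datetime(2014, 1, 6, 21, 0)),
    ("A", datetime(2014, 1, 4, 3, 0)),      # Saturday, not part of the pattern
    ("B", datetime(2014, 1, 6, 9, 0)),      # too few tweets, county dropped
]
example_patterns = workday_patterns(example, min_tweets=3)
print(sorted(example_patterns))
```

Normalizing by the workday total makes the patterns of counties with very different tweet volumes comparable.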

Linear model
We assume that the tweeting pattern of a county can be represented by the linear combination of only two universal patterns ($A$ and $B$) that are mixed for each county $k$ with a proportion of $\alpha^{(k)}$ and $1 - \alpha^{(k)}$, respectively. Thus, we identify the two universal patterns that compose the pattern of a county as corresponding to two differently behaving population groups, whose aggregate tweeting patterns form $A$ and $B$. We place no further restriction on these $\alpha^{(k)}$ values; they can be arbitrary real numbers.
Then the predicted activity $x^{(k)}_i$ of a county $k$ in hour $i$ would be the following linear combination:
$$x^{(k)}_i = \alpha^{(k)} A_i + \left(1 - \alpha^{(k)}\right) B_i.$$
Let us denote the weight of each county by $w^{(k)}$, which is proportional to its population $p^{(k)}$,
$$w^{(k)} = \frac{p^{(k)}}{\sum_{k'=1}^{M} p^{(k')}}.$$
We then define the squared error of our model as
$$E = \sum_{k=1}^{M} w^{(k)} \sum_{i} \left(y^{(k)}_i - x^{(k)}_i\right)^2.$$
We would like to minimize this error subject to the two conditions
$$\sum_i A_i = 1, \qquad \sum_i B_i = 1.$$
It can be shown (see SI) that the minimum occurs if $A - B$ is parallel to the eigenvector $m$ corresponding to the largest eigenvalue of the weighted covariance matrix $C$, and that $B$ can be chosen as the average of the $y^{(k)}$. Here, an element of the covariance matrix $C$ is
$$C_{ij} = \sum_{k=1}^{M} w^{(k)} \left(y^{(k)}_i - \bar{y}_i\right)\left(y^{(k)}_j - \bar{y}_j\right),$$
where
$$\bar{y}_i = \sum_{k=1}^{M} w^{(k)} y^{(k)}_i.$$
In both the 2-dimensional and the 24-dimensional case, we now consider a linear representation of the data with a coordinate system where the mean $\bar{y}$ sets the origin and $m$ is the direction of the line. We calculate $\alpha^{(k)}$ values for each county by projecting $y^{(k)}$ onto this line (see SI). A positive $\alpha^{(k)}$ indicates a county where the majority of people are active on Twitter in correspondence with the daily rhythm dictated by $m$; accordingly, a negative $\alpha^{(k)}$ is connected to the opposite pattern.
Because the linear equation system derived from the minimization of the squared error is linearly dependent, the scale on our line is not set (see SI), as $A - B$ is only determined up to an arbitrary scaling factor. Thus, the $\alpha^{(k)}$ values are also determined only up to a scaling factor.
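Let us now choose $A$ and $B$ to be two standard deviations of the $\alpha^{(k)}$ values away from the origin $\bar{y}$ in the two directions of our new linear coordinate system:
$$A = \bar{y} + 2\sigma_\alpha m, \qquad B = \bar{y} - 2\sigma_\alpha m,$$
where $\sigma_\alpha$ denotes the standard deviation of the $\alpha^{(k)}$ values and $m$ has unit length. $A$ and $B$ are both normalized to 1; in the 2-dimensional case their components represent the selected two hours, while in the 24-dimensional case they represent the 24 hours of the day.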
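The fitting procedure of this section can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the original implementation: the function name `fit_two_pattern_model` is hypothetical, $\alpha^{(k)}$ is taken as the signed projection onto a unit-length $m$, the standard deviation is population-weighted, and the data in the usage example are synthetic.

```python
import numpy as np


def fit_two_pattern_model(Y, population):
    """Fit the two-pattern line to county activity patterns.

    Y: (M, d) array, each row a county pattern summing to 1.
    population: length-M array of county populations.
    Returns (alpha, A, B, mean, m).
    """
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(population, dtype=float)
    w = w / w.sum()                        # weights proportional to population

    mean = w @ Y                           # weighted mean pattern (origin)
    D = Y - mean                           # centered patterns
    C = (D * w[:, None]).T @ D             # weighted covariance matrix

    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    m = eigvecs[:, -1]                     # unit eigenvector of the largest one
    # Note: the sign of m is arbitrary, so alpha may come out mirrored.

    alpha = D @ m                          # signed coordinate along the line
    sigma = np.sqrt(w @ alpha ** 2)        # weighted std (weighted mean is 0)
    A = mean + 2.0 * sigma * m             # assumed placement of pattern A
    B = mean - 2.0 * sigma * m             # assumed placement of pattern B
    return alpha, A, B, mean, m


# Synthetic check: counties that are exact mixtures of two known patterns.
rng = np.random.default_rng(1)
A0 = rng.random(24)
A0 /= A0.sum()
B0 = rng.random(24)
B0 /= B0.sum()
t = rng.random(50)                         # true mixing ratios
Y = np.outer(t, A0) + np.outer(1 - t, B0)
pop = rng.integers(1_000, 100_000, size=50)

alpha, A, B, mean, m = fit_two_pattern_model(Y, pop)
print(abs(np.corrcoef(alpha, t)[0, 1]))    # close to 1 for noise-free mixtures
```

Since the rows of `Y` all sum to 1, the leading eigenvector sums to zero, so `A` and `B` keep the normalization of the mean pattern.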

Results and discussion
In this section, we present the description and the discussion of the main results of this paper.
First, we investigate the correlation between the activities of individual hours and employment and unemployment rates, and choose two dimensions with which employment and unemployment levels have maximum or minimum correlations. We then evaluate to what extent the linear model is a valid description of our data for these most separating dimensions (2) and then for all possible dimensions (24) of our dataset. Second, we discuss how the linear models in 2 and 24 dimensions separate the two population groups with the two distinct activity patterns, and give a possible interpretation of these patterns. Third, we connect the two groups with real-world indicators like share of employed in a county, and discuss the plausibility of the correspondence of the daily patterns of the two separate groups to employment status.
We first evaluate population-weighted Pearson correlations for each hour $i$ between the hourly activities $y^{(k)}_i$ and the county employment and unemployment rates. [...] To check the linearity of the model described in the Methods section, we first choose the coordinate system of the hours having the extreme correlation values with employment levels. [...] while it is not for unemployment. A possible interpretation is that a stricter daily rhythm is imposed upon those who are employed; as such, the characteristics of their activity curves produce a stronger overall pattern than that of the unemployed. Nevertheless, the result shows that a high $\alpha^{(k)}$ is significantly bound to higher employment and lower unemployment rates, and that the overall shape of the activity timeline can give us more information than just using one feature of a whole day. The similarity of the regional distributions of $\alpha^{(k)}$, unemployment and employment rates is visualized on the three maps of Fig 6.
Our results are in line with previous research carried out for Spain in [31], where the share of Twitter activity during a window of the morning hours (8-11am), of the afternoon hours (3-5pm) and of the night hours (0-3am) correlated significantly with unemployment rates among 25 to 44-year-old inhabitants of Spanish administrative areas. High morning and low night activity indicated lower unemployment rates, which is in correspondence with our correlations. Although in Spain high afternoon activity correlated positively with unemployment levels, we cannot observe this phenomenon in the US. Due to the bias in the age of Twitter users towards younger age groups [39], our calculated county activity patterns are not representative of the whole population. We believe that our model could be improved by incorporating labor force data detailed by different age groups.
That the correlation with unemployment is significantly lower than the correlation with the labor force share of the population can be related to the fact that the share of employed should overlap more with the population exhibiting the "working" pattern A, whereas officially registered unemployed people are not distinguishable in this context from those who are on maternity leave, are retired, etc. We also believe that there are other inherent reasons, for example the more flexible working hours in the creative industry, that limit the power of such a simple model in explaining the employment patterns of a geographical area.
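The population-weighted Pearson correlation used in this section can be computed as in the sketch below. The function name `weighted_pearson` is hypothetical, and the weights are assumed to be the county populations normalized to sum to 1.

```python
import numpy as np


def weighted_pearson(x, y, weights):
    """Pearson correlation of x and y with non-negative observation weights."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize weights to sum to 1

    mean_x = w @ x
    mean_y = w @ y
    cov_xy = w @ ((x - mean_x) * (y - mean_y))
    var_x = w @ (x - mean_x) ** 2
    var_y = w @ (y - mean_y) ** 2
    return cov_xy / np.sqrt(var_x * var_y)


# Usage with made-up values: activity measure, employment share, population.
measure = [0.10, -0.05, 0.20, 0.00]
employment = [0.48, 0.41, 0.52, 0.44]
population = [120_000, 30_000, 800_000, 55_000]
print(weighted_pearson(measure, employment, population))
```

With equal weights the function reduces to the ordinary Pearson coefficient, which gives a simple way to check an implementation.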

Conclusions
In this paper we analyzed an extensive collection of geolocated tweets originating from the contiguous United States. [...] By projecting county activity patterns onto these lines with the mean as the origin, we obtained a measure for each county that indicates the extent to which the tweeting pattern of a county resembles that of the first eigenvector. This measure has been shown to correlate significantly with county labor force shares and unemployment rates, though in the 2-dimensional case, these correlations could not improve on the performance of the single hourly correlations.
Using all 24 dimensions, we obtained better Pearson correlations of 0.46 ± 0.02 and −0.34 ± 0.02 for employment and unemployment, respectively. The signs of the correlations indicate a relationship where counties exhibiting a higher tweeting activity during the daytime (6am-8pm) have higher employment and lower unemployment rates, and counties with increased night activity can be related to lower employment and higher unemployment rates. These correlations show that even though the Twitter population is biased towards younger age groups, and employment data was considered for all age groups, the underlying relationship between daily activity patterns and employment data can be captured with plausible outcomes.
Our results thus showed that by analyzing a relatively sparse, publicly available geolocated dataset, a very simple model can explain to a significant extent such an important socio-economic indicator as employment/unemployment. We believe that our model can be even further im-

Technical details for the Methods section
We define a daily activity pattern with hourly resolution for each county; the counties are enumerated by $k = 1 \dots M$. Thus, each county $k$ is represented by a 24-dimensional vector $y^{(k)}$, where the elements of $y^{(k)}$ are aggregated normalized hourly tweeting activities.
We assume that the tweeting pattern of a county can be represented by the linear combination of only two universal patterns ($A$ and $B$) that are mixed for each county $k$ with a proportion of $\alpha^{(k)}$ and $1 - \alpha^{(k)}$, respectively. We place no further restriction on these $\alpha^{(k)}$ values; they can be arbitrary real numbers. $A$ and $B$ are both 24-dimensional vectors normalized to 1, the 24 dimensions representing the 24 hours of the day.
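Then the predicted activity $x^{(k)}_i$ of a county $k$ in hour $i$ would be
$$x^{(k)}_i = \alpha^{(k)} A_i + \left(1 - \alpha^{(k)}\right) B_i.$$
Let us denote the weight of each county by $w^{(k)}$, which is proportional to its population $p^{(k)}$, such that $w^{(k)} = p^{(k)} / \sum_{k'=1}^{M} p^{(k')}$. We then define the squared error of our model as
$$E = \sum_{k=1}^{M} w^{(k)} \sum_{i} \left(y^{(k)}_i - x^{(k)}_i\right)^2.$$
We would like to minimize this error subject to the two conditions $\sum_i A_i = 1$, $\sum_i B_i = 1$, which leads to the following expression to minimize with Lagrange multipliers $\lambda_a$ and $\lambda_b$:
$$\mathcal{L} = E + \lambda_a \left(\sum_i A_i - 1\right) + \lambda_b \left(\sum_i B_i - 1\right).$$
The derivatives yield the following equation system:
$$\frac{\partial \mathcal{L}}{\partial A_j} = -2 \sum_{k} w^{(k)} \alpha^{(k)} \left(y^{(k)}_j - x^{(k)}_j\right) + \lambda_a = 0, \tag{8}$$
$$\frac{\partial \mathcal{L}}{\partial B_j} = -2 \sum_{k} w^{(k)} \left(1 - \alpha^{(k)}\right) \left(y^{(k)}_j - x^{(k)}_j\right) + \lambda_b = 0, \tag{9}$$
$$\frac{\partial \mathcal{L}}{\partial \alpha^{(k)}} = -2 w^{(k)} \sum_{i} \left(A_i - B_i\right) \left(y^{(k)}_i - x^{(k)}_i\right) = 0.$$
Summing Eq 8 and Eq 9 over $j$ yields 0 for the Lagrange multipliers $\lambda_a$ and $\lambda_b$, because $\sum_j y^{(k)}_j = \sum_j x^{(k)}_j = 1$. Thus, the problem reduces to minimizing $E$, which measures the weighted sum of squared distances from the line parametrized by $A - B$, $B$ and $\alpha^{(k)}$ for a county $k$.
The equation system is not linearly independent. Thus, we cannot obtain exact values for all $A_j$, $B_j$ and $\alpha^{(k)}$; they depend on each other.
Expressing $\alpha^{(k)}$ from our equation system yields:
$$\alpha^{(k)} = \frac{\sum_i \left(A_i - B_i\right)\left(y^{(k)}_i - B_i\right)}{\sum_i \left(A_i - B_i\right)^2}. \tag{12}$$
The line from which the summed squared distance of the data points is minimal is the line whose direction is parallel to the eigenvector $m$ corresponding to the largest eigenvalue of the covariance matrix $C$, where
$$C_{ij} = \sum_{k} w^{(k)} \left(y^{(k)}_i - \bar{y}_i\right)\left(y^{(k)}_j - \bar{y}_j\right),$$
if $\bar{y}$ denotes the weighted mean ($\sum_k w^{(k)} = 1$, $w^{(k)} \geq 0 \ \forall k = 1 \dots M$)
$$\bar{y}_i = \sum_{k} w^{(k)} y^{(k)}_i.$$
By substituting the expression for $\alpha^{(k)}$ into the sum of Eq 8 and Eq 9, and averaging over $k$, we get that the point $\bar{y}$ should fit onto our line.
Thus, we get a valid solution of our error minimization problem if we choose
$$B = \bar{y}, \qquad A - B \parallel m,$$
and calculate the $\alpha^{(k)}$ values according to Eq 12.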
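The claim that the leading eigenvector gives the best-fitting line can be checked numerically. The sketch below uses random synthetic patterns and a hypothetical helper `weighted_line_error`. It assumes that the error is the weighted sum of squared orthogonal distances from a line through the weighted mean, and that $\alpha^{(k)}$ is the least-squares projection coefficient for $B = \bar{y}$ and $A - B = m$.

```python
import numpy as np


def weighted_line_error(Y, w, origin, direction):
    """Weighted sum of squared distances of the rows of Y from a line."""
    d = direction / np.linalg.norm(direction)
    D = Y - origin
    residual = D - np.outer(D @ d, d)      # component orthogonal to the line
    return float(w @ (residual ** 2).sum(axis=1))


rng = np.random.default_rng(0)
Y = rng.random((200, 24))
Y /= Y.sum(axis=1, keepdims=True)          # rows sum to 1, like activity patterns
w = rng.random(200)
w /= w.sum()                               # weights sum to 1

mean = w @ Y
D = Y - mean
C = (D * w[:, None]).T @ D                 # weighted covariance matrix
m = np.linalg.eigh(C)[1][:, -1]            # eigenvector of the largest eigenvalue

best = weighted_line_error(Y, w, mean, m)
others = [
    weighted_line_error(Y, w, mean, rng.normal(size=24)) for _ in range(1000)
]
print(best <= min(others))                 # no random direction fits better

# Projection coefficient for the choice B = mean, A - B = m.
B = mean
A = mean + m
alpha = (Y - B) @ (A - B) / ((A - B) @ (A - B))
print(np.allclose(alpha, D @ m))           # same as projecting onto unit m
```

The comparison against random directions is only a sanity check; the eigenvector result itself follows from the error being the trace of $C$ minus the variance along the chosen direction.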