Nowcasting commodity prices using social media

ABSTRACT

Gathering up-to-date information on food prices is critical in developing regions, as it allows policymakers and development practitioners to rely on accurate data on food security. This study explores the feasibility of utilizing social media as a new data source for predicting the food security landscape in developing countries. Through a case study of Indonesia, we developed a nowcast model that monitors mentions of food prices on Twitter and forecasts daily price fluctuations of four major food commodities: beef, chicken, onion, and chilli. A longitudinal test over 15 months of data demonstrates that the proposed model not only predicts food prices accurately but is also resilient to data scarcity. The high accuracy of the nowcast model is attributed to the observed trend that the volume of tweets mentioning food prices tends to increase on days when food prices change sharply. We discuss factors that affect the veracity of price quotations, such as social network-wide sensitivity and user influence.


The ability to rapidly monitor food price fluctuations is critical to government institutions, production companies, and investment banks for making agile policy decisions and managing risks (Cavallo, 2013; Shaun and Lauren, 2014). The demand for data has increased in a hyperconnected world, where countries, markets, and people affect one another in a complex manner (Pentland, 2014). However, not all countries have the capability to monitor high-resolution commodity price data. Some developing countries publish official commodity price data at a slower rate, sometimes monthly or quarterly. This significant delay in releasing economic indicators is largely due to the lack of infrastructure to gather market data (Aizenman and Marion, 1999). In fact, political and financial reasons have hindered a few countries from publishing consumer price indexes for several decades (Grosh and Glewwe, 2000). Nonetheless, because commodity prices, and food insecurity in particular, are extremely dynamic in developing regions, the ability to track market status quickly and to predict food commodity price trends is an all the more critical challenge (Gouel, 2013).

Remarkable progress has been made over the last decade in acquiring market data. The first driver is access to new technology. According to the International Telecommunication Union, there are more than 7 billion mobile cellular subscriptions in the world, corresponding to a global penetration rate of 97%. Such technology enables developing countries to attain a level of financial data access that until recently was only possible in more developed economies. The second driver is the set of innovative proposals and methods that fill information gaps and track economic data better in places where standard approaches cannot be easily applied. For instance, price indexes constructed from the Web (such as online shopping sites that directly cite commodity prices) can produce alternative inflation estimates (Cavallo, 2013). Crowdsourcing is another such approach, where price quotations reported by individuals are collected and analyzed in initiatives like Premise (2016). In Nigeria and India, microeconomic databases of consumer goods were successively built by combining scrapers for online e-commerce data with crowd-sourced data via mobile applications (Liz, 2013). Price collectors in this system comprise retailers and non-professional volunteers, who receive compensation in various forms of rewards like money and communication credit. The World Bank has also conducted a pilot study for crowd-sourced price data collection through mobile phones and non-professional price collectors (Hamadeh et al., 2013). Price data was collected for thirty tightly specified food commodity items on a monthly basis for approximately six months in eight pilot countries.

Manuscript to be reviewed
Computer Science

Recently an alternative source of information has become widely available as a new economic signal (Pappalardo et al., 2016). User-generated data from various online social network services (OSNs) have been a source of indicative signals for predicting various societal phenomena including human behavior in crisis situations (Vieweg et al., 2015), economic market changes (Bollen et [...] has several benefits. First, social network signals are less costly than crowdsourcing because there is no need to reward individuals who generate data (Simula, 2013). Second, the continuous nature of OSN data allows for near real-time monitoring, or what is called nowcasting (Giannone et al., 2008).

Designing a nowcast model for commodity prices, however, is a complex problem. This is because the task needs to produce accurate estimates of the official commodity prices, provide early warning signals of unexpected spikes in the real world, and adapt to a variety of commodities for wider applicability (Lampos and Cristianini, 2012). These goals are harder to achieve in developing countries, where economic status is volatile and social media is less widely used. Nonetheless, rapidly expanding Web infrastructure, supported by humanitarian projects that provide free Internet in rural areas such as Internet.org (Facebook, 2016), is being observed in many developing countries (Ali, 2011), and social media data can hence serve as an additional, non-invasive measurement method for those regions.

This paper presents a case study of adopting micro-blogging platform signals on Twitter as an additional data source for building a food price nowcast model in Indonesia. This research was initiated by the government of Indonesia as part of its effort to combine and adopt different sources of information to produce highly credible market statistics. Four critical food commodities (beef, chicken, onion, and chilli) were chosen as the first set of items to be tracked based on national food security priorities and data availability. Twitter was chosen as a data source because of its popularity within the country; Indonesia has one of the highest adoption rates in the world for Twitter, both in terms of number of users and amount of generated content.

The main goal of this work is to create a nowcast model that reproduces time series of daily prices for the four chosen commodities during a 15-month investigation period between June 2012 and September 2013 based solely on price information from tweets. This main goal is pursued through three specific aims. First, the model should provide price time series that correlate highly with real-world price trends. We evaluate this using the Pearson correlation coefficient between the official and predicted price time series. Second, the model should estimate the absolute price value with minimal error at a daily scale. We evaluate this using the mean absolute percentage error (MAPE), which measures the magnitude of error between the official and predicted price time series. Third, the model should be capable of nowcasting food prices, which is defined as capturing information on a real-time basis within a short time gap, typically within a single day. To check the feasibility of using the model as a daily price predictor, we conduct an additional evaluation using the cross-correlation coefficient (CCF), which estimates how the official and predicted time series are related at different time lags. We show that the predicted time series have the highest correlation at a lag within the timeframe of a single day; we therefore conclude that the price time series produced by the model can be used for nowcasting.
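The three evaluation measures named above can be illustrated with a short sketch. This is not the authors' evaluation code: the function names, the `max_lag` parameter, and the choice to compute the lagged correlation as a Pearson coefficient on shifted series are assumptions made for illustration.

```python
import numpy as np

def pearson_r(official, predicted):
    """Pearson correlation between official and predicted price series."""
    official = np.asarray(official, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.corrcoef(official, predicted)[0, 1])

def mape(official, predicted):
    """Mean absolute percentage error, in percent, relative to official prices."""
    official = np.asarray(official, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((official - predicted) / official)) * 100.0)

def cross_correlation(official, predicted, max_lag=3):
    """Correlation between official[t] and predicted[t + lag] for each lag in days.

    Assumed convention: a positive lag means the predicted series trails
    the official series by that many days.
    """
    official = np.asarray(official, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = len(official)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = official[: n - lag], predicted[lag:]
        else:
            a, b = official[-lag:], predicted[: n + lag]
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result
```

Under this convention, a model suited to nowcasting would show its largest cross-correlation at or very near lag 0.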

A two-step algorithm is proposed in this research. In the first step, a keyword filter is used to extract tweets mentioning price quotations of the four food commodities from the entire corpus of tweets that were generated from Indonesia between June 2012 and September 2013, a timeframe of 15 months. A numerical model parameter is also used to filter the tweets to ensure that the tweet price does not exceed a maximum allowable daily percentage price change (computed based on historical rates). The keyword and numerical filters extracted 41,761 relevant tweets from the data. In the second step, a statistical model, using OSN data, is built to accurately estimate food prices for each commodity in order to assist with the official statistics publicized by the Indonesian government. The nowcast model produces estimates of commodity prices that have a high correlation with official food price statistics over the timeframe covered and shows better prediction performance than existing algorithms. This paper also describes the effect of several important social network-wide variables, by testing the robustness of the model under data scarcity conditions and by modeling user-level credibility to suggest an enhanced sampling strategy.

This research finds that Indonesians do tweet about food prices, and that those prices closely approx- [...] nowcasting-food-prices.

Data collection

Indonesia is a good testbed for this study for two reasons. First, reliable ground-truth data is available on [...]

Tweets were collected through firehose access to Twitter, which returns a complete set of data. We screened for price mentions over the 15 months between June 2012 and September 2013. A taxonomy of keywords and phrases in Bahasa (i.e., the official language of Indonesia) was developed and used. The full taxonomy is mostly composed of commodity names, prices, and units (Table 1).

As a result, a total of 78,518 tweets from 28,800 accounts are collected over the 15-month period.
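A keyword screen of the kind described in this section might look like the following sketch. The keyword lists, the `Rp` price pattern, and the function name `extract_quote` are hypothetical stand-ins: the paper's actual taxonomy (Table 1) is not reproduced here, and the rule of skipping tweets that name more than one commodity is an assumption used to keep each quotation unambiguous.

```python
import re

# Hypothetical mini-taxonomy for illustration only; not the paper's Table 1.
COMMODITY_TERMS = {
    "beef": ["daging sapi"],
    "chicken": ["daging ayam", "ayam"],
    "onion": ["bawang merah", "bawang"],
    "chilli": ["cabai", "cabe"],
}
PRICE_WORD = "harga"  # assumed keyword signalling a price mention

# Assumed price format: "Rp 90.000", "Rp90,000" or "rp 90000".
PRICE_PATTERN = re.compile(r"rp\.?\s*(\d{1,3}(?:[.,]\d{3})+|\d+)", re.IGNORECASE)

def extract_quote(text):
    """Return (commodity, price) for a tweet that quotes one commodity price, else None."""
    lowered = text.lower()
    if PRICE_WORD not in lowered:
        return None
    matched = [
        name
        for name, terms in COMMODITY_TERMS.items()
        if any(term in lowered for term in terms)
    ]
    # Skip tweets naming zero or several commodities (ambiguous quotations).
    if len(matched) != 1:
        return None
    found = PRICE_PATTERN.search(lowered)
    if found is None:
        return None
    # Strip thousands separators before converting to an integer amount.
    price = int(re.sub(r"[.,]", "", found.group(1)))
    return matched[0], price
```

For example, `extract_quote("Harga daging sapi hari ini Rp 90.000 per kg")` yields a beef quotation, while a tweet naming both chicken and chilli is skipped.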

Below is an example tweet mentioning beef price and its translation in English: [...]

Tweet data contain noisy information and need to be cleaned prior to analysis. We employed the following measures in data cleaning. The first involves removing ambiguity in meaning. An obvious case of ambiguity arises when a single tweet quotes the price of two or more commodity items. Such cases occur in 5% (2,607 instances) of the entire price quotation data and were removed in advance of further investigation.

Another case of ambiguity arises when the mentioned price is in relative terms, not in absolute terms (e.g., "price increased by X amount"). For instance, the word 'naik' in Indonesian means 'increase (up to)' or 'by'. Our data show that price quotations containing the word 'naik' resulted in extremely small price ranges compared to the rest of the data. Hence, we removed tweet data containing this word, which accounted for 8% of the data.

Another important data cleaning task focuses on removing redundant messages or spam bots. Certain bot accounts can be identified based on their large quantity of duplicated tweets. We assume accounts that posted more than 100 tweets with over 80% of duplicated messages are bots. Table 2 shows the list of [...]

[...] agriculture depends on machinery and transportation (Richard, 2011)). We find that people post more tweets during price-rising periods compared to price-decreasing periods. This tendency is more apparent with food commodities that have volatile price fluctuations and a smaller total volume of tweets: onion receives on average 2.8 times more tweets when prices are rising compared to price-decreasing periods.

[...] in July 2012 based on our tweet data, which should be considered as outliers. Second, such outliers lead to an overall poor quality of price prediction measured by the mean absolute percentage error.

Simply eliminating outliers would yield a large reduction in prediction error. Therefore, devising a filter to eliminate unnecessary noise and find meaningful signals from the dataset is critical for price prediction.

[...] price. In the model we assume market prices are non-stationary time series; this is consistent with the assumption that has been made in relevant studies (Leuthold, 1972; Working, 1934). We further consider the Markov process for price dynamics as assumed in (Zhang, 2004; Ghasemi et al., 2007). Hence, let today's price $P_t$ be determined both by yesterday's price $P_{t-1}$ as well as today's price quotations from Twitter $P^{tweet}_t$. The weighting factors in Eq. 1, α and β, represent the relative importance of these two quantities on today's price. The model would then respond to the current market quotes faster when β is larger than α, in which case a larger degree of price fluctuation is expected.

Furthermore, we assume that daily food prices do not change radically. The maximum change in commodity price that we observe from historical data is marginal for most days. For instance, the largest deviation seen for the beef price was a change of 2.5% from one day to another on Aug 16th 2012. This observation leads us to assume that prices of a commodity on a given day and the consecutive day would be within certain bounds. This is modeled as a variable δ defining the maximum allowable price change rate. Any social signals that exceed this change limit from one day to another will be eliminated from tweets. We set the starting price $P_0$ as the commodity price on the first observation date.
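The update rule and the δ filter described above can be sketched in a few lines. Since Eq. 1 is not reproduced in this text, the linear form $P_t = \alpha P_{t-1} + \beta P^{tweet}_t$ used here is an assumption, as are the use of the mean of accepted quotations as $P^{tweet}_t$, the carry-forward of yesterday's price on days without accepted tweets, and the function name `nowcast`.

```python
def nowcast(start_price, daily_quotes, alpha=0.5, beta=0.5, delta=0.025):
    """Sketch of a daily price nowcast from tweet price quotations.

    start_price  -- P_0, the commodity price on the first observation date
    daily_quotes -- one list of quoted prices per day (may be empty)
    alpha, beta  -- assumed weights on yesterday's price and today's tweets
    delta        -- maximum allowable day-to-day price change rate
    """
    prices = [float(start_price)]
    for quotes in daily_quotes:
        prev = prices[-1]
        # Numerical filter: drop quotes that deviate from yesterday's
        # estimate by more than the allowable change rate delta.
        accepted = [q for q in quotes if abs(q - prev) / prev <= delta]
        if accepted:
            tweet_price = sum(accepted) / len(accepted)
            today = alpha * prev + beta * tweet_price
        else:
            # Assumption: with no usable quotes, keep yesterday's estimate.
            today = prev
        prices.append(today)
    return prices
```

With `alpha = beta = 0.5` and `delta = 0.025`, a day with quotes of 102 and 150 against a previous estimate of 100 keeps only the first quote and moves the estimate to 101. A larger `beta` makes the estimate track tweet quotations more quickly, in line with the description above.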
where no tweets over n days [...]

In determining δ, a parameter that determines which tweets are accepted or ignored in the model, we examine the price change dynamics from historical records. Beef price changed gradually, with a maximum price change of no more than 2.5% from one day to the next, whereas onion showed a rapid change in price with a maximum change rate of 15.1% from one day to another. This means that the daily allowable change rate should be set higher for onion compared to beef. We set δ by training with a [...]

Table 3. Prediction performance comparison between the models

Table 3 shows the result for both the correlation and the absolute error. Again, IQR and KDE do not [...] can conclude that the suggested model is capable of nowcasting daily food prices in Indonesia.

This study shares insights into building an affordable and efficient platform to complement offline surveys on food price monitoring. The market data gathered through social media help to predict economic signals and assist food security decisions. Price quotations in social media are a new type of information that needs extensive cleaning before usage. A naive statistical filtering method is no longer effective, because the price distribution is not normal and contains various noise elements, as shown in Figure 1B.

The proposed nowcast model attains acceptable performance with a simple filtering method that does not rely on sophisticated natural language processing techniques. In applying the suggested model to other languages, a taxonomy of keywords related to commodity names and prices would need to be identified. Our model has minimal language dependency, and no grammatical considerations are required. Its filter operates via keyword extraction and numerical analysis based on the characteristics of the Twitter data. The model can also handle data sparsity; this quality is important given that people do not always mention prices on social media.

The nowcast model, which is tested successfully on four main food commodities in Indonesia, can be [...]

[...] day-to-day price fluctuations. If there are not enough tweets mentioning food prices, algorithms like nowcast will face a data scarcity problem. In fact, data shortage can be witnessed in the historical data. Tweets that mention food prices occupy no more than 0.07% of the entire tweet dataset in Indonesia, and users on average post no more than a few tweets a year on such a topic (2.7 tweets over 15 months).

Here we check the robustness of the algorithm under extreme challenges involving noise and lack of data with the least mentioned commodity, chilli. Out of the entire 484-day observation period, chilli was not mentioned at all on 312 days and was mentioned fewer than three times on 87 days. To test the robustness of the nowcast algorithm under data scarcity, a random set of chilli-related tweets accounting for 10% to 80% of the total is removed and the price is predicted with only the remaining data. For each simulation, data elimination is repeated 50 times and the averaged performance results are reported for comparison.
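The deletion experiment described above could be organized along the following lines. This is a sketch rather than the authors' procedure: `predict` and `score` are hypothetical callables standing in for the nowcast model and the Pearson-correlation evaluation, and the deletion ratios in steps of 10% and the fixed random seed are assumptions.

```python
import random
import statistics

def scarcity_test(tweets, predict, score,
                  ratios=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
                  repeats=50, seed=0):
    """Average model score after randomly deleting a share of the tweets.

    tweets  -- list of tweet records for one commodity
    predict -- hypothetical callable: remaining tweets -> predicted series
    score   -- hypothetical callable: predicted series -> quality (e.g. r)
    ratios  -- shares of tweets to delete in each simulation
    repeats -- number of random deletions averaged per ratio
    """
    rng = random.Random(seed)
    results = {}
    for ratio in ratios:
        keep = len(tweets) - int(round(ratio * len(tweets)))
        scores = []
        for _ in range(repeats):
            # Sampling the tweets to keep is equivalent to deleting the rest.
            # The sample is unordered, so predict should re-sort by date.
            remaining = rng.sample(tweets, keep)
            scores.append(score(predict(remaining)))
        results[ratio] = statistics.mean(scores)
    return results
```

Plotting the returned averages against the deletion ratio gives a curve of the same kind as the one discussed for chilli.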
Figure 6 shows the prediction quality r (Pearson's correlation) as a function of the data deletion ratio. We find the trend forecasting to remain relatively stable until a moderate level of data deletion; the r value is degraded by no more than 20% until 40% of data is eliminated. The r value starts to decrease more rapidly after this point, although still reaching a correlation of above 50% until 65% of data is eliminated. This [...] changes; people tend to post more tweets during periods of price inflation than price deflation (Fig. 3).

This tendency is more apparent for food commodities that often experience volatile price fluctuations. For instance, onion receives on average 11.3 times more tweets upon price inflation than price deflation. Tweet volume is directly related to the richness of the data source for the nowcast model, and hence its performance depends on price trends. The partial correlation between the price change rate and the model error after controlling for tweet volume is considerably lower (r=-0.27).

[...] indicating that accounts with more followers mentioned more accurate food prices (Fig. 7B). Furthermore,
