Analysing Social Media Forums to Discover Potential Causes of Phasic Shifts in Cryptocurrency Price Series

The recent extreme volatility in cryptocurrency prices occurred in the setting of popular social media forums devoted to the discussion of cryptocurrencies. We develop a framework that discovers potential causes of phasic shifts in the price movement captured by social media discussions. This draws on principles developed in healthcare epidemiology where, similarly, only observational data are available. Such causes may have a major, one-off effect, or recurring effects on the trend in the price series. We find a one-off effect of regulatory bans on bitcoin, the repeated effects of rival innovations on ether and the influence of technical traders, captured through discussion of market price, on both cryptocurrencies. The results for Bitcoin differ from Ethereum, which is consistent with the observed differences in the timing of the highest price and the price phases. This framework could be applied to a wide range of cryptocurrency price series where there exists a relevant social media text source. Identified causes with a recurring effect may have value in predictive modelling, whilst one-off causes may provide insight into unpredictable black swan events that can have a major impact on a system.

INTRODUCTION
Social media discussion forums involve hundreds of thousands of subscribers (Comben and Rivet, 2019; Knittel and Wash, 2019) and, as in the case of Reddit subreddits, may use moderators to ensure focus on a specified theme (r/Bitcoin, 2019; r/ethereum, 2019). In this paper, we introduce a framework for analysing the association between changes in social media discussions and shifts in the movement of a related cryptocurrency price series. We evaluate this framework through the insights it provides when applied to bitcoin and ether prices across 2017-18. For cryptocurrencies, social media discussions are particularly relevant (ConsenSys Media, 2019; Revealing Reality, 2019) and, during 2017-18, changes in their price movement were particularly extreme (see Figure 1). Potential causes of shifts in the price series are discovered in social media discussions that either have a one-off, major effect, including unpredictable "black swan" (Taleb, 2010) events, or have a consistently recurring effect on price.
If an event occurs as price changes, that event could be driving the change in price, but a reasonable alternative explanation is that the event is in response to the change in price. To exclude the latter possibility, cause must come before effect as the future cannot affect the past (Bradford Hill, 1965; Granger, 1980; Ioannidis, 2016). Hence, the event must precede the price change, and such events, therefore, may be predictive. Previous literature has focussed on models to predict the cryptocurrency price. For instance, seven studies have found a higher Google search [...]
FIGURE 1 | [...] Ether series is in light green and the price is given by the right axis. The horizontal line represents the identified support or resistance price level which was 400 US Dollars for ether and 6,000 US Dollars for bitcoin. The labelled dates on the x-axis are dates where there was a bitcoin or ether local maxima or minima, or where the horizontal line was breached.
However, establishing a predictive relationship does not prove a causal link because of "confounding bias" (Pearl et al., 2016). That is to say if one event occurs before another, both may be the symptoms of a third factor changing (Pearl et al., 2016) or there may have been a catalyst unique to that dataset without which the causal link ceases (Rothman, 2017). For example, higher Google search volumes may occur before higher prices because positive news events drove people to both search on the internet to find out more and to buy the cryptocurrency (Kristoufek, 2013; Liu and Tsyvinski, 2018). Hence, Kristoufek (2013) established that a positive correlation relied on including days in the dataset when the price was high and positive news events common. However, negative news items could also lead people to Google search but instead result in lower prices, resulting in a negative correlation. Consistent with this, Garcia et al. (2014) found a negative correlation and Urquhart (2018) no predictive association between Google searches and price. Confounding bias remains an issue even when applying non-parametric approaches to learning causal networks (Maathuis et al., 2009; Runge et al., 2019); to construct these networks, assumptions are also required regarding the conditional independence between variables (Dablander and Hinne, 2019).
Ideally, experiments would be carried out to reduce the risk of confounding bias (Pearl et al., 2016; Rosenbaum, 2017), but for cryptocurrencies we have only observational data. Although observational data cannot prove that a candidate caused a change, they can provide evidence that favours this explanation over confounding bias (Pearl et al., 2016; Rosenbaum, 2017). It is in this context that healthcare epidemiologists often operate to find the underlying causes of disease, as, for instance, with the link between smoking and lung cancer (Cornfield et al., 1959; Rosenbaum, 2017).
Our approach (see Figure 2) is to filter words from social media text, group words of similar meaning to identify the underlying concepts, and then to apply quantitative causality criteria. We then examine the context of the delineated concepts and evaluate the coherence of suggested causal links with known facts (Bradford Hill, 1965). Healthcare epidemiology literature suggests two distinct approaches to constructing the quantitative causality criteria.
The first approach uses the strength of the association to support a causal link (Bradford Hill, 1965; Rosenbaum, 2017). The larger the increase in the candidate cause and the greater the effect, the more any third, unconsidered, "confounding" variable would have to affect both for the association to be spurious and not indicative of a causal relationship (Cornfield et al., 1959; Grimes and Schulz, 2002; Rosenbaum, 2017). This is applicable to identifying rare, unpredictable black swan events that have a one-off influence on a single, major phasic shift in the price series. In the "mono-phase" analysis (see Figure 2) we focus on the major change in the price series which is the shift in movement from the phase of rising prices before to the phase of falling prices after the all time high price. We filter for words that were statistically significantly higher in frequency in the latter phase of falling values. The causality criteria used are: frequency is more than three-fold higher (Grimes and Schulz, 2002) in the phase of falling prices than the phase of rising prices, and frequency is higher within the 24 h before the maximum price. We use a cut-off that the concept must be more than three-fold higher in frequency to reduce the risk that the detected association is spurious. This is consistent with recommendations in the epidemiology literature regarding the definition of what constitutes "strong support for causation" (Grimes and Schulz, 2002).
FIGURE 2 | The causality framework. This evaluates evidence for or against an event and/or concern on social media having an impact on price. The framework begins in the box labelled "Data Preparation." The mono-phase analysis follows the route on the left and the multi-phase analysis follows the route on the right; differences in approach are indicated by coloured text. The process terminates in the box labelled "Coherence with Known Facts".
The alternative approach places value in relationships that consistently recur despite a changing context (Bradford Hill, 1965; Ioannidis, 2016). The more an observed association recurs across different contexts, the more likely any unobserved variables would have changed in value and impact, and so the less likely that the observed association is due to some unobserved variable driving both candidate cause and effect. This approach can detect potential causes with a recurring effect on the price series. In the "multi-phase" analysis (see Figure 2), we filter for words where daily frequency was statistically significantly different comparing all phases of rising values with all phases of falling values. A concept captured a potential recurring cause of rising values if its frequency was higher in every phase of rising values compared with the previous phase and higher within the 24 h before each phase of rising values. Concepts reflecting potential causes of falling values have a higher frequency in every phase of falling values compared with the previous phase and a higher frequency within the 24 h before each phase of falling values.
Our results support the existence of both causes with a one-off effect, which could be attributable to black swan events, and causes with a consistently recurring effect on price. Most of the causes differed between bitcoin and ether, which is consistent with the difference in timing of the phases and all time high price (see Figure 1), and their different functions (Burnie et al., 2018).

MATERIALS AND METHODS
An overview of the methodology is provided in Figure 2. For Ethereum, the largest (Comben and Rivet, 2019) subreddit "r/ethereum" had 436,000 subscribers on 14 May 2019 (r/ethereum, 2019) and was moderated by Vitalik Buterin, the "Creator of Ethereum" (Alvarez, 2018). Following this forum's guidelines (r/ethereum, 2019), its text was combined with that from "r/ethtrader" and "r/EtherMining." Together, these had the most submissions containing the term "ether" or "eth" among Ethereum-specific subreddits (Baumgartner, 2019) and have collectively been described as the most important subreddits (Comben and Rivet, 2019). For Bitcoin, we used subreddit "r/Bitcoin," which has been recommended because of the number and activity of its users compared with alternative online communities (Knittel and Wash, 2019); this community had over 1.1 million subscribers as of 18:54 (UTC) on 23 August 2019 (r/Bitcoin, 2019).

Dividing the Price Series Into Phases
The price data was divided into phases using local maxima and minima to define the boundaries. A date represented a local maximum if the price was higher than on any other date 28 days (4 weeks) before and after. That date was a local minimum if the price was instead lower than on any other date 28 days before and after. Phases terminating just before a local maximum were rising price phases, those ending just before a local minimum were falling price phases. Sometimes there were several consecutive minima with the last value being the lowest; we ignored all such minima except the last, lowest value.
The length of the window was specified at 28 days before and after because a longer window risked merging rising and falling price phases. For example, examining bitcoin, the 28-day window delineated a phase where bitcoin prices fell 65% from the all time high price on 16 December 2017 to 5 February 2018 (see Figure 1 and Figure S1). Doubling the length of this window to 56 days would have enlarged this phase of price movement to include the subsequent 70% increase in prices from 5 February 2018 to 5 March 2018. Using shorter time windows would have reduced the size of the price phases, limiting the amount of data available when applying Wilcoxon Rank-Sum Tests to filter words in the mono-phase analysis (described in section 2.2.1). This would have reduced the power of such tests (Bridge and Sawilowsky, 1999).
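The local-extremum rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function name `local_extrema`, the use of a plain list of daily prices, and the handling of days near the ends of the series (compared only with the neighbours that exist) are all assumptions.

```python
def local_extrema(prices, window=28):
    """Return (maxima, minima) as lists of day indices.

    A day is a local maximum if its price is higher than on every other
    day within `window` days before and after, and a local minimum in the
    mirror case. Days near the ends of the series are compared with the
    neighbours that exist (an assumption; the source does not specify
    edge handling).
    """
    maxima, minima = [], []
    for i, price in enumerate(prices):
        start = max(0, i - window)
        stop = min(len(prices), i + window + 1)
        others = prices[start:i] + prices[i + 1:stop]
        if not others:
            continue
        if price > max(others):
            maxima.append(i)
        elif price < min(others):
            minima.append(i)
    return maxima, minima


# Toy series (hypothetical data): 40 rising days followed by 39 falling days.
toy_prices = list(range(40)) + list(range(38, -1, -1))
toy_maxima, toy_minima = local_extrema(toy_prices)
```

Widening `window` to 56 in such a sketch would reproduce the merging behaviour described above, since fewer days would qualify as extrema.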
As bitcoin prices rose across 2017, there were brief phases where bitcoin prices reversed upon reaching round values. This occurred at 1,000 US Dollars (1285.14 to 941.92 from 3 to 24 March 2017); 3,000 US Dollars (2961.83 to 1931.21 from 11 June to 16 July 2017); and 5,000 US Dollars (4911.74 to 3319.63 from 1-14 September 2017). Traders sell at round values that represent a large return on their investment to prevent losing this return to subsequent volatility, even if their view of the cryptocurrency is unchanged (Chen, 2018). Therefore, we incorporated these phases into the overall rising price phase.
When technical traders believe that a certain price level is a support or resistance level, they will buy (pushing prices up) as prices fall to that support level and sell (pushing prices down) as prices rise to that resistance level (Murphy, 2019). When prices approach a round-valued price this can drive reversals in trend even if opinion of the cryptocurrency is otherwise unchanged (Shiller, 2000; Westerhoff, 2003; Aggarwal and Lucey, 2007; Dowling et al., 2016). Phases where the connection between price and non-price events and concerns is weak were excluded.
In 2017, the ether price rose to 394.66 US Dollars (12 June), fell to near 150 US Dollars (155.42 US Dollars, 16 July 2017), then rose again to 391.42 US Dollars (1 September 2017) (Figure 1). This supports a 400 US Dollar price resistance level identified by the media at the time (Bamburic, 2017;Wilmoth, 2017). Hence, we remove from analysis the phase from 12 June (where the barrier was first neared) to before 23 November 2017 (when the barrier was exceeded).
In 2018, the bitcoin price fell to 5908.70 US Dollars (29 June 2018), recovered and tested the barrier again at 6050.94 US Dollars (14 August 2018). Hence the 6,000 US Dollar support level has been described as a "crucial test" (Cuthbertson, 2019). We remove from analysis the phase from 29 June 2018 to before 15 November 2018 (when prices finally fell below the barrier).
After attaining a local minimum in mid-December 2018, neither the bitcoin nor ether price fell further. This point thus marks the end of the 2017-18 price cycle which is the focus of this paper's analyses, and so the last phase of data analysed ends mid-December 2018 for both cryptocurrencies (14 December for Ethereum and 15 December for Bitcoin).

Text Preparation
Reddit submissions were processed as detailed in the Supplementary Methods (see section 1.1), in the Supplementary Data Sheet 1. Table 1 uses examples to illustrate the different datasets generated during the processing of the text. Blank, duplicate and automated submissions were removed, text of synonymous meaning was standardised and text not relating to words was deleted. Each submission was converted from a string of text into a list of words; see columns (A) and (B) in Table 1 for examples.

Measuring Frequency
With each submission represented as a list of words, the number of submissions across a defined time period that contained each word could be counted. This was then divided by the total number of submissions such that the "frequency" or "popularity" of a word was the proportion of submissions across a defined time period that contained that word at least once. Extending to groups containing multiple words, frequency was the proportion of submissions containing at least one word from that group. Daily frequency referred to the proportion of submissions containing a word or a word from a group on each day. Following the sources on price data (Blockchain Luxembourg S. A., 2019; Etherscan, 2019), a "day" was specified to be from 00:00 on a given day to before 00:00 the next date (UTC). Table 1 provides example daily frequency data for the word "bitcoin."
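As a minimal sketch of this definition, assuming each submission is already a list of processed words, frequency can be computed as follows. The function names and the `(day, words)` input layout are hypothetical and chosen only for illustration.

```python
def frequency(submissions, words):
    """Proportion of submissions containing at least one of `words`."""
    if not submissions:
        return 0.0
    targets = set(words)
    hits = sum(1 for tokens in submissions if targets.intersection(tokens))
    return hits / len(submissions)


def daily_frequency(dated_submissions, words):
    """Map each day to the frequency among that day's submissions.

    `dated_submissions` is an iterable of (day, tokens) pairs, where `day`
    is assumed to be already expressed as a UTC calendar date.
    """
    by_day = {}
    for day, tokens in dated_submissions:
        by_day.setdefault(day, []).append(tokens)
    return {day: frequency(subs, words) for day, subs in by_day.items()}
```

A word repeated within one submission is counted once, which matches the "at least once" wording above.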

Identify Concepts
From the delineated words, concepts were derived that consisted of one or more words that shared a similar meaning. This followed Burnie and Yilmaz (2019a) and used Python packages "gensim" (Řehůřek, 2019) version 3.5.0 and "NetworkX" (NetworkX, 2019) version 2.2. Firstly, word2vec models (Mikolov et al., 2013a,b) were trained using the processed text from all submissions (see section 2.1.3). The trained word2vec model was used to convert each delineated word (found in section 2.2.1) into a numeric vector. A network was constructed where two words were connected only if the cosine similarity between their vectors exceeded a threshold. The cosine similarity between a pair of vectors provided a measure of how similar the pair of words were in meaning (Mikolov et al., 2013a,b). Groups of connected words were merged into single concepts (such as "cardano"/"eo"/"iota"/"rippl"/"stellar"/"tron") whilst words unconnected with any other word ("korea") were treated as concepts consisting of only one word. The optimisation of this methodology followed Burnie and Yilmaz (2019a).
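The grouping step can be illustrated with a small, dependency-free sketch. The paper used word2vec vectors from gensim and a network built with NetworkX; the version below instead takes ready-made vectors and finds connected groups with a plain depth-first search, purely for illustration. The toy vectors and the threshold of 0.8 are hypothetical, not values from the study.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_concepts(vectors, threshold):
    """Group words into concepts.

    Two words are linked if their cosine similarity exceeds `threshold`;
    each connected group of words becomes one concept, and an unlinked
    word becomes a single-word concept.
    """
    neighbours = {word: set() for word in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, concepts = set(), []
    for start in vectors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first search over linked words
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(neighbours[word] - component)
        seen |= component
        concepts.append(sorted(component))
    return concepts


# Hypothetical two-dimensional vectors standing in for word2vec output.
toy_vectors = {"cardano": [1.0, 0.1], "tron": [0.9, 0.2], "korea": [0.0, 1.0]}
toy_concepts = group_concepts(toy_vectors, threshold=0.8)
```

Because linkage is transitive through the network, two words can share a concept without being directly similar, as long as a chain of similar words connects them.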

Apply Causality Criteria: Strength and Cause Before Effect
Mono-phase concepts were more than three-fold higher in popularity (Grimes and Schulz, 2002) across the phase after the all time high price compared with the phase before, and increased in frequency before the shift in phase. To determine if frequency rose before the shift, we examined 1, 2, 3 h, and so on, up to 24 h before the shift and evaluated whether the proportion of submissions containing the concept within any of these windows was higher compared with all the submissions in the same phase but before that window.
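The two mono-phase criteria can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function names and the input layout (hours between each submission and the shift, plus a parallel list of booleans marking whether the concept appears) are assumptions.

```python
def rose_before_shift(hours_before_shift, contains_concept, max_hours=24):
    """Check the cause-before-effect criterion.

    For each window of 1, 2, ..., `max_hours` hours before the shift,
    compare the proportion of submissions containing the concept inside
    the window with the proportion among earlier submissions in the same
    phase. Return True if any window shows a higher proportion.
    """
    pairs = list(zip(hours_before_shift, contains_concept))
    for hours in range(1, max_hours + 1):
        inside = [flag for t, flag in pairs if t <= hours]
        earlier = [flag for t, flag in pairs if t > hours]
        if inside and earlier:
            if sum(inside) / len(inside) > sum(earlier) / len(earlier):
                return True
    return False


def is_mono_phase(freq_before, freq_after, rose_before):
    """Strength criterion (more than three-fold higher) plus timing."""
    return freq_after > 3 * freq_before and rose_before
```

For example, a concept present in 1% of submissions before the all time high and 4% afterwards passes the strength criterion, whereas a rise from 1% to 2% does not.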

Filter Words
Two-tailed Wilcoxon Rank-Sum Tests (SciPy package version 1.1.0) and a Bonferroni-corrected p-value threshold of 1% were applied to extract those words where the daily word frequency tended to be higher or lower comparing all phases where prices rose with all phases where prices fell. Prior to this, extremely rare words in 100 or fewer submissions were removed.
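To make the filtering step concrete, the sketch below re-implements a two-tailed rank-sum test with the usual normal approximation using only the Python standard library, then applies a Bonferroni-corrected threshold. The paper itself used SciPy; this stand-in, the function names, and the dictionary layout (word mapped to a list of daily frequencies) are assumptions made for illustration.

```python
import math


def rank_sum_p(x, y):
    """Two-tailed rank-sum p-value (normal approximation, tied values
    receive their average rank)."""
    pooled = sorted((value, idx) for idx, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    k = 0
    while k < len(pooled):
        j = k
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[k][0]:
            j += 1
        average_rank = (k + j) / 2 + 1  # 1-based average over the tie block
        for m in range(k, j + 1):
            ranks[pooled[m][1]] = average_rank
        k = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    spread = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / spread
    return math.erfc(abs(z) / math.sqrt(2))


def filter_words(rising, falling, alpha=0.01):
    """Keep words whose daily frequencies differ between rising and
    falling phases at a Bonferroni-corrected threshold."""
    threshold = alpha / len(rising)  # one test per word
    return [w for w in rising if rank_sum_p(rising[w], falling[w]) < threshold]
```

Dividing the 1% level by the number of words tested is the standard Bonferroni correction; with many thousands of candidate words the per-word threshold becomes very small, which is why longer phases (more daily observations) help the test retain power.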

Identify Concepts
Words more frequent as prices rose were split from those more popular as prices fell. As in section 2.2.2, each set of words was converted into a set of concepts: "rising-price concepts" consisted of words higher in frequency as prices rose and "falling-price concepts" consisted of words more frequent as prices fell.

Apply Causality Criteria: Consistency and Cause Before Effect
Rising-price, multi-phase concepts were rising-price concepts that rose in frequency with every shift to rising prices and within the 24 h before every shift to rising prices. Falling-price, multi-phase concepts were falling-price concepts that rose in frequency with every shift to falling prices and within the 24 h before every shift to falling prices. We removed from the analysis any concept that consistently rose in popularity across every shift in price, independent of whether prices were rising or falling, as any rise in popularity could have been an artefact of the long-term trend.
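The consistency criteria can be expressed as a short check over all phase shifts. The sketch below is illustrative only: the function name and the per-shift record layout (keys "to", "prev", "curr" and "rose_24h") are hypothetical.

```python
def is_multi_phase(shifts, direction):
    """Apply the multi-phase criteria to one concept.

    `shifts` holds one record per phase shift with keys:
      "to"       - "rising" or "falling", the phase being entered
      "prev"     - concept frequency in the previous phase
      "curr"     - concept frequency in the phase being entered
      "rose_24h" - whether frequency rose within 24 h before the shift
    """
    target = [s for s in shifts if s["to"] == direction]
    consistent = bool(target) and all(
        s["curr"] > s["prev"] and s["rose_24h"] for s in target
    )
    # A concept that rises at every shift, whatever the direction, is
    # treated as a long-term trend artefact and rejected.
    rises_everywhere = all(s["curr"] > s["prev"] for s in shifts)
    return consistent and not rises_everywhere
```

The final condition mirrors the exclusion of concepts whose popularity rose across every shift regardless of price direction.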

Context of Concepts
For each mono-phase and multi-phase concept, we found the top five most common words occurring in submissions containing at least one word from that concept. This excluded words that did not aid in the interpretation of the concept. Further details and a list of words excluded are available in section 1.2 of the Supplementary Methods, in the Supplementary Data Sheet 1.
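A minimal sketch of this context extraction follows, assuming submissions are lists of processed words. The function name is hypothetical, and leaving the concept's own words out of the ranking is an assumption made here rather than a detail stated in the text.

```python
from collections import Counter


def top_context_words(submissions, concept, excluded=(), n=5):
    """Return the `n` words found in the most submissions that also
    contain at least one word from `concept`."""
    concept_words = set(concept)
    skipped = set(excluded)
    counts = Counter()
    for tokens in submissions:
        unique = set(tokens)  # count each word once per submission
        if unique & concept_words:
            counts.update(unique - concept_words - skipped)
    return [word for word, _ in counts.most_common(n)]
```

The `excluded` argument plays the role of the list of uninformative words described in the Supplementary Methods; its contents are not reproduced here.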

Comparison of Bitcoin and Ethereum Price Phases
Both the bitcoin and the ether price rose to an all time high as 2017 became 2018, to then oscillate with an overall decline in value until mid-December 2018 (see Figure 1). There was a disparity in the timing of the all time high price for bitcoin (16 December 2017) and ether (13 January 2018).
It appears that different price levels acted as barriers at different times. Whilst bitcoin prices rose across 2017, ether prices reverted upon nearing 400 US Dollars (Bamburic, 2017; Wilmoth, 2017) (12 June and 1 September 2017), only increasing above this level after five months. Whilst ether prices fell from 5 May to mid-December 2018, bitcoin prices recovered upon falling to 6,000 US Dollars (Cuthbertson, 2019) (29 June and 14 August 2018) and only fell below this level after four months.
Based on local extrema (see Figure S1) and price barriers, we demarcated six phases of price movement with ether and eight with bitcoin (see Table 2). Table 2 further shows which of these phases were used in order to compare daily word frequencies so as to filter words (see sections 2.2.1 and 2.3.1). Descriptive statistics for the different phases are provided in Table S7.

Mono-phase Concepts and Their Context
Ether prices rose 241% (phase 3) to an all time high price on 13 January 2018 before falling 73% (phase 4). Only "feb" met the criteria for a mono-phase concept and was excluded as it reflected the timing of phase 4.
The context of the altcoin group ("cardano"/"eo"/"iota"/"rippl"/"stellar"/"tron") reflected the contexts of each cryptocurrency named. Three of these six cryptocurrencies increased more than three-fold in the proportion of submissions from phase 1 to 2: Cardano rose 721.44%; Tron 562.63%; and Ripple (represented by "rippl") 309.36%. We examined the top five words occurring with each of Cardano, Tron and Ripple and the altcoin group ("cardano"/"eo"/"iota"/"rippl"/"stellar"/"tron") and found in each case they were discussed with: "ethereum," "buy," price ("price" or US Dollars) and another cryptocurrency ("bitcoincash" or "rippl" and "verg" in the case of Tron). Further details are given in Table 4. We also split up the concept "binanc"/"hitbtc" which combines two different cryptocurrency exchanges: Binance and HitBTC. Interest in Binance rose 1327.89% in frequency compared with only 163.55% for HitBTC. The context in which "binanc" was used was similar to the concept "binanc"/"hitbtc," with the top ten words being shared and the top three words having the same ranking ("coinbas," US Dollar mentions and "send"). Further details are given in Table 5.

Multi-Phase Concepts and Their Context
With Bitcoin, two multi-phase concepts were linked to falling prices: "market" and "sale." The top two words occurring with "market" were "price" and US Dollars across each phase of falling prices. The concept "sale" was discussed in a varying context in different phases of falling prices: with "buy[ing]" and "sell[ing]" in phases 2 and 6, "token" sales in phases 4 and 6 and "black" "friday" sales in phase 8 (see Table 6).
With Ethereum, ten multi-phase concepts were identified. Three of these were associated with rising prices: "tax," US Dollars and "hit." "Hit" was discussed with US Dollars (over 40% submissions in each phase of rising prices) and US Dollars were frequently discussed with "bitcoin" (over 15%). The concept "tax" was considered with "gain" (over 30% submissions in each phase of rising prices); "pay" (over 25%); US Dollars (over 24%) and "trade" (over 23%). Further details in Table 7.
The remaining seven multi-phase concepts related to falling ether prices. With the exception of "game," all these could be split into two themes: price ("market" and "bear"/"bearish"/"bull") and innovation ("featur"; "ceo"/"cofound"; "project"/"team"; and "makerdao"/"stablecoin"). In each phase of falling prices, "bear"/"bearish"/"bull" was discussed with "market" (over 45% submissions) and "market" was discussed with US Dollars (over 20%) and "price" (over 18%). Price was discussed in the context of "bitcoin," which was in over 16% "market" submissions. The context of discussions around innovation varied but referred to new "token[s]" in over 10% submissions across all concepts and across all phases of falling prices. The concept "game" was discussed in the context of using gaming machines to mine ether in phase 4 (24.39% submissions) and "play[ing]" games in phase 6 (14.62% submissions). Further details in Table 8.

The Supplementary Results, in the Supplementary Data Sheet 1, provide further detail on the percentage change in popularity for Bitcoin multi-phase concepts (see Table S8) and Ethereum multi-phase concepts (Table S9).

Coherence With Known Facts
Of the Bitcoin mono-phase themes (see Table 3), regulatory bans are the closest to capturing a specific external event. Discussion of "korea" and "minist"/"ministri" occurred with the debate between the Ministry of Finance and Justice in South Korea as to whether a ban on cryptocurrency trading activity should be implemented, with one proposal being that cryptocurrencies are a scam that should be subject to criminal charges (Jaewon, 2017). On 16 December 2017, when prices changed to falling, South Korean news media reported how North Korea was using hacks of South Korean exchanges to fund its regime, encouraging South Korean support for a ban (Harper, 2017). This could have triggered South Koreans to sell bitcoin holdings before this became illegal and possibly even criminal (Jaewon, 2017). Since approximately a fifth of bitcoin transactions were in South Korean Won at the time (Jaewon, 2017), it is coherent with known events that this caused the shift from rising to falling prices. The presence of "india" in 23.64% "minist"/"ministri" submissions may reflect concerns over bitcoin regulation, including rumours of a possible ban in India during phase 2 (Lomas, 2018).
"token[s]" (≥ 20.48%) through Initial Coin Offerings ("ico"; ≥ 16.27%). Mentioned in relation to this was "ceo"/"cofound" ("project" ≥ 11.11% submissions) and "featur" ("project" ≥ 15.51% submissions). A separate innovation theme related to interest in MakerDAO, which was launched in December 2017 enabling holders to exchange their ether for Dai, a decentralised "stablecoin" designed to maintain its value in US Dollars (MakerDAO, 2018). For Ethereum, price discussed in the context of "hit" was supported as causing prices to rise whilst "market" price and sentiment ("bear"/"bearish"/"bull") discourse were associated with price falls (see Tables 7, 8). These discussions happened in the context of "bitcoin" which was a top five co-occurring word throughout. This suggests a source of ether price volatility was traders analysing the ether price and comparing it with bitcoin before buying or selling ether.
The multi-phase concept "market" was identified as a consistent driver for both falling bitcoin prices and falling ether prices. This was discussed in the context of price as well as buying, trading, and selling (see Tables 6, 8). This supports the widespread influence of technical traders who use just price information to make trading decisions on cryptocurrency price series and is consistent with evidence for price barriers at 400 US Dollars for ether and 6,000 US Dollars for bitcoin (see Figure 1).
Including contextual analysis in the framework has shown that some multi-phase concepts were polysemic, being used in a different context in different price phases. In some cases, this could be because the concept is an artefact of distinct themes of discussion each happening to include the polysemic concept. For instance, in the case of Ethereum, "game" was used in the context of using gaming machines to mine ether in phase 4 and of "play[ing]" games in phase 6 (see Table 8). Both include the word "game" but are otherwise distinct issues and so examining the context reveals that "game" is probably a spurious result. In contrast, with Bitcoin, the polysemic concept "sale" became popular in all four phases of falling prices making coincidence less plausible (see Table 6). The concept "sale" was mentioned in terms of "buy[ing]" and "sell[ing]" in phases 2 and 6, a "token" sale in phases 4 and 6 and "black" "friday" sales in phase 8.
For "sale" to be irrelevant to price, distinct, irrelevant themes including "sale" would have to arise at the correct time across four different phases (falling price phases 2, 4, 6, and 8) and within 24 h before each phase to meet the multi-phase concept criteria. A tenable explanation is that "sale" is a general term that captures concern regarding bitcoin before decisions to sell. If holders are concerned about bitcoin, they could be more sensitive to any "sale" of bitcoin (phases 2 and 6); more interested in "token" "sale[s]" to exchange bitcoin for other tokens (phases 4 and 6); and more tempted by "black" "friday" "sale[s]" where bitcoins are exchanged for discounted products or sold to generate cash to buy such products (phase 8). This suggests the concept "sale" may have value as a negative sentiment indicator that warns of future falls in price.
The association of "tax" with rising ether prices could be explained by the timing of phases 3 and 5, which coincided with the end of tax years when "pay[ment]" of "capit[al]" "gain[s]" "tax" becomes due (see Table 7). The end of the tax year in some countries, such as the USA (Kagan, 2019), is on 31 December (phase 3 is from 23 November 2017 to 13 January 2018) but in the UK on 5 April (phase 5 was from 6 April to 5 May 2018) (Frecknall-Hughes, 2016).
TABLE 4 | Top five words occurring with each of Cardano, Tron and Ripple ("rippl") compared with the Bitcoin mono-phase concept "cardano"/"eo"/"iota"/"rippl"/"stellar"/"tron" in phase 2 of the bitcoin price series.
TABLE 5 | Columns: "binanc"/"hitbtc" and "binanc". "Frequency" is the percentage of submissions containing each word, providing the context of the word "binanc" or concept "binanc"/"hitbtc." "DMS" is an abbreviation for "dollarmarkersymbol," used to represent mentions of US Dollars.

DISCUSSION
Our framework identifies plausible causes of the shifts in ether and bitcoin price trends. Approaches from healthcare epidemiology are deployed that facilitate this move from simply observing how word (Burnie and Yilmaz, 2019b) or topic (Burnie and Yilmaz, 2019a) interest changed across phases in price to identifying the potential causes of these phasic shifts. We find that the framework has to accommodate two distinct types of cause: the "multi-phase" that repeatedly cause shifts and the "mono-phase" with a one-off, strong impact. The results for Bitcoin differ from those for Ethereum, which is consistent with the observed differences in the timing of the highest price and the price phases. We identify a one-off effect of regulatory bans on bitcoin, a repeated effect of rival innovations on ether and the influence of technical traders, captured through market price discourse, on both cryptocurrencies. Traders seem to be comparing the prices of different cryptocurrencies: the Ethereum multi-phase concepts discussed with price commonly referred to "bitcoin," and the Bitcoin mono-phase concept covering altcoins ("cardano"/"eo"/"iota"/"rippl"/"stellar"/"tron") was discussed with US Dollars.
Previous social media analyses typically required judgement on which metric was most suitable in extracting insights from the social media text. For instance, this pre-selected metric could be a measure of sentiment or be based on a topic modelling algorithm. It was only after the values of the metric had been found that the price data were considered, in testing the association between changes in the metric and price (Kaminski, 2014; Garcia and Schweitzer, 2015; Georgoula et al., 2015; Matta et al., 2015; Kim et al., 2016, 2017; Abraham et al., 2018; Steinert and Herff, 2018).
We move from causal inference, where judgement is required to pre-select which potential causes and what causal mechanism should be tested (Runge et al., 2019), to causal location, where the best supported causes are located from among social media text. This enables the discovery of new potential causes of price variation which may not have otherwise been considered for testing. None of the potential causes identified (innovation, regulatory bans and technical traders) were suggested by Kim et al. (2017) in a previous analysis of the link between social media topics and bitcoin price. The approach of Kim et al. (2017) required judgement in expanding the list of words within each concept, tested for linear, predictive associations, and did not build a causal argument.
The risk that a concept was spurious was reduced by examining the words within the concept and the words used with that concept, and considering their coherence with known facts (see section 3.4). Concepts containing the word "feb" or the words "christma"/"holiday"/"xmas" were probably spurious, and could be attributed to the time of year as a confounding factor.
Concepts given in bold and grouped into themes (in capitals). "Frequency" is the percentage of submissions containing each word, providing the context of that concept. "DMS" is an abbreviation for "dollarmarkersymbol," used to represent mentions of US Dollars.
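As a rough illustration of this contextual check, the sketch below computes, for a given concept, the percentage of concept-mentioning submissions that also contain each other word, and flags concepts containing the seasonal stems named above. Restricting the denominator to submissions that mention the concept is an assumption, as are the function names and the toy input; this is not the code used in the analysis.

```python
from collections import Counter

# Seasonal stems named in the text as markers of probably spurious concepts.
SEASONAL_STEMS = {"feb", "christma", "holiday", "xmas"}


def context_frequency(submissions, concept):
    """Percentage of concept-mentioning submissions containing each other word.

    `submissions` is a list of token lists (already stemmed); `concept` is an
    iterable of stems that together make up the concept.
    """
    concept = set(concept)
    # Keep only the submissions that mention at least one word of the concept.
    hits = [set(tokens) for tokens in submissions if concept & set(tokens)]
    if not hits:
        return {}
    # Count each context word once per submission, excluding the concept itself.
    counts = Counter(word for tokens in hits for word in tokens - concept)
    return {word: 100.0 * n / len(hits) for word, n in counts.items()}


def looks_seasonal(concept):
    """True if the concept contains any of the seasonal stems."""
    return bool(set(concept) & SEASONAL_STEMS)
```

For example, with the toy input `[["korea", "ban", "price"], ["korea", "ban"], ["binanc", "send"], ["korea", "exchang"]]` and the concept `["korea"]`, the word "ban" appears in two of the three submissions that mention the concept.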
The words within the delineated concepts relating to exchanges ("binanc"/"hitbtc" and "changelli") did not, in themselves, suggest the influence of a confounding factor. However, these concepts were discussed with "send," "transact" and US Dollar references (see Table 3). Hence, contextual analysis suggests that discussions of exchanges were more plausibly a response to fears over bitcoin price, leading to discussion of how best to dispose of bitcoin, rather than a primary cause of falling prices. This contrasts with the concept "korea," which was used with "ban" (Table 3), supporting rumours of a South Korean ban as precipitating the fall from the all-time high price.
Multi-phase concepts may have implications for predictive analysis, since these concepts have a predictive association with price that persists across time. Multi-phase concepts may provide an improvement on sentiment metrics such as VADER that have found social media posts to be positive even during falling prices (Abraham et al., 2018). This extends to polysemic concepts, if their context supports such concepts acting as proxies for positive or, in the case of "sale," negative sentiment. The concept "market" was supported as a consistent driver of falling prices for both bitcoin and ether. However, the other multi-phase concepts differed, suggesting that different predictors may be suitable for different cryptocurrencies.
Predictive modelling faces the limitation of one-off, impactful "mono-phase" events shaping the price trend. These may be considered analogous to "black swan" (Taleb, 2010) events, being unexpected and having a major impact, but they can be rationalised with the benefit of hindsight.
Future work could examine whether black swan events can be found in cryptocurrencies other than Bitcoin and whether such events are shared or unique to a specific cryptocurrency. Better understanding of the causes of shifts between price phases will help investors in diversifying their cryptocurrency investments to reduce risk.

DATA AVAILABILITY STATEMENT
The sources of data are listed in section 2.1.1. The code used to prepare and analyse these data is publicly accessible in a Dryad data repository (Burnie et al., 2019) at: https://datadryad.org/stash/share/__NFDfahKD5bNOQEszg2pYv6OBg_bIOv3bbmntMJObs.

AUTHOR CONTRIBUTIONS
AB processed and analysed the data, and drafted the article. EY and TA provided critical feedback on the article and contributed to the data processing and analysis approaches taken. All authors gave final approval for publication and agree to be held accountable for the work performed therein.

FUNDING
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and Turing award number TU/C/000028. This project was partially funded by the EPSRC Fellowship titled Task Based Information Retrieval (grant reference number EP/P024289/1); BARAC project (EP/P031730/1); and FinTech project (H2020-ICT-2018-2 825215).