Accelerated fluctuation analysis by graphic cards and complex pattern formation in financial markets

The compute unified device architecture is an almost conventional programming approach for managing computations on a graphics processing unit (GPU) as a data-parallel computing device. With a maximum number of 240 cores in combination with a high memory bandwidth, a recent GPU offers resources for computational physics. We apply this technology to methods of fluctuation analysis, which includes determination of the scaling behavior of a stochastic process and the equilibrium autocorrelation function. Additionally, the recently introduced pattern formation conformity (Preis T et al 2008 Europhys. Lett. 82 68005), which quantifies pattern-based complex short-time correlations of a time series, is calculated on a GPU and analyzed in detail. Results are obtained up to 84 times faster than on a current central processing unit core. When we apply this method to high-frequency time series of the German BUND future, we find significant pattern-based correlations on short time scales. Furthermore, an anti-persistent behavior can be found on short time scales. Additionally, we compare the recent GPU generation, which provides a theoretical peak performance of up to roughly 10^12 floating point operations per second, with the previous one.


Introduction
In computer science applications and diverse interdisciplinary fields such as computational physics or quantitative finance, computational power requirements increase monotonically in time. In particular, the history of time series analysis mirrors the need for computational power and, simultaneously, the opportunities arising from its use. Up to the present day, a frequently made simplifying assumption is that price dynamics in financial time series obey random walk statistics, introduced in order to simplify analytic calculations in econophysics and in financial applications. However, such approximations, which are, e.g., used in the famous option pricing model introduced by Black and Scholes [1] in 1973, neglect the real nature of financial market observables, and a large number of empirical deviations between financial market time series and models presuming pure random walk behavior have been observed in recent decades [2]-[6]. Mandelbrot [7,8] already discovered in the 1960s that commodity market time series obey fat-tailed price change distributions [9]. His analysis was based on historical cotton times and sales records dating back to the beginning of the 20th century. In step with the technological improvements in computing resources, trading processes were adapted in order to create fully electronic market places. Thus, the available amount of historical price data increased impressively. As a consequence, the empirical properties found by Mandelbrot were confirmed. However, the amount of transaction records available today, with time stamps in units of milliseconds, also requires increased computing resources for its analysis. From such analyses, scaling behavior, short-time anti-correlated price changes and volatility clustering [10,11] in financial markets are well established and can be reproduced, e.g. by a model of the continuous double auction [12,13] or by various agent-based models of financial markets [14]-[22].
Furthermore, the price formation process and cross-correlations [23,24] between equities and equity indices have been studied with the clear intention of optimizing asset allocation and portfolios. However, in contrast to such calculations, which can be done with conventional computing facilities, a large demand for computational power is driven by the quantitative hedge fund industry and also by modern market making, which mostly requires real-time analytics. A market maker usually provides quotes for buying or selling a given asset. In the competitive environment of electronic financial markets, this cannot be done by a human market maker alone, especially if a large number of assets is quoted concurrently. The rise of the hedge fund industry in recent years and its interest in taking advantage of short-time correlations boosted the real-time analysis of market fluctuations and market micro-structure analysis in general, which is the study of the process of exchanging assets under explicit trading rules [25] and which is studied intensely by the financial community [26]-[29].
Such computing requirements, which arise in various interdisciplinary fields such as computational physics, including, e.g., Monte Carlo and molecular dynamics simulations [30]-[32] or stochastic optimization [33], make the use of high-performance computing resources necessary. This includes recent multi-core computing solutions based on a shared memory architecture, which are accessible via OpenMP [34] or MPI [35] and can be found in recent personal computers as a standard configuration. Furthermore, distributed computing clusters with homogeneous or heterogeneous node structures are available in order to parallelize a given algorithm by separating it into various sub-algorithms.
However, a recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance, i.e. the required processing times can be reduced to a great extent. Some applications have already been realized in computational physics [31], [36]-[43]. Recently, the Monte Carlo simulation of the two-dimensional and three-dimensional ferromagnetic Ising model could be accelerated up to 60 times [44] using a graphics card architecture. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics processing. In the beginning, GPU programs used C-like programming environments for kernel execution such as the OpenGL shading language [45] or C for graphics (Cg) [46]. The compute unified device architecture (CUDA) [47] is an almost conventional programming approach making use of the unified shader design of recent GPUs from the NVIDIA corporation. The programming interface allows one to implement an algorithm using standard C language without any knowledge of the native programming environments. A comparable concept, 'Close To the Metal' (CTM) [48], was introduced by Advanced Micro Devices Inc. for ATI graphics cards. One has to state that the computational power of consumer graphics cards exceeds that of a central processing unit (CPU) by roughly 1-2 orders of magnitude. A conventional CPU nowadays provides a peak performance of roughly 20 × 10^9 floating point operations per second (FLOPS) [31]. The consumer graphics card NVIDIA GeForce GTX 280 reaches a theoretical peak performance of 933 × 10^9 FLOPS. If one tried to realize the computational power of one GPU with a cluster of several CPUs, a much larger amount of electric power would be required. A GTX 280 graphics card exhibits a maximum power consumption of 236 W [49], while a recent Intel CPU consumes roughly 100 W.
We apply this general-purpose graphics processing unit (GPGPU) technology to methods of time series analysis, which includes determination of the Hurst exponent and equilibrium autocorrelation function. Additionally, the recently introduced pattern conformity observable [50], which is able to quantify pattern-based complex short-time correlations of a time series, is calculated on a GPU. Furthermore, we compare the recent GPU generation with the previous one. All methods are applied to a high-frequency data set of the Euro-Bund futures contract traded at the electronic derivatives exchange Eurex.
The paper is organized as follows. In section 2, a brief overview of key facts and properties of the GPU architecture is provided in order to clarify implementation constraint details for the following sections. A GPU-accelerated Hurst exponent estimation can be found in section 3. In section 4, the equilibrium autocorrelation function is implemented on a GPU and in section 5, the pattern conformity is analyzed on a GPU in detail. In each of these sections, the performance of the GPU code as a function of parameters is first evaluated for a synthetic time series and compared to the performance on a CPU. Then the time series methods are applied to a financial market time series and a discussion of numerical errors is presented. Finally, our conclusions are summarized in section 6.

GPU device architecture
In order to provide and discuss information about implementation details on a GPU for time series analysis methods, key facts of the GPU device architecture are briefly summarized in this section. As mentioned in the introduction, we use the compute unified device architecture (CUDA), which allows the implementation of algorithms using standard C language with CUDA specific extensions. Thus, CUDA issues and manages computations on a GPU as a data-parallel computing device.
The graphics card architecture, which is used in recent GPU generations, is built around a scalable array of streaming multiprocessors (MPs) [47]. One such MP contains, among other units, eight scalar processor cores, a multi-threaded instruction unit and shared memory, which is located on-chip. When a C program using CUDA extensions and running on the CPU invokes a GPU kernel, which is a synonym for a GPU function, many copies of this kernel, known as threads, are enumerated and distributed to the available MPs, where their execution starts. For such an enumeration and distribution, a kernel grid is subdivided into blocks and each block is subdivided into various threads, as illustrated in figure 1 for a two-dimensional thread and block structure. The threads of a thread block are executed concurrently on the available MPs. In order to manage a large number of threads, a single-instruction multiple-thread (SIMT) unit is used. An MP maps each thread to one scalar processor core and each scalar thread executes independently. Threads are created, managed, scheduled and executed by this SIMT unit in groups of 32 threads. Such a group of 32 threads forms a warp, which is executed on the same MP. If the threads of a given warp diverge via a data-induced conditional branch, each branch of the warp is executed serially and the processing time of this warp consists of the sum of the branches' processing times.
As shown in figure 2, each MP of the GPU device contains local 32-bit registers per processor as well as shared memory, which is accessible by all scalar processor cores of the MP. Furthermore, a constant cache and a texture cache are available, which are likewise shared within an MP. In order to allow the reduction of results across the involved MPs, slower global memory can be used, which is shared among all MPs and is also accessible from the C function running on the CPU. Note that the GPU's global memory is still roughly 10 times faster than the current main memory of personal computers. Detailed facts about the consumer graphics cards 8800 GT and GTX 280 used by us can be found in table 1. Furthermore, note that a GPU device only supports single-precision floating-point operations, with the exception of the most modern graphics cards starting with the GTX 200 series. However, the IEEE-754 standard for single-precision numbers is not completely realized; deviations can be found especially for rounding operations. In contrast, the GTX 200 series also supports double-precision floating-point numbers. However, each MP features only one double-precision processing core and so the theoretical peak performance is dramatically reduced for double-precision operations. Further information about the GPU device properties and CUDA can be found in [47].

Hurst exponent
The Hurst exponent H [51] provides information on the relative tendency of a stochastic process. A Hurst exponent H < 0.5 indicates an anti-persistent behavior of the analyzed process, which means that the process is dominated by a mean reversion tendency. H > 0.5 mirrors a super-diffusive behavior of the underlying process: extreme values tend to be followed by extreme values. If the deviations from the mean values of the time series are independent, which corresponds to a random walk behavior, a Hurst exponent of H = 0.5 is obtained. The Hurst exponent H was originally introduced by Harold Edwin Hurst [52], a British government administrator. He studied records of the Nile river's volatile rain and drought conditions and noticed interesting coherences for flood periods. Hurst observed in the eight centuries of records that there was a tendency for a year with good flood conditions to be followed by another year with good flood conditions. Nowadays, the Hurst exponent as a scaling exponent is well studied in the context of financial markets [50], [53]-[56]. Typically, an anti-persistent behavior can be found on short timescales due to the nonzero gap between bid and ask prices. On medium timescales, a super-diffusive behavior can be detected [54]. On long timescales, a diffusive regime is reached, due to the law of large numbers.
For a time series p(t) with t ∈ {1, 2, . . . , T}, the time lag-dependent Hurst exponent H_q(Δt) can be determined by the general relationship

⟨|p(t + Δt) − p(t)|^q⟩ ∝ Δt^(q H_q(Δt))     (1)

with the time lag Δt ≪ T and Δt ∈ N. The brackets ⟨. . .⟩ denote the expectation value. Apart from (1), there are also other calculation methods, e.g. the rescaled range analysis [51]. In the following, we present the Hurst exponent determination implementation on a GPU for q = 1 and use H(Δt) ≡ H_{q=1}(Δt). The process to be analyzed is a synthetic anti-correlated random walk, which was introduced in [50]. This process emerges from the superposition of two random walk processes with different timescale characteristics; thus, a parameter-dependent anti-correlation at time lag one can be realized. As a first step, one has to allocate memory in the GPU device's global memory for the time series, intermediate results and final results. In a first approach, the time lag-dependent Hurst exponent is calculated up to Δt_max = 256. In order to simplify the reduction process of the partial results, the overall number of time steps T has to satisfy the condition T = (2^α + 1) × Δt_max, with α being an integer number called the length parameter of the time series. The number of threads per block, known as the block size, is equivalent to Δt_max. The array for intermediate results has length T too, whereas the array for the final results contains Δt_max entries. After allocation, the time series data have to be transferred from main memory to the GPU's global memory. Once this step is completed, the main calculation can start. As illustrated in figure 3 for block 0, each block, which contains Δt_max threads, loads Δt_max data points of the time series from global memory into shared memory. In order to realize such a high-performance loading process, each thread loads one value and stores it in the array located in shared memory, which can be accessed by all threads of the block.
Analogously, each block also loads the next Δt_max entries. In the main processing step, each thread is in charge of one specific time lag. Thus, each thread is responsible for a specific value of Δt and sums the terms |p(t + Δt) − p(t)| in the block's sub-segment of the time series. As the maximum time lag is equivalent to the maximum number of threads, and as the maximum time lag is also equivalent to half the number of data points loaded per block, all threads have to sum the same number of addends, and so a uniform workload of the graphics card is realized. However, as it is only possible to synchronize threads within a block and no native inter-block synchronization exists, the partial results of each block have to be stored in block-dependent areas of the array for intermediate results, as shown in figure 3. The termination of the GPU kernel function ensures that all blocks have been executed. In a post-processing step, the partial arrays have to be reduced. This is realized by a binary tree structure, as indicated in figure 3. After this reduction, the resulting values can be found in the first Δt_max entries of the intermediate array, and a final processing kernel is responsible for normalization and gradient calculation. The source code of these GPU kernel functions can be found in the appendix.
For the comparison between CPU and GPU implementations, we use an Intel Core 2 Quad CPU (Q6700) with 2.66 GHz and 4096 kB cache size, of which only one core is used. A smaller speed-up factor can be measured for small values of α, as the relative fraction of allocation time and time for memory transfer is larger in comparison to the time needed for the calculation steps. The corresponding analysis for the GTX 280 yields a larger acceleration factor β of roughly 70. If we increase the maximum time lag Δt_max to 512, which is only possible for the GTX 280, a maximum speed-up factor of roughly 80 can be achieved, as shown in figure 5. This indicates that Δt_max = 512 leads to a higher workload on the GTX 280. At this point, we can also compare the ratio between the performances of the 8800 GT and the GTX 280 for our application to the ratio of theoretical peak performances. The latter is given as the number of cores multiplied by the clock rate, which amounts to roughly 1.84. If we compare the total processing times on these GPUs for α = 15 and Δt_max = 256, we obtain an empirical performance ratio of 1.7. If we use the acceleration factors for Δt_max = 256 on the 8800 GT and for Δt_max = 512 on the GTX 280 for comparison, we obtain a value of 2. After this performance analysis, we apply the GPU implementation to real financial market data in order to determine the Hurst exponent of the Euro-Bund futures contract traded at the European exchange (Eurex). In this context, we will also gauge the accuracy of the GPU calculations by quantifying deviations from the calculation on a CPU. The Euro-Bund futures contract (FGBL) is a financial derivative. A futures contract is a standardized contract to buy or sell a specific underlying instrument at a proposed date in the future, which is called the expiration time of the futures contract, at a specified price.
The underlying instruments of the FGBL contract are long-term debt instruments issued by the Federal Republic of Germany with remaining terms of 8.5-10.5 years. In all presented calculations of the FGBL time series on the GPU, α is fixed to 11; thus, the data set is limited to the first T = 1 049 088 trades in order to fit the data set length to the constraints of the specific GPU implementation. In figure 7, the time lag-dependent Hurst exponent H(Δt) is presented. On short timescales, the well-known anti-persistent behavior can be detected. On medium timescales (2^4 time ticks < Δt < 2^7 time ticks), there is weak evidence that the price process reaches a slightly super-diffusive regime (H ≈ 0.525). On long timescales, the price dynamics tend to random walk behavior (H = 0.5), which is also shown for comparison. The relative error shown in the inset of figure 7 is smaller than one-tenth of a per cent.

Equilibrium autocorrelation
The autocorrelation function is a widely used concept for determining dependencies within a time series. It is given by the correlation between the time series and the time series shifted by the time lag Δt through

ρ(Δt) = (⟨p(t) p(t + Δt)⟩ − ⟨p(t)⟩⟨p(t + Δt)⟩) / (σ_{p(t)} σ_{p(t+Δt)}).     (4)

For a stationary time series, (4) reduces to

ρ(Δt) = (⟨p(t) p(t + Δt)⟩ − ⟨p(t)⟩²) / (⟨p(t)²⟩ − ⟨p(t)⟩²),     (5)

as the mean value and the variance stay constant, i.e. ⟨p(t)⟩ = ⟨p(t + Δt)⟩ and ⟨p(t)²⟩ = ⟨p(t + Δt)²⟩.
Applied to financial markets, it can be observed that the autocorrelation function of price changes exhibits a significant negative value at a time lag of one tick, whereas it vanishes for time lags Δt > 1. Furthermore, the autocorrelation of absolute price changes or squared price changes, which is related to the volatility of the price process, decays slowly [15]. In order to implement (5) on a GPU architecture, steps similar to those in section 3 are necessary. The calculation of the time lag-dependent part ⟨p(t) · p(t + Δt)⟩ is analogous to the determination of the Hurst exponent on the GPU. The input time series, which is transferred to the GPU's global memory, does not contain prices but price changes. However, in addition one needs the results for ⟨p(t)⟩ and ⟨p(t)²⟩. For this purpose, an additional array of length T is allocated, in which a GPU kernel function stores the squared values of the time series. Then, the time series and the squared time series are reduced with the same binary tree reduction process as in section 3. However, as this procedure produces arrays of length Δt_max, one has to sum these values in order to obtain ⟨p(t)⟩ and ⟨p(t)²⟩. The processing times for determining the autocorrelation function for Δt_max = 256 on the CPU and the 8800 GT can be found in figure 8. Here, we find that allocation and memory transfer dominate the total processing time on the GPU for small values of α, and thus only a fraction of the maximum acceleration factor β ≈ 33, which is shown as an inset, can be reached.
Using the consumer graphics card GTX 280, we obtain a maximum speed-up factor of roughly 55 for Δt_max = 256 and 68 for Δt_max = 512, as shown in figure 9. In figure 10, the autocorrelation function of the FGBL time series is shown. At time lag one, the time series exhibits a large negative autocorrelation, ρ(Δt = 1) = −0.43. In order to quantify deviations between GPU and CPU calculations, the relative error is presented in the inset of figure 10. Note that small absolute errors can cause relative errors of up to three per cent because the values ρ(Δt > 1) are close to zero.
For some applications, it is interesting to study larger maximum time lags of the autocorrelation function. Based on our GPU implementation, one has to modify the program code in the following way. So far, each thread was responsible for a specific time lag Δt. In a modified ansatz, each thread is responsible for more than one time lag in order to realize a maximum time lag that is a multiple of the maximal 512 threads per block. This way, one obtains a maximum speed-up factor of, e.g., roughly 84 for Δt_max = 1024 using the GTX 280.

Fluctuation pattern conformity
As a third method of time series analysis, the recently introduced fluctuation pattern conformity (PC) determination [50] was migrated to a GPU architecture. The PC quantifies pattern-based complex short-time correlations of a time series. In the context of financial market time series, the existence of complex correlations implies that reactions of market participants to a given time series pattern are related to comparable patterns in the past. On medium and long timescales, one can state that no significant complex correlations can be measured because the price process exhibits random walk statistics. However, if one investigates the trading process on a tick-by-tick basis, evidence is given for recurring events. In the course of these considerations, a general pattern conformity observable is defined in [50], which is not limited to the application to financial market time series. In general, the aim is to compare a current pattern of time interval length Δt⁻ with all possible previous patterns of the time series p(t). The current observation time shall be denoted by t̃. Then, the current pattern's time interval measured in time ticks is given by [t̃ − Δt⁻; t̃). The evolution after this current pattern interval (the distance to t̃ is expressed by Δt⁺, see below) is compared with the prediction derived from all historical patterns. However, as the standard deviation of the price process is not constant in time, all comparison patterns have to be normalized with respect to the current pattern. For this purpose, the true range is used, i.e. the difference between high and low within each interval. Let p_h(t̃, Δt⁻) be the maximum value of a pattern of length Δt⁻ at time t̃ and analogously p_l(t̃, Δt⁻) the minimum value. Thus, we can create a modified time series, which is true range adapted in the appropriate time interval, through

p̃_{Δt⁻}^{t̃}(t) = (p(t) − p_l(t̃, Δt⁻)) / (p_h(t̃, Δt⁻) − p_l(t̃, Δt⁻)),     (6)

as illustrated in figure 11, which shows a schematic visualization of the pattern conformity estimation mechanism. The current pattern p̃_{Δt⁻}^{t̃}(t) and the comparison pattern p̃_{Δt⁻}^{t̃−τ}(t − τ) have the maximum value 1 and the minimum value 0 in [t̃ − Δt⁻; t̃), as shown by the filled rectangle. For the pattern conformity calculation, we need to analyze for each time difference Δt⁺ whether the current pattern value and the comparison pattern value at t̃ + Δt⁺ lie above or below the last value of the current pattern p̃_{Δt⁻}^{t̃}(t̃ − 1). If both are above or both are below this last value, then +1 is added to the non-normalized pattern conformity ξ_χ(Δt⁺, Δt⁻); if one is above and the other below, then −1 is added.
In order to assess the match of a pattern with a comparison pattern, the fit quality Q_{Δt⁻}^{t̃}(τ) between the current pattern sequence p̃_{Δt⁻}^{t̃}(t) and a comparison pattern p̃_{Δt⁻}^{t̃−τ}(t − τ) in [t̃ − Δt⁻; t̃) has to be determined by the summation of the squared deviations through

Q_{Δt⁻}^{t̃}(τ) = (1/Δt⁻) Σ_{i=1}^{Δt⁻} [p̃_{Δt⁻}^{t̃}(t̃ − i) − p̃_{Δt⁻}^{t̃−τ}(t̃ − τ − i)]².     (7)

Note that Q_{Δt⁻}^{t̃}(τ) takes values in the interval [0, 1] as a result of the true range adaption. With these elements, one can define a pre-stage of the PC, which is not yet normalized, as motivated in figure 11, by

ξ_χ(Δt⁺, Δt⁻) = Σ_{t̃} Σ_{τ=τ*}^{τ̄} exp(−χ Q_{Δt⁻}^{t̃}(τ)) sgn_{τ}^{t̃}(Δt⁺)     (8)

with τ* = t̃ − τ̄ if t̃ − τ̄ − Δt⁻ ≥ 0 and τ* = Δt⁻ else. In general, we limit the evaluation for each pattern to maximally τ̄ historical patterns. Furthermore, for the sign function, we use the standard definition sgn(x) = 1 for x > 0, sgn(x) = 0 for x = 0 and sgn(x) = −1 for x < 0. In (8), the parameter χ weighs pattern terms according to their qualities Q_{Δt⁻}^{t̃}(τ). The comparison of the signs of current and comparison pattern sequences after t̃ for a proposed Δt⁺, relative to p̃_{Δt⁻}^{t̃}(t̃ − 1), is given by

sgn_{τ}^{t̃}(Δt⁺) = sgn[p̃_{Δt⁻}^{t̃}(t̃ + Δt⁺) − p̃_{Δt⁻}^{t̃}(t̃ − 1)] × sgn[p̃_{Δt⁻}^{t̃−τ}(t̃ − τ + Δt⁺) − p̃_{Δt⁻}^{t̃}(t̃ − 1)].     (9)

By normalizing (8) through its altered version, in which the sign function is replaced by its absolute value, the pattern conformity can be written as

Ξ_χ(Δt⁺, Δt⁻) = ξ_χ(Δt⁺, Δt⁻) / [Σ_{t̃} Σ_{τ=τ*}^{τ̄} exp(−χ Q_{Δt⁻}^{t̃}(τ)) |sgn_{τ}^{t̃}(Δt⁺)|].     (10)

We repeat that the pattern conformity is the most accurate measure to characterize the short-term correlations of a general time series. It is essentially given by the comparison of subsequences of the time series: subsequences of various lengths are compared with historical sequences in order to extract similar reactions to similar patterns.
In order to realize a GPU implementation of the pattern conformity provided in (10), one has to allocate memory as for the Hurst exponent and the autocorrelation function determination in sections 3 and 4, respectively. The allocation is needed for the array containing the time series, which has to be transferred to the global memory of the GPU, and for further processing arrays. The main processing GPU function is invoked with a proposed Δt⁻ and a given t̃. In the kernel function, shared memory arrays for the comparison and current pattern sequences are allocated and loaded from the global memory of the GPU. In the main calculation, each thread handles one specific comparison pattern, i.e. each thread is responsible for one value of τ, and so τ̄ = γ × σ is applied, with γ denoting the scan interval parameter and σ denoting the number of threads per block. Thus, γ corresponds to the number of blocks. The partial results of ξ_χ(Δt⁺, Δt⁻) are stored in an array of dimension τ̄ × Δt⁺_max in global memory. These partial results have to be reduced in a further processing step, which uses the same binary tree structure as applied in section 3 for the Hurst exponent determination. For the FGBL analysis discussed below (figure 13(c) shows the relative error in per cent between the GPU and CPU calculations with the same parameter settings), the processing time on the GPU was 5.8 h, whereas the results on the CPU were obtained after 137.2 h, which corresponds to roughly 5.7 days. Thus, for these parameters an acceleration factor of roughly 24 is obtained.
The pattern conformity of a random walk time series, which exhibits no correlations by construction, is 0; the pattern conformity of a perfectly correlated time series is 1 [50]. A maximum speed-up factor of roughly 10 can be obtained for the calculation of the pattern conformity for Δt⁻_max = Δt⁺_max = 20, T = 25 000, χ = 100 and σ = 256 using the 8800 GT. In figure 12, the corresponding results using the GTX 280 are shown in dependence on the scan interval parameter γ, with the total processing time on the GPU broken down into allocation time, time for memory transfer and time for main processing. Here, a maximum acceleration factor of roughly 19 can be realized.
With this method, which is able to detect complex correlations of a time series, it is also possible to search for pattern conformity based complex correlations in financial market data, as shown in figure 13 for the FGBL time series. In figure 13(a), the results for the pattern conformity Ξ^GPU_{χ=100}(Δt⁻, Δt⁺) are presented with τ̄ = 16 384, calculated on the GTX 280. One can clearly see that for small values of Δt⁻ and Δt⁺ large values of Ξ^GPU_{χ=100} are obtained, with a maximum value of roughly 0.8. For the results shown in figure 13(b), the calculation of the pattern conformity is executed on the CPU, and in figure 13(c), the relative absolute error is shown, which is smaller than two-tenths of a per cent. This small error arises because the GPU device sums only a large number of the weighted values +1 and −1. Thus, the limitation to single precision has no significant negative effect on the result.
This raw pattern conformity profile is dominated by trivial pattern correlation parts caused by the jumps of the price process between the best bid and best ask price: the best bid price is given by the highest limit order price of all buy orders in an order book and, analogously, the best ask price is given by the lowest limit order price of all sell orders in an order book. As shown in [50], there are possibilities for reducing these trivial pattern conformity parts. For example, it is possible to add such jumps around the spread synthetically to a random walk. Let p*_φ be the time series of the synthetically created anti-correlated random walk (ACRW) obtained in a Monte Carlo simulation through p*_φ(t) = a_φ(t) + b(t), which was used in sections 3-5 as the synthetic time series. With probability φ ∈ [0; 0.5], an increment a_φ(t + 1) − a_φ(t) = +1 is applied, and with probability φ a decrement a_φ(t + 1) − a_φ(t) = −1 occurs. With probability 1 − 2φ, a_φ(t + 1) = a_φ(t) is used. The stochastic variable b(t) models the bid-ask spread and can take the value 0 or 1 in each time step, each with probability 0.5. Thus, by changing φ, the characteristic timescale of process a_φ in comparison to process b can be modified.
Parts of the pattern-based correlations in figure 13 stem from this trivial negative autocorrelation at Δt = 1. In order to correct for this, in figure 14 (an animated visualization can be found in the multimedia enhancements of this publication), the pattern conformity of the ACRW with φ = 0.044, which reproduces the anti-correlation of the FGBL time series at time lag Δt = 1, is subtracted from the data of figure 13(a). Obviously, the autocorrelation at time lag Δt = 1, which is understood from the order book structure, is not the sole reason for the pattern formation conformity shown in figure 13(a). Thus, evidence is obtained that financial market time series show pattern correlations on very short timescales beyond the simple anti-persistence that is due to the gap between bid and ask prices.

Conclusion and outlook
In this paper, we applied the compute unified device architecture, a programming approach for issuing and managing computations on a GPU as a data-parallel computing device, to methods of fluctuation analysis. Firstly, the Hurst exponent calculation performed on a GPU was presented. These results on the scaling behavior of a stochastic process can be obtained up to 80 times faster than on a current CPU core, and the relative absolute error between the results obtained on the CPU and the GPU is smaller than 10^−3. The calculation of the equilibrium autocorrelation function was also successfully migrated to a GPU device and applied to a financial market time series. In this case, acceleration factors of up to roughly 84 were realized. In a further part, the pattern formation conformity algorithm, which quantifies pattern-based complex short-time correlations of a time series, was determined on a GPU. For this application, the GPU was up to 24 times faster than the CPU, and the values provided by the GPU and the CPU differ only by a relative error of maximally two-tenths of a per cent. Furthermore, we could verify that the current GPU generation is roughly two times faster than the previous one. The presented methods were applied to an FGBL time series of the Eurex, which exhibits an anti-persistent regime on short timescales. Evidence was found that a super-diffusive regime is reached on medium timescales. On long timescales, the FGBL time series complies with random walk statistics. Furthermore, the anti-correlation at time lag one, an empirical stylized fact of financial market time series, was verified. The pattern conformity used is the most accurate measure to characterize the short-term correlations of a general time series. It is essentially given by the comparison of subsequences of the time series: subsequences of various lengths are compared with historical sequences in order to extract similar reactions to similar patterns.
The pattern conformity of the FGBL contract exhibits large values of up to 0.8. However, these values also include the trivial autocorrelation property at time lag one, which can be removed by subtracting the pattern conformity of a synthetic anti-correlated random walk. Significant pattern-based correlations are still exhibited after this correction. Thus, evidence is obtained that financial market time series show pattern correlations on very short timescales beyond the simple anti-persistence that is due to the gap between bid and ask prices. Further applications of the GPU-accelerated techniques in the context of Monte Carlo simulations and agent-based modeling of financial markets are certainly well worth pursuing. As already mentioned in the introduction, the main advantage of general-purpose computations on GPUs is that one does not need special-purpose computers. Although GPU computing opens up a large variety of possibilities, the recent development of using graphics cards for scientific computing will perhaps also revive special-purpose computing, as GPU implementations are not appropriate for every problem.