Multifractal Hopscotch in Hopscotch by Julio Cortázar

Punctuation is the main factor introducing correlations in natural language written texts and it crucially impacts their overall effectiveness, expressiveness, and readability. Punctuation marks at the end of sentences are of particular importance as their distribution can determine various complexity features of written natural language. Here, the sentence length variability (SLV) time series representing Hopscotch by Julio Cortázar are subjected to quantitative analysis in an attempt to identify their distribution type, long-memory effects, and potential multiscale patterns. The analyzed novel is an important and innovative piece of literature whose essential property is the freedom of movement between its building blocks that the author gives to a reader. The statistical consequences of this freedom are closely investigated in the original Spanish version of the novel and in its translations into English and Polish. Despite this freedom, clear evidence of rich multifractality in the SLV dynamics, with a left-sided asymmetry, is observed in all three language versions as well as in the versions with differently ordered chapters.


Introduction
One of the perspectives on research on natural language is the one derived from complexity science [1,2], as multiple traits identify natural language as a complex system [3]. Numerous methods widely used in studying complex systems, based on concepts originating in information theory [4][5][6], time series analysis [7][8][9][10][11], network science [12][13][14][15][16][17][18] and the theory of power-law probability distributions [19][20][21][22], have been used to grasp the various quantitative characteristics of natural language. Understanding the mechanisms behind such characteristics has the potential to improve the methods of natural language processing and generation. This is especially relevant today, given the heightened focus on large language models (LLMs) [23,24], incorporated in generative AI systems like OpenAI's ChatGPT, Microsoft's Bing Chat, and Google's Bard. One of the language properties recently investigated with statistical methods is the usage of punctuation in written language: it has been demonstrated that the distribution of word counts between consecutive punctuation marks in literary texts generally follows a discrete Weibull distribution [3,25,26]. What is interesting and significant, however, is the fact that the two parameters of this Weibull distribution differ between languages and are, to a large extent, specific to a given language. Consistent with this fact, the punctuation distributions of texts after translation into another language are still governed by the Weibull distribution but with the parameters of the target language [3,25]. Interestingly, the patterns of sentence-ending punctuation marks, such as periods, exhibit much more flexibility and are less strictly bound by this distribution. The intervals between such marks, equivalent to sentence lengths, can therefore display a more diverse range of patterns.
Among the works considered important in world literature, a book interesting from the aforementioned quantitative perspective is Hopscotch by Julio Cortázar. At its core, this novel challenges traditional narrative structures and invites readers to engage with the text in a non-linear manner. The novel presents multiple reading options: one can either follow the conventional order of chapters or choose to explore the narrative via a suggested "hopscotch" order, jumping between chapters in a non-sequential fashion [27,28]. This experimental approach allows readers to experience the story in a unique, personalised way reflecting the author's desire to break away from conventional storytelling. As a consequence of this free-form structure of the novel, the order of sentences and smaller textual constituents depends on a chosen sequence of chapters, which implies a strong impact on long-range correlations, both the linear and nonlinear ones. Since the novel comprises an eclectic mix of styles, including passages of stream-of-consciousness and experimental language play, the overall richness and depth of the narrative results in a convoluted, multilevel linguistic construct that, if transformed into numbers, is expected to reveal substantial complexity with self-similarity being one of its likely manifestations.

Materials and Methods
In its original Spanish version [29], Hopscotch consists of 155 chapters that can be read in the printed order, in the "hopscotch" order recommended by the author, in which the number of the subsequent chapter is explicitly given at the end of the previous one, or in any other order that a reader forms by permuting the chapters. Because of such alternatives, all foreign-language translations preserve the structure of the novel and the author's recommendations. To illustrate this point, two translations are considered along with the Spanish text: the English one [30] and the Polish one [31]. We transform the text in each language version into a sentence length variability (SLV) time series by counting words between each two consecutive sentence-ending punctuation marks (these can be periods, question marks, exclamation marks, and other equivalent marks). The related statistical data are collected in Table 1. The SLV data reflect the evolution of the size of the main functional building blocks throughout the whole novel, which has been shown to be substantially informative [25,32]. Punctuation may be viewed as a process of interrupting an otherwise ceaseless stream of words that improves understanding of the message and allows a reader to have necessary breaks. These functions make occurrences of the punctuation marks suitable for survival analysis [33]. It has already been documented in the literature that the inter-mark distances develop distributions that can be modelled by the discrete Weibull distribution [3,25]. This distribution is given by the following probability mass function [34]:
$$f(k) = (1-p)^{(k-1)^{\beta}} - (1-p)^{k^{\beta}}, \qquad k = 1, 2, \ldots, \tag{1}$$
and the following cumulative distribution function:
$$F(k) = 1 - (1-p)^{k^{\beta}}, \tag{2}$$
where 0 < p < 1 and β > 0. The latter describes the probability that the random variable takes on a value not exceeding k.
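The construction of an SLV time series described above can be illustrated with a minimal Python sketch. The set of sentence-ending marks used here (period, question mark, exclamation mark, ellipsis character) and the helper name `slv_series` are assumptions made for illustration; the exact tokenisation rules applied in the study are not specified in the text, and the sketch does not handle abbreviations or decimal numbers.

```python
import re

# Assumption: a run of any of these characters ends a sentence.
# Abbreviations ("Mr.") and decimals ("3.14") are NOT treated specially.
SENTENCE_END = re.compile(r"[.?!…]+")

def slv_series(text):
    """Return the list of sentence lengths (in words) for `text`."""
    lengths = []
    for chunk in SENTENCE_END.split(text):
        n_words = len(chunk.split())   # words = whitespace-separated tokens
        if n_words > 0:                # skip empty fragments, e.g. after "?!"
            lengths.append(n_words)
    return lengths

sample = ("It rained. Did anyone notice the long grey street at all? "
          "No! They walked on, slowly, without a word.")
print(slv_series(sample))
```

Each element of the returned list is one point of the SLV time series; concatenating the lists obtained chapter by chapter, in a chosen chapter order, gives the series for that order.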
The discrete Weibull distribution can be considered as a generalisation of the geometric distribution with probability p, to which it reduces if β = 1, i.e., if the hazard function (the conditional probability that the event occurs at step k given that it has not occurred earlier) is constant in k. If β > 1, this probability increases with k, while the opposite is true for β < 1. The discrete Weibull distribution is used in various fields, including survival analysis, weather forecasting, and the study of textual data [33,35,36]. In the case of punctuation, f(k) is the probability that a mark will occur exactly after k words.
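The role of β can be checked numerically with a short sketch. It assumes the parametrisation in which k = 1, 2, ... counts words and f(k) = (1 − p)^((k−1)^β) − (1 − p)^(k^β); other index conventions (for example, k starting from 0) appear in the literature, so the function names and the indexing below are illustrative assumptions rather than the authors' implementation.

```python
def dw_pmf(k, p, beta):
    """P(X = k), k = 1, 2, ..., in the assumed parametrisation."""
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)

def dw_cdf(k, p, beta):
    """P(X <= k) in the same parametrisation."""
    return 1.0 - (1.0 - p) ** (k ** beta)

def dw_hazard(k, p, beta):
    """P(X = k | X >= k): chance that a mark follows word k,
    given that no mark occurred after the first k - 1 words."""
    return 1.0 - (1.0 - p) ** (k ** beta - (k - 1) ** beta)

# For beta = 1 the hazard is constant and equal to p (geometric case);
# for beta > 1 it grows with k, for beta < 1 it decays.
for beta in (0.8, 1.0, 1.3):
    print(beta, [round(dw_hazard(k, 0.1, beta), 4) for k in (1, 5, 20)])
```

The hazard follows from dividing f(k) by the survival probability (1 − p)^((k−1)^β), which is why only the difference k^β − (k−1)^β enters the exponent.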
Self-similarity, or the lack of a characteristic scale, is among the most significant properties of natural complex systems. From an empirical perspective, this property can manifest itself in a non-trivial temporal organisation of measurement outcomes formed as a time series. In particular, complexity is often associated with a cascade-like hierarchy of data points that reveal multiscaling, which makes practical methods of identifying this sort of organisation a valuable tool in complex systems research [37]. Our experience so far allows us to consider the multifractal detrended fluctuation analysis (MFDFA) as the most reliable method in this respect [38,39]. This method constitutes a multiscale generalisation of the earlier, commonly used detrended fluctuation analysis (DFA) [40].
Here, the essential steps of the MFDFA algorithm are briefly sketched. Let $U = \{u_i\}_{i=1}^{T}$ be a time series of T consecutive measurements of some observable u. We partition it into $M_s$ non-overlapping windows of length s starting from both ends of U, which gives us $2M_s$ such windows. In each window, we eliminate potential non-stationarity of the signal by applying a detrending procedure to an integrated signal (the so-called profile) $X = \{x_i\}_{i=1}^{s}$ whose elements have the following form:
$$x_i = \sum_{j=1}^{i} u_j, \qquad i = 1, \ldots, s, \tag{3}$$
where j indexes the data points within the given window. The detrending is carried out with the help of a polynomial $P^{(m)}$ of order m (we use m = 2 throughout this study) that is best-fitted to X in each window $\nu = 0, \ldots, 2M_s - 1$, and the variance of the resulting detrended signal is then calculated:
$$f^2(s,\nu) = \frac{1}{s} \sum_{i=1}^{s} \left( x_i - P^{(m)}_{s,\nu}(i) \right)^2. \tag{4}$$
A family of fluctuation functions of order q is defined in the next step based on the average variance across all the windows:
$$F_q(s) = \left\{ \frac{1}{2M_s} \sum_{\nu=0}^{2M_s-1} \left[ f^2(s,\nu) \right]^{q/2} \right\}^{1/q}, \tag{5}$$
where q ≠ 0 is a real number. The functions $F_q(s)$ have to be calculated for a number of different values of the scale s and the index q. Typically, the minimum s is chosen above the length of the longest sequence of constant values of U and the maximum s is T/5. In contrast, there is no typical range of q. This index can be viewed as being related to the moments of the signal, so its extreme values should not be chosen too large for a time series with heavy tails.
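The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `mfdfa` is hypothetical, the profile is computed separately inside each window, and the case q = 0 is handled by the usual logarithmic average, which the text does not discuss.

```python
import numpy as np

def mfdfa(u, scales, qs, m=2):
    """Fluctuation functions F_q(s); result has shape (len(qs), len(scales))."""
    u = np.asarray(u, dtype=float)
    T = len(u)
    F = np.empty((len(qs), len(scales)))
    for j, s in enumerate(scales):
        M = T // s
        # M windows counted from the start and M from the end: 2M in total
        starts = [v * s for v in range(M)] + [T - (v + 1) * s for v in range(M)]
        t = np.arange(s)
        var = np.empty(len(starts))
        for n, a in enumerate(starts):
            x = np.cumsum(u[a:a + s])                    # profile in the window
            trend = np.polyval(np.polyfit(t, x, m), t)   # order-m polynomial fit
            var[n] = np.mean((x - trend) ** 2)           # detrended variance
        # Note: a window with zero variance would break negative q;
        # this is why s must exceed the longest run of constant values.
        for i, q in enumerate(qs):
            if q == 0:
                F[i, j] = np.exp(0.5 * np.mean(np.log(var)))
            else:
                F[i, j] = np.mean(var ** (q / 2.0)) ** (1.0 / q)
    return F

# Usage on uncorrelated Gaussian noise (a sanity check, not SLV data)
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)
print(mfdfa(noise, scales=[16, 32, 64, 128, 256], qs=[-2, 2]))
```

Because the polynomial has order m ≥ 1, the mean of the signal contributes only a linear trend to the profile and is removed by the detrending, so it does not need to be subtracted beforehand.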
If the fluctuation functions depend on s as power laws:
$$F_q(s) \sim s^{h(q)} \tag{6}$$
for a number of different choices of q, it indicates that the time series under study is either monofractal (when h(q) is constant in q) or multifractal otherwise. The function h(q) is called the generalised Hurst exponent, because for q = 2, h(q) = H, where H is the standard Hurst exponent [41,42]. In terms of visualisation, fractal $F_q(s)$ form straight lines on double logarithmic plots. A convenient way of presenting the multifractal property of data is the singularity spectrum f(α). It can be derived from h(q) by the following formulae:
$$\alpha = h(q) + q\,h'(q), \qquad f(\alpha) = q\left[\alpha - h(q)\right] + 1, \tag{7}$$
where α is a measure of data-point singularity equivalent to the Hölder exponent and f(α) is obtained from h(q) through a Legendre transform. The function f(α) can be interpreted geometrically as a fractal dimension of the subset of the whole data set with the Hölder exponent equal to α [43]. For a monofractal time series the pair (α, f(α)) is a single point, while for a multifractal one it is an inverted parabola with shoulders pointing down. The usefulness of this representation comes from the fact that the broader the singularity spectrum is, the richer is the multifractality of the time series, which can be viewed as a measure of time series complexity. Sometimes the parabola of f(α) is distorted and asymmetric, which indicates that the data points of different amplitude have different scaling properties [44][45][46][47].
Alternatively, the scaling type may be expressed in terms of the multifractal spectrum τ(q) defined by
$$\tau(q) = q\,h(q) - 1. \tag{8}$$
For monofractal time series, τ(q) depends linearly on q (because h(q) is then constant), while it is nonlinear for multifractal ones, which allows for an easy detection of multiscaling.
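The passage from fluctuation functions to h(q), τ(q), and f(α) can be sketched as follows. The sketch assumes that an array `F` of fluctuation functions (one row per q, one column per scale) is already available and that all supplied scales lie inside the scaling region; the function name `singularity_spectrum` and the use of a numerical derivative via `np.gradient` are illustrative choices, not the authors' procedure.

```python
import numpy as np

def singularity_spectrum(qs, scales, F):
    """Return h(q), tau(q), alpha(q), f(alpha(q)) from F_q(s)."""
    qs = np.asarray(qs, dtype=float)
    log_s = np.log(np.asarray(scales, dtype=float))
    # h(q): slope of log F_q(s) versus log s (power-law scaling)
    h = np.array([np.polyfit(log_s, np.log(Fq), 1)[0] for Fq in F])
    tau = qs * h - 1.0                 # tau(q) = q h(q) - 1
    alpha = np.gradient(tau, qs)       # alpha = d tau / d q = h + q h'
    f = qs * alpha - tau               # equals q (alpha - h) + 1
    return h, tau, alpha, f

# Monofractal toy input: F_q(s) = s^0.7 for every q
scales = np.array([16, 32, 64, 128])
qs = np.linspace(-4, 4, 17)
F = np.array([scales ** 0.7 for _ in qs])
h, tau, alpha, f = singularity_spectrum(qs, scales, F)
print(alpha.max() - alpha.min())       # spectrum width for the toy input
```

For the monofractal toy input, τ(q) is linear in q, so all values of α coincide and the spectrum collapses to a single point, in line with the description in the text.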

Linear Correlations
It is instructive to look into the data structure before any further step is made. In the left column of Figure 1, time series of SLV in three different language versions of the text are shown, each of them representing an unsigned process with a heavy tail. The SLV time series representing the text with two alternative chapter orders, the one recommended by the author and a completely random one, are shown in the right column (top and middle panels) of the same figure. In both cases, clear data clustering is observed (long sentences tend to group with long ones, while short sentences tend to group with short ones), i.e., the time series are characterised by memory. In this respect, Hopscotch does not differ from other pieces of literature, where temporal memory is naturally present. It stems from the fact that sentence lengths are often connected with writing style and certain features of narrative, which usually last for some paragraphs or pages (for instance, longer sentences are often used in descriptions and slow-paced parts of texts, while short sentences are more typical of fast-paced parts and dialogues). This property vanishes if the sentences are shuffled instead of the chapters. Plots of the Pearson autocorrelation function (ACF) for the two authorial chapter orders are presented in Figure 2. The ACF shows statistically meaningful values over a range of up to 200-400 consecutive sentences, after which it reaches the noise level in both chapter orders. A trace of power-law decay of the ACF can be noticed owing to the double-log scale of the plots. Roughly the same picture is obtained if a random permutation of chapters is considered. This means that, in terms of the ACF, there is no difference between the proposed chapter orders and any other order that can be preferred by a reader (it must be noted, however, that it is impossible to consider all possible orderings, as we deal with $155! \approx 10^{273}$ possible permutations here). As might be expected, the sentence-level randomisation kills any genuine structure of SLV.
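The difference between permuting chapters and shuffling individual sentences can be illustrated on synthetic data. In the sketch below, the toy "chapters" (alternating blocks of short and long sentences), the helper names, and the biased ACF estimator are all assumptions made for illustration; no data from the novel are used.

```python
import math
import random

def acf(x, max_lag):
    """Pearson autocorrelation of x for lags 1..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / (n * var)
            for k in range(1, max_lag + 1)]

def permute_chapters(chapters, rng):
    """Reorder whole chapters; sentence order inside each chapter is kept."""
    order = list(range(len(chapters)))
    rng.shuffle(order)
    return [length for c in order for length in chapters[c]]

def shuffle_sentences(chapters, rng):
    """Shuffle all sentence lengths, ignoring chapter boundaries."""
    flat = [length for ch in chapters for length in ch]
    rng.shuffle(flat)
    return flat

# Number of chapter orders: log10(155!) computed via the log-gamma function
log10_orders = math.lgamma(156) / math.log(10)

# Toy data: 40 chapters of 50 sentences, alternately short and long
rng = random.Random(0)
chapters = [[(5 if c % 2 == 0 else 30) + rng.randint(0, 4) for _ in range(50)]
            for c in range(40)]
by_chapter = permute_chapters(chapters, rng)
by_sentence = shuffle_sentences(chapters, rng)
print(round(log10_orders, 2), acf(by_chapter, 1)[0], acf(by_sentence, 1)[0])
```

In this toy setting, the chapter-level permutation keeps the clustering of similar sentence lengths inside each chapter, whereas the sentence-level shuffle removes it, which mirrors the qualitative behaviour described in the text.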

The Weibull Analysis
For all three language versions of Hopscotch, histograms of the respective SLV time series have been calculated; they are shown in Figure 3 (dark colours) together with the best-fitted discrete Weibull distribution defined by Equation (1). It is evident that theory and the empirical distribution do not agree in any of the cases, a result that matches the results of similar analyses of different texts already reported in the literature [3,25]. A much better agreement between the two can be seen for a different type of time series (punctuation mark distance variability, PMDV), in which distances between consecutive punctuation marks have been considered without restricting the analysis to sentence endings. Thus, if marks such as the comma, colon, semicolon, dash, etc. are also included in the study, the discrete Weibull distribution becomes a better model for the data (see light-colour histograms in Figure 3). For all the language versions, the values of β for the PMDV time series exceed 1, which can be viewed as the normal case for written language, in which the probability of using a punctuation mark after a given number of words since the previous mark (the hazard function) increases with this number. This distinguishes Hopscotch from Finnegans Wake by James Joyce, where an abnormal value β < 1 has been found [48]. In contrast, for the SLV data, β ≈ 1 has been obtained for two out of three languages; it corresponds to an exponential distribution of such data. This is better seen if the plots are semi-logarithmic (see the insets in Figure 3), where the exponential decay of the histogram tails and the corresponding fits are represented by straight lines.

Generalised Hurst Exponents
The classic Hurst analysis of time series allows one to quantify linear autocorrelation, as it expresses how the observed value span depends on the length of observation [41]. In its generalised version h(q), it allows one to detect differences in memory effects between fluctuations of different magnitudes [49]. It naturally enters the multiscale analysis through Equation (6). The function h(q) is a non-increasing function of its argument, with the special case being a constant function h(q) = H for all values of q. In a practical situation, the range of considered values of q must be limited, and it depends on the probability distribution function of the underlying stochastic process. Typically, the requirement is that the qth moment of this distribution has to exist. In the present case, the related probability mass function is close to exponential, therefore no specific theoretical limit for q has to be applied. Because of this freedom, the parameter q has been limited to −7 ≤ q ≤ 7, which is a reasonable compromise between the range width, which should be as large as possible, and the available sample size. Figure 4 shows the generalised Hurst exponents for the three languages and the two authorial chapter orders. In each case h(q) is a decreasing function, which indicates that the time series are heterogeneous in their scaling behaviour for different values of q. The larger the difference $\Delta h = h(q_{\min}) - h(q_{\max})$ is, the more heterogeneous they are. However, as 0.37 ≤ Δh ≤ 0.47, the difference between the cases is more quantitative than qualitative. This is not surprising because, on the one hand, by changing the order of chapters, one effectively preserves the autocorrelation range up to the average number of sentences per chapter and, on the other hand, good translations into foreign languages are expected to preserve the structure of the original language version as closely as possible.

Multifractal Analysis
The family of the generalised Hurst exponents h(q) is closely related to the Hölder exponents α and the singularity spectrum f(α) via Equation (7). From the perspective of the amount of information that one can extract from the data, the f(α) representation is more natural and convenient than h(q), especially if a fractal analysis is carried out. However, both h(q) and f(α) are of little use without first inspecting the fluctuation functions $F_q(s)$. This is because they carry information about the possible existence of fractal scaling over a sufficient range of scales s, which is a crucial characteristic of the data. Therefore, Figures 5-7 show the family of $F_q(s)$ calculated in the range −7 ≤ q ≤ 7 for different languages and different chapter orders.
Let us start with the printed order of chapters; the corresponding fluctuation functions for different values of q are displayed in Figure 5 (main panels) for the three languages considered in this work. All the functions show a power-law dependence (approximately straight lines on double logarithmic plots) over some range of scales, whose right limit typically reaches 300-400 consecutive sentences for both positive and negative values of q. The singularity spectra derived from the scaling exponents h(q) for different intervals of q are shown in the side panels of each main panel of Figure 5. The spectra have their maximum at α ≈ 0.8, which points to a strong persistence in SLV. Going through these side panels from top to bottom, the range of the values of q gradually extends from −2 ≤ q ≤ 2 to −7 ≤ q ≤ 7. The rationale behind considering different spans of q is that there is a certain degree of asymmetry in the density of the lines related to the functions $F_q(s)$ between positive and negative values of q in the main panels. Indeed, the asymmetry index $A_\alpha$ of the f(α) spectra assumes its minimum value for the narrowest range of q and increases as the range of |q| widens. This observation is valid for all three languages, even though particular values of $A_\alpha$ can differ between languages for the equivalent intervals of q. The asymmetry is left-sided, which means that the multiscaling comes predominantly from large fluctuations in SLV. This type of asymmetry is more often seen in empirical data than its opposite counterpart [46]. As regards the widths Δα of the f(α) spectra, they increase with |q|, which is normal behaviour in such a context, but their values indicate that the SLV time series under study show multifractal scaling even for the narrowest range of q. This result was expected in view of one of our past studies, which focused on books with a stream-of-consciousness narrative [32].
Now, the order of chapters can be altered to agree with the one recommended by Cortázar. This is his "hopscotch" order, which allows a reader to view the book's plot from a different perspective. The respective results in terms of $F_q(s)$ and f(α) are documented in Figure 6, whose structure is the same as the structure of Figure 5. By altering the order, one destroys the temporal structure of the SLV time series for scales above the average length of a chapter expressed in the number of sentences, while the structure on shorter time scales remains largely unaltered. This leads to a visible, significant distortion of the behaviour of $F_q(s)$ at the scales $s \approx 10^3$ with respect to the printed order of chapters: the functions tend towards monofractality there (a narrow beam of almost parallel lines). In contrast, for the scales inside the chapters, the changes are less evident. By looking at the singularity spectra obtained for the recommended order, one observes a systematic increase in both the asymmetry $A_\alpha$ and the width Δα compared with the results for the printed order. This effect suggests the existence of a richer variety of singularity strengths, expressed by the Hölder exponents α, in the considered time series. The origin of this behaviour remains unclear at the present stage of research, however. Together with the increased left-right asymmetry of the spectra, a slight shift of the maximum of f(α) in the direction of larger values of α can be identified for the recommended order (α ≈ 0.85). This seems to be a systematic effect for the three languages. Finally, the change in $A_\alpha$ is related to this shift, because the shift, together with the unchanged value of $\alpha_{\min}$, elongates the left branch of f(α), while the length of the right branch remains roughly the same in both situations despite the fact that $\alpha_{\max}$ is larger for the recommended order than it is for the printed order.
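The link between the shift of the spectrum maximum and the asymmetry index can be made concrete with a small worked example. The text does not spell out the definition of the asymmetry index, so the sketch assumes the commonly used form A_α = (Δα_L − Δα_R)/(Δα_L + Δα_R), with Δα_L = α_0 − α_min, Δα_R = α_max − α_0, and α_0 the position of the maximum of f(α); positive values then correspond to left-sided asymmetry. The numbers below are purely hypothetical and are not taken from the figures.

```python
def spectrum_shape(alpha_min, alpha_0, alpha_max):
    """Width and asymmetry of a singularity spectrum (assumed definition)."""
    left = alpha_0 - alpha_min      # length of the left branch
    right = alpha_max - alpha_0     # length of the right branch
    width = left + right            # total width, alpha_max - alpha_min
    asym = (left - right) / width   # > 0: left-sided, < 0: right-sided
    return width, asym

# Hypothetical spectrum A: maximum at alpha_0 = 0.8
print(spectrum_shape(0.5, 0.8, 1.0))
# Hypothetical spectrum B: same alpha_min, maximum and alpha_max both
# shifted by 0.1, so the right branch keeps its length
print(spectrum_shape(0.5, 0.9, 1.1))
```

In this illustration the left branch grows from 0.3 to 0.4 while the right branch stays at 0.2, so both the width and the asymmetry index increase, which is the mechanism described in the paragraph above.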
It is interesting to investigate whether such effects can also be observed for any random permutation of chapters. Figure 7 shows the results of the multifractal analysis of the SLV data corresponding to such permutations. The results for two individual permutations are shown there together with the average results for 1000 different random permutations. Only the results for the original Spanish version of the text are shown; the remaining two language versions are characterised by quantitatively similar results. Randomisation of the chapter order brings results that do not differ much from the results for the recommended order: for both individual permutations, the spectrum width Δα is elevated with respect to its value for the printed order (0.65 and 0.70 vs. 0.56) but it is comparable to the value for the recommended order (0.65 and 0.70 vs. 0.65). Regarding the asymmetry index $A_\alpha$, the situation is less clear than in the case of Δα, because the results for one of the random permutations and for the printed order are close to each other (0.49 vs. 0.46), even though a larger discrepancy is also possible (0.61 vs. 0.46). These numbers refer to the maximum span of q (−7 ≤ q ≤ 7). It is also instructive to compare the results for individual chapter orders with the average over many random permutations (see Figure 7 (bottom)).
The significance of the results reported above has also been tested against the null hypothesis of no correlation. This has been performed by conventional surrogate testing [50], in which the SLV time series are randomised at the individual sentence level, which destroys memory. The results show that in this case the singularity spectra are shifted to smaller values of α and the maximum of f(α) is located near α = 0.5. Even though Δα for such surrogate data remains as high as 0.2, the inherent properties of the MFDFA procedure suggest that f(α) has been broadened spuriously by the finite-size effects related to the heavy-tailed probability distribution functions of the SLV data [51,52]. Another surrogate test, based on the Fourier-phase randomisation of time series, which destroys all nonlinear correlations [53], brings the expected results, i.e., the singularity spectra become point-like and located at α ≈ 0.5. Both types of surrogate tests confirm the statistical validity of the presented outcomes.
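Both kinds of surrogate data mentioned above can be generated with a few lines of NumPy. The sketch below is a generic illustration of the two techniques (random shuffling and Fourier-phase randomisation), with hypothetical function names; it is not claimed to reproduce the exact surrogate procedure of the study.

```python
import numpy as np

def shuffled_surrogate(u, rng):
    """Sentence-level shuffle: keeps the value distribution, destroys memory."""
    return rng.permutation(np.asarray(u, dtype=float))

def phase_randomised_surrogate(u, rng):
    """Fourier surrogate: keeps the amplitude spectrum (hence the linear
    autocorrelation) and randomises the Fourier phases."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    spec = np.fft.rfft(u)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=len(spec))
    phases[0] = 0.0          # leave the zero-frequency term (the mean) intact
    if n % 2 == 0:
        phases[-1] = 0.0     # the Nyquist term must stay real for even n
    return np.fft.irfft(spec * np.exp(1j * phases), n=n)

# Usage on a synthetic heavy-ish tailed positive series (not SLV data)
rng = np.random.default_rng(1)
u = rng.exponential(size=256)
s1 = shuffled_surrogate(u, rng)
s2 = phase_randomised_surrogate(u, rng)
print(u.mean(), s1.mean(), s2.mean())
```

Note that a phase-randomised surrogate generally no longer has the original value distribution (it may even contain negative values), which is one reason why the two surrogate types probe different aspects of the data.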

Summary
The time series of SLV representing Julio Cortázar's novel Hopscotch in its original Spanish version and in two foreign-language translations were analysed by means of the autocorrelation function, the Weibull analysis, and the MFDFA. Long-range memory effects were observed in terms of a power-law decay of the ACF that extended beyond the scales defined by the average chapter length for the printed order of chapters and for other possible chapter orders, including the one recommended by the author. The fluctuations of the SLV time series were distributed in agreement with an exponential distribution, at least to a certain extent (small fluctuations deviated somewhat from this pattern). For comparison, the time series of PMDV, where all punctuation marks were considered, were modelled by the discrete Weibull distribution with β > 1, a value that did not differ from its standard values for written texts. Finally, by using the MFDFA formalism, the fractal properties of the SLV time series were studied. They turned out to be multifractal for both the printed and recommended orders, as well as for all the considered random orders. The main difference between the results obtained for different chapter orderings was that all the conceptual orders (i.e., the non-printed ones) showed an enhanced variety of singularities (i.e., broader singularity spectra) present in the time series and expressed by the Hölder exponents. They also developed stronger left-sided asymmetry as compared with their counterparts for the printed order of chapters, an effect produced by a subtle but noticeable shift of the maxima of the spectra towards larger values of the Hölder exponent. It has to be stressed that multifractal structure in literary works is a rare property that has been identified only in a small fraction of texts [3,32]. Interestingly, this and the other considered properties seem to be invariant under translation into foreign languages if a translator pays sufficient attention to the structure of the original (see also [21,25,26]).
The results obtained in this analysis indicate that the printed order of chapters in Hopscotch cannot be distinguished statistically from either the recommended order or a generic order obtained via a random permutation of chapters. Also, we do not observe any qualitative difference between the recommended order and generic orders, which leads us to the conclusion that neither the printed order nor the recommended order introduces significant temporal correlations on scales above the average number of sentences in a chapter, at least not correlations that could be identified with the methodology applied here. From this perspective, each chapter constitutes a largely independent block whose position in the narrative can be arbitrarily changed without changing the statistical properties of SLV.
If the results of this study are compared with the earlier results for Finnegans Wake by J. Joyce, one observes that Hopscotch is more conventional not only in terms of literary style but also in terms of its statistical properties [3,32,48]. Its role in world literature stems from its innovative construction, including the nonlinear and reader-engaging narrative, rather than from the structural and linguistic complexity that is so striking in Finnegans Wake or Ulysses. Nevertheless, the results of this work provide an inspiration for more systematic research on the most prominent and unique literary works, as these are the ones with the potential to open new perspectives in linguistics, including those in the field of large language models. In the latter context, it is important that the methodology presented here allows for the identification and quantification of various types of correlations in written texts, including long-range ones. In the terminology of nonlinear dynamics, the presence of such correlations means a reduction in the effective dimensionality of the corresponding 'phase space' and therefore a reduction in the actual number of parameters involved. In LLMs, nowadays commonly based on neural networks, such facts can be exploited to significantly, and perhaps even crucially, reduce the number of applied parameters and thus increase their efficiency in terms of time and energy consumption.
A related prospective direction of future research, somewhat parallel to yet substantially distinct from the multifractal approach discussed here, is a dynamically oriented analysis of SLV, in which one looks for low-dimensional attractors in a reconstructed phase space of "text-writing dynamics". Apart from some early attempts [11,54], this direction still remains largely unexplored.

Figure 1. Time series of sentence length variability (SLV) measured in words for the original Spanish text of Hopscotch (top) and its translations into English (middle) and Polish (bottom). In order to ensure sufficient readability and preserve the vertical scale in all panels, the bars representing a few sentences with excessive length in each text have been rescaled by a factor of 1/3 (black arrows). Time series of SLV are presented in alternative orders of chapters: the printed order (left) and the order recommended by the author (right).

Figure 2. Autocorrelation function for SLV time series for two chapter orders: the printed order ("-p", left) and the recommended order ("-r", right), and for three languages: Spanish (ES, top), English (EN, middle), and Polish (PL, bottom). Noise level is indicated by a red dashed line in each panel. Note the double logarithmic scale of the plots.

Figure 3. Histograms of the SLV time series (dark colour) for three different language versions of Hopscotch: Spanish (top left), English (top right), and Polish (bottom) together with the histograms of the distances (in words) between any two consecutive punctuation marks (both the sentence-ending ones and the intra-sentence ones, light colour). Discrete Weibull distributions that have been least-squares fitted to the histograms of both types are also shown in each panel (dashed and solid lines, respectively). In each case, values of the p and β parameters of the fits are explicitly given in legend boxes. (Insets) Histogram of the SLV time series with the fitted discrete Weibull distribution presented on the semi-logarithmic scale.

Figure 4. The generalised Hurst exponent h(q) for the SLV time series representing Hopscotch in different languages: Spanish (top), English (middle), and Polish (bottom). The printed order (left) and the recommended order (right) of chapters are shown separately. The difference between extreme values of h(q) for −7 ≤ q ≤ 7 is given by Δh in each case.

Figure 5. (Main) q-dependent fluctuation functions $F_q(s)$ calculated for time series of SLV in the Spanish original text of Hopscotch (top) and its two translations into English (middle) and Polish (bottom). The printed order of chapters is considered. The plots of $F_q(s)$ for particular values of q are indicated by arrows. (Side) Singularity spectra f(α) associated with the exponents h(q) calculated from the scaling regions of the respective functions $F_q(s)$ for three intervals: q ∈ [−2, 2] (top right), q ∈ [−4, 4] (middle right), and q ∈ [−7, 7] (bottom right). In each case the asymmetry coefficient $A_\alpha$ and the singularity spectrum width Δα are also given.

Figure 6. The same functions as in Figure 5 for the recommended order of chapters.

Figure 7. The same functions as in Figure 5 for the Spanish version of Hopscotch with two random permutations of chapters (top and middle) and the average taken over a set of 1000 sample permutations (bottom).

Table 1. Essential statistics on the SLV data considered in this study.