Supplementary Information-Identification of Literary Movements Using Complex Networks to Represent Texts

The use of statistical methods to analyze large databases of text has been useful to unveil patterns of human behavior and establish historical links between cultures and languages. In this study, we identify literary movements by treating books published from 1590 to 1922 as complex networks, whose metrics were analyzed with multivariate techniques to generate six clusters of books. These clusters correspond to time periods coinciding with relevant literary movements over the last five centuries. The most important factor contributing to the distinction between different literary styles was the average shortest path length (in particular, the asymmetry of its distribution). Furthermore, over time there has been a trend toward larger average shortest path lengths, which is correlated with increased syntactic complexity, and a more uniform use of words, reflected in a smaller power-law coefficient for the distribution of word frequency. Changes in literary style were also found to be driven by opposition to earlier writing styles, as revealed by an analysis based on geometrical concepts. The approaches adopted here are generic and may be extended to analyze a number of features of languages and cultures.


Introduction
Many findings related to language and culture have been made by applying statistical methods to large amounts of text [1,2,3,4]. Recent examples are the analysis of millions of books [1] and the study of Twitter messages, where the global variation of mood could be observed through textual analysis of tweets [2]. In several such examples, knowledge is inferred from the analysis of semantic content in the texts. There are also other methods to analyze text, including cases where text is represented as a graph (or network) [5]. Of particular relevance was the finding that networks formed from texts are scale-free [6], and the analysis of their topology has led to various contributions. For instance, the scale-free structure of text networks (which is analogous to the Zipf's law frequency distribution [7]) emerged as a consequence of an optimization process for both hearer and speaker, so that the effort to transmit and obtain a message was minimized [8]. In addition to allowing cultural features to be identified and explored, automatic analysis may be useful for real-world applications, such as automatic text summarization [9], machine translation [10,11], authorship attribution [12], information retrieval [13] and search engines [14].
In this study we used topological metrics of complex networks representing text from 77 books dating from 1590 to 1922 in an attempt to verify changes in writing style. With multivariate statistical analysis of the metrics obtained, we were able to identify periods that correspond to major literary movements. Furthermore, we established which network characteristics were responsible for the changes in writing style.

Pre-Processing
The modeling process starts by removing punctuation and words that convey little semantic content, such as articles and prepositions (see the Supplementary Information (SI)-Sec. 1). The remaining words are then transformed into their canonical form, i.e. nouns and verbs are converted into the singular and infinitive forms, respectively. This step is performed using the MXPOST part-of-speech tagger [15], which assists in resolving ambiguities. The transformation to the canonical form (lemmatization) clusters words referring to the same concept into a single node of the network, despite differences in inflection. Finally, adjacent words in the written text are connected in the network according to the natural reading order (the left word is the source node and the right word is the target node). The pre-processing steps are demonstrated in Table 1, while Fig. 1 illustrates the network obtained from a small extract of the book Great Expectations, by Charles Dickens.
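The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny hypothetical one (the actual list is in SI-Sec. 1), lemmatization with a part-of-speech tagger is omitted, and the function name build_network is invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical, very small stopword list used only for illustration.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "by",
             "and", "or", "with", "for"}


def build_network(text, stopwords=STOPWORDS):
    """Return a directed co-occurrence network as {source: {target: count}}."""
    # Lowercase and keep alphabetic tokens only, which drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove words that convey little semantic content.
    content = [t for t in tokens if t not in stopwords]
    # Assumption: lemmatization is skipped here, so inflected forms
    # of the same word remain separate nodes.
    edges = defaultdict(lambda: defaultdict(int))
    # Link each word (source) to the word that follows it (target).
    for source, target in zip(content, content[1:]):
        edges[source][target] += 1
    return {s: dict(ts) for s, ts in edges.items()}


network = build_network(
    "The boy saw the dog in the garden. A cat saw the dog and the boy."
)
for source, targets in network.items():
    print(source, "->", targets)
```

In this sketch the edge counts are kept as weights, so a repeated word pair such as "saw dog" yields a single edge with count 2 rather than two parallel edges.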

Complex Networks Measurements
Several metrics extracted from the networks were used to quantify the style of the books. From each local measurement (i.e., one that refers to a single node) we derived quantities describing its distribution over the network, in order to quantify the style of whole books. The measurements and their corresponding distribution descriptors were chosen because they have been useful to quantify the style of texts in previous studies [12]. The simplest measurement is the number N of nodes in the network, which corresponds to the size of the vocabulary used to write the piece of text analyzed. The distribution of word frequency was characterized using the coefficient γ of the frequency distribution $p_k$:
$$p_k = c\,k^{-\gamma},$$
where c is a normalization constant (see Fig. 2(a) for an example of the frequency distribution $p_k$ of a specific book). We did not verify explicitly whether the degree obeys a power-law distribution because k is proportional to the frequency of words.
Since the word frequency follows Zipf's law [16,17], the degree is guaranteed to obey a power-law distribution ‡. To compute γ, we employed a technique based on the cumulative distribution of $p_k$ (see Fig. 2(b)) described in Ref. [18]. We also used the frequency of words (or, equivalently, the degree k of the nodes) to calculate the assortativity Γ (or degree-degree correlation) [19,20,21] of the network as:
$$\Gamma = \frac{\frac{1}{M}\sum_{j>i} k_i k_j a_{ij} - \left[\frac{1}{M}\sum_{j>i} \frac{1}{2}(k_i + k_j)\,a_{ij}\right]^2}{\frac{1}{M}\sum_{j>i} \frac{1}{2}(k_i^2 + k_j^2)\,a_{ij} - \left[\frac{1}{M}\sum_{j>i} \frac{1}{2}(k_i + k_j)\,a_{ij}\right]^2},$$
where M = 21,900 § is the number of edges of the network, and $a_{ij} = 1$ if nodes i and j are connected and $a_{ij} = 0$ otherwise. If Γ is positive, highly connected nodes tend to be connected to other highly connected nodes, indicating that there may exist regions where nodes are highly interconnected [19]. Conversely, if Γ is negative, highly connected nodes are commonly connected to nodes with few connections.
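As a rough illustration of the two degree-based measurements, the sketch below estimates γ and Γ in plain Python. Two assumptions are made that do not come from the text: γ is obtained by an ordinary least-squares fit of the cumulative distribution on log-log axes (the procedure of Ref. [18] may differ), and the network is treated as undirected and unweighted for the assortativity. The function names are hypothetical.

```python
import math
from collections import Counter


def powerlaw_exponent(degrees):
    """Estimate gamma in p_k ~ k^(-gamma) from the cumulative distribution.

    Assumption: least-squares fit of log P(K >= k) against log k, whose
    slope is approximately -(gamma - 1). Needs at least two distinct,
    strictly positive degree values.
    """
    n = len(degrees)
    counts = Counter(degrees)
    xs, ys = [], []
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        xs.append(math.log(k))
        ys.append(math.log(remaining / n))
        remaining -= counts[k]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 - slope


def assortativity(edges):
    """Degree-degree correlation for a list of undirected edges (i, j).

    Each edge must appear once. The result is undefined (division by
    zero) when all edge endpoints have the same degree.
    """
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    m = len(edges)
    prod = sum(degree[i] * degree[j] for i, j in edges) / m
    half_sum = sum(0.5 * (degree[i] + degree[j]) for i, j in edges) / m
    half_sq = sum(0.5 * (degree[i] ** 2 + degree[j] ** 2) for i, j in edges) / m
    return (prod - half_sum ** 2) / (half_sq - half_sum ** 2)


# A star graph: one hub linked to three leaves (perfectly disassortative).
star = [(0, 1), (0, 2), (0, 3)]
print(assortativity(star))
print(powerlaw_exponent([1, 1, 2, 4]))
```

The star graph is a convenient check because every edge joins the highest-degree node to a lowest-degree node, which is the extreme case of negative Γ described above.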
In addition to measurements based on the number of nodes of the network and on the degree, the distance between concepts was employed to characterize the structure of the books. This measurement, widely known in network theory as the average shortest path length l [22], is calculated from the distance $d_{ij}$, which represents the minimum cost (minimum number of edges) required to reach node j starting from node i. After computing $d_{ij}$ for all pairs of nodes, the average shortest path length $l_i$ of each node i is:
$$l_i = \frac{1}{N-1}\sum_{j \neq i} d_{ij}.$$
Since $l_i$ is defined for each node individually, the network is characterized by a distribution of $l_i$ (see the distribution of $l_i$ for a specific book in Fig. 2(c)). The distribution was characterized quantitatively by computing the average $\langle l \rangle$ and the standard deviation ∆l. Additionally, we computed the weighted average $l_w \equiv \left(1/\sum_i k_i\right)\sum_i k_i l_i$, so that greater importance was given to the most frequent words in the text. The third moment ς(l) was also computed.
‡ The power-law distribution was verified for all texts of the database.
§ To avoid effects from the size of the books, we used only the first M + 1 words of each book to obtain the complex network.
The last metric was the clustering coefficient C [22], which quantifies the density of connections between the neighbors of a node i according to:
$$C_i = \frac{\sum_{k>j} a_{ij}\,a_{ik}\,a_{jk}}{\sum_{k>j} a_{ij}\,a_{ik}}. \tag{5}$$
The clustering coefficient in equation 5 represents the fraction of the number of triangles among all possible connected sets of three nodes, and therefore $0 \leq C_i \leq 1$. As with the average shortest path length, it is also necessary to characterize the distribution of the measurement quantitatively (see an example of the distribution of C in Fig. 2(d)). We therefore computed the average $\langle C \rangle$, the standard deviation ∆C, the weighted average $C_w \equiv \left(1/\sum_i k_i\right)\sum_i k_i C_i$ and the third moment ς(C) to characterize the distribution.
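The node-level measurements and their distribution descriptors can be illustrated with the following sketch. It is not the authors' implementation: the graph is assumed to be undirected, unweighted and connected, the third moment is computed as the standardized skewness, and all function names are hypothetical.

```python
import math
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search distances (in edges) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj, i):
    """l_i: mean distance from node i to every other node."""
    dist = shortest_path_lengths(adj, i)
    return sum(d for j, d in dist.items() if j != i) / (len(adj) - 1)


def clustering(adj, i):
    """C_i: triangles through i over connected triples centred on i."""
    nbrs = list(adj[i])
    triples = len(nbrs) * (len(nbrs) - 1) // 2
    if triples == 0:
        return 0.0  # convention for nodes with fewer than two neighbours
    triangles = sum(
        1
        for a in range(len(nbrs))
        for b in range(a + 1, len(nbrs))
        if nbrs[b] in adj[nbrs[a]]
    )
    return triangles / triples


def describe(values, weights):
    """Mean, standard deviation, weighted mean and skewness of a metric."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    weighted = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    # Assumption: the third moment is standardized by std**3.
    skew = (sum((x - mean) ** 3 for x in values) / (n * std ** 3)
            if std > 0 else 0.0)
    return mean, std, weighted, skew


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
nodes = sorted(adj)
degrees = [len(adj[v]) for v in nodes]
l_values = [average_path_length(adj, v) for v in nodes]
c_values = [clustering(adj, v) for v in nodes]
print(describe(l_values, degrees))
print(describe(c_values, degrees))
```

Using the degrees as weights mirrors the weighted averages defined above, where more frequent words contribute more to the summary of the distribution.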

Database
The database comprises 77 books available online at the Project Gutenberg repository [23], with publication dates ranging from 1590 to 1922. Tables S1-S3 in (SI)-Sec. 2 give the details of the books. The texts were represented as complex networks [8,9,10,11,24,25,26,27,28,29,30], in which the edges are defined on the basis of co-occurrence of words (see Sec. 2). This procedure has proven suitable to quantify both the style and structure of texts (see e.g. Refs. [11,26,29]). The details of the procedures adopted to model texts as complex networks and a description of the measurements employed to characterize the networks are given in Section 2.

Results and Discussion
The evolution of literary styles was quantified using the 11 measurements from complex networks described in Sec. 2.2 for the books from Project Gutenberg [23]. The main measurements were the shortest path length (l), the clustering coefficient (C), the assortativity (Γ), the power-law coefficient of the degree distribution (γ) and the size of the vocabulary (N). An initial, arbitrary division of the books into 6 intervals of 50 years, according to their publication date, led to the clusters shown in the Canonical Variate Analysis (CVA, see details in (SI)-Sec. 3) plot in Fig. 3. The distinction was relatively poor, especially considering the standard variation ellipses [31] in the inset of the figure. Good separation was only possible when distant periods in time were compared, as their ellipses did not overlap. This difficulty in distinguishing literary movements should perhaps be expected, as there is no reason for sharp transitions to occur merely because half-century marks were reached. We also verified the distinguishability of clusters with Principal Component Analysis (PCA, see (SI)-Sec. 3), but the distinction was also poor. In order to verify whether books from distinct publication dates could be distinguished at all, we adopted a systematic procedure for partitioning the dataset using an optimization approach. This was performed by assessing the quality of the clustering under the condition that books with consecutive publication dates should either belong to the same cluster or lie at the boundaries of consecutive clusters. More specifically, we varied the delimiters and the number of clusters in the database and quantified the quality of the clustering using two indices, viz. the simplified silhouette (SWC) and the Dunn index (DN) (see (SI)-Sec. 4). Good distinction of writing styles was obtained for 3, 4, 5, 6 and 7 clusters (see Figure S1 of the SI), according to the two indices (SWC and DN).
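The partitioning procedure can be illustrated with a small sketch. It assumes the usual centroid-based definition of the simplified silhouette (distance to the own centroid versus distance to the nearest other centroid) and an exhaustive search over delimiter positions; the exact procedure used in the study is described in (SI)-Sec. 4 and may differ. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def simplified_silhouette(points, boundaries):
    """Simplified silhouette of a partition into contiguous clusters.

    points: feature vectors sorted by publication date.
    boundaries: increasing indices where a new cluster starts (0 excluded);
    at least one boundary is required, so there are two or more clusters.
    """
    cuts = [0] + list(boundaries) + [len(points)]
    clusters = [points[a:b] for a, b in zip(cuts, cuts[1:])]
    centroids = [centroid(c) for c in clusters]
    total = 0.0
    for idx, cluster in enumerate(clusters):
        for p in cluster:
            a = math.dist(p, centroids[idx])  # distance to own centroid
            b = min(math.dist(p, c)           # nearest other centroid
                    for j, c in enumerate(centroids) if j != idx)
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def best_partition(points, n_clusters):
    """Try every placement of the delimiters and keep the best one."""
    candidates = combinations(range(1, len(points)), n_clusters - 1)
    return max(candidates, key=lambda b: simplified_silhouette(points, b))


# Toy one-dimensional "books" ordered in time, with an obvious change of style.
points = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
print(best_partition(points, 2))
print(simplified_silhouette(points, [3]))
```

Because the clusters are contiguous slices of the time-ordered list, books with consecutive publication dates always fall in the same cluster or on either side of a delimiter, which is the constraint stated above. Exhaustive search is only practical for small examples like this one.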
The best partition, which was found to be statistically significant (see Figure 4), was obtained with SWC and CVA projection, leading to the 6 clusters in Fig. 5, where there is almost no overlap among clusters, as shown in the inset. Most significantly, the 6 time periods inferred from this analysis coincide with well-established literary movements listed in Table 2. Other important features are inferred from Fig. 5. First, clusters for subsequent time periods are normally placed next to each other, indicating smooth changes in writing style over time. The same conclusion can be inferred from the hierarchical clustering in Fig. 6, obtained with Ward's distance [32]. The exception to this trend was the major change in the 1794-1818 → 1826-1906 transition, which may be the consequence of a drastic change in style triggered by the French Revolution (1789). As for the variance among clusters, the lowest and highest values applied to the 1590-1653 and 1906-1922 periods, respectively. These results are intuitive, as little change in style could be expected in older periods, while in recent periods less uniformity could be the result of the coexistence of many writing styles.
Figure 4. Because the silhouette for the random case, $SWC_{rand}$ = 0.187 ± 0.036, is smaller than the silhouette SWC = 0.558 for the clustering of Figure 5, the inferred clustering is significant. The same applies to the Dunn index because $DN_{rand}$ = 0.059 < DN = 0.207.
The most important factors contributing to the separation of literary styles were determined in two distinct ways. The first technique considered a feature to be relevant if it was capable of providing significant distinction between groups, regardless of the other features. The list of metrics and the corresponding p-values for the difference of a given measurement between pairs of clusters are given in Table 3. The asymmetry in the distribution of the average shortest path length ς(l) and the vocabulary size N exhibited the most significant variations. Interestingly, similar results were reported in Ref. [12], where these two measurements were also useful to characterize personal writing styles. In the second evaluation, a feature was considered relevant if it was able to provide good distinction between groups based on the interdependencies of features. This evaluation was carried out by computing the importance of each measurement for the axes in the CVA plots. The results in Tables 4 and 5 point to the clustering coefficient (C and $C_w$) as the main factor for the distinction into 6 clusters. Since there is evidence that the clustering coefficient quantifies whether words are restricted to specific or generic contexts (an explanation of this property is given in Ref. [12]), it seems that the extent of use of generic or specific words varied throughout history. This change has not been monotonic, as indicated in Fig. 7(a). In fact, most of the network measurements fluctuated over time, including the size of the vocabulary, whose considerable change was responsible for the most drastic transition, 1794-1818 → 1826-1906. This is clearly illustrated in Fig. 7(b). The only metric with a well-defined trend over time was the coefficient of the power law for the scale-free networks representing the texts. The decreasing trend in Fig. 7(c) points to a smoother, and therefore more uniform, frequency distribution, which means that the difference in frequency between low- and high-frequency words decreased with time.
The changes in style between any two consecutive clusters appeared to have been driven by opposition [40] (see Appendix A), which quantifies the extent to which the current period can be thought of as an opposite movement to the previous literary movements. The coefficient satisfies the inequality $W_{ij} > 0$, with the exception of the 1826-1906 → 1909-1922 transition. Furthermore, the opposition movement was more significant than the skewness movement $s_{ij}$ (see Appendix A), which quantifies how much the change in the current style deviates from the opposition movement. The results are given in Table 6. In other words, the innovation of style ($\vec{v}_i$, see definition in Appendix A) was generally driven by contrasting the previous styles ($\vec{a}_i$, see definition in Appendix A). As for the dialectics $\rho_{ijk}$ (see Appendix A), which quantifies how the current movement i is an implication of the two previous movements j and k, no clear pattern could be identified in Table 7. The lowest $\rho_{ijk}$ (and therefore the highest dialectics) appeared during the 19th century. Thus, realism is a literary style that better [...]
Figure 7. Dynamics of (a) average clustering coefficient; (b) vocabulary size; and (c) coefficient of the power law. While the clustering coefficient and the vocabulary size oscillate throughout the periods, the coefficient of the power law tends to decrease, which shows that words were used in a more uniform way in the later periods.
In subsidiary studies we verified that the complex network metrics used are indeed efficient in distinguishing styles. To this end, we examined the writing style dynamics of 10 books ¶ by Charles R. Darwin (1809-1882) and Edith Wharton (1862-1937), whose styles are known to differ considerably. Indeed, this is confirmed in the CVA plot in Fig. 8, where again the factor contributing most to the distinction was the clustering coefficient C, since C and $C_w$ are responsible for 44% of the weights in the first canonical variable axis.
¶ The list of books is shown in Table S3 in (SI)-Sec. 2.

Conclusion and further work
Changes in writing style could be studied objectively by analyzing the metrics from complex networks representing texts from books published over several centuries. Significantly, the most appropriate clustering of books matched the traditional literary classification, with the factor contributing most to distinguishability being the average shortest path length. We found it possible to distinguish literary movements using only the vocabulary size or the asymmetry of the average shortest path length distribution. Innovation in writing style was found to be driven mainly by opposition, with a growing trend of literary development toward counter-dialectics. Interestingly, these findings generalize previous results in which a dependence was established between network topology and the style of machine translations [10,11] and the style of authors [12]. We believe that the approach used here may be useful to study the evolution of any system of interest, since the basic concepts (i.e. characterization through features and use of time series) are completely generic.
As future work, we plan to employ additional complex network measurements in a larger database to verify whether the discrimination can be further improved. We shall also examine the relationship between semantics and topology, by generating clusters using the semantics of words to be compared with the clusters obtained from the analysis of network topology. A more challenging endeavor will be to extend the study to other languages, in order to probe whether the patterns revealed in this paper can be generalized.

Appendix A
Figure 9. Illustration of the quantities employed to define the opposition, skewness and counter-dialectics indices.
The dialectics between three consecutive styles i, j = i + 1 and k = j + 1 = i + 2 in the temporal series was quantified as follows. If $\vec{v}_k$ is the outcome of a synthesis of the styles represented by $\vec{v}_i$ and $\vec{v}_j$, then the distance $d_{ik}$ between $\vec{v}_k$ and the middle line $ML_{ij}$ defined by $\vec{v}_i$ and $\vec{v}_j$ (see Fig. 9(a)) will be small. The counter-dialectics index $\rho_{ik}$ + is defined from this distance. Further details regarding the definition of the opposition $W_{ij}$, skewness $s_{ij}$ and counter-dialectics $\rho_{ik}$ are given in Ref. [40].
+ Note that we refer to $\rho_{ik}$ as the counter-dialectics index instead of the dialectics index because it is defined as a distance. Hence, $\rho_{ik}$ is inversely related to the concept of dialectics.
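The displayed formula for the index is not reproduced in this text. As a sketch only, suppose that $ML_{ij}$ denotes the set of points equidistant from $\vec{v}_i$ and $\vec{v}_j$ (their perpendicular bisector). This reading is an assumption and is not confirmed here. Under that assumption, the distance from $\vec{v}_k$ to $ML_{ij}$ would be the projection of $\vec{v}_k$ minus the midpoint onto the direction joining the two styles:

```latex
d_{ik} = \frac{\left| \left(\vec{v}_j - \vec{v}_i\right) \cdot
         \left(\vec{v}_k - \tfrac{1}{2}\left(\vec{v}_i + \vec{v}_j\right)\right) \right|}
         {\left\lVert \vec{v}_j - \vec{v}_i \right\rVert}
```

With this expression, $d_{ik} = 0$ when $\vec{v}_k$ is equidistant from the two previous styles, and it grows as $\vec{v}_k$ moves toward either of them or beyond. Ref. [40] remains the source for the exact definition.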