Fictionality

The distinction between fiction and non-fiction, between a text that is true and one that is not, is one of the oldest on record. Ever since we have been thinking about the act of narration, we have addressed the related meanings of truth and imagination. This is what Aristotle designated as the difference between the communicative use of language (legein) and its creative use (poiein). For millennia, we have been debating whether there are inherent features of being fictional or whether it is simply a matter of intention, that perhaps there is nothing unique to the language of fictional discourse after all. How do we know when a text is signalling that it is "true" or, by extension, not true? And what might quantity have to tell us about this most elementary of distinctions?

"There is no textual property, syntactical or semantic, that will identify a text as a work of fiction." -John Searle
Consider for example the following two passages:
(A) On the short ferry ride from Buckley Bay to Denman Island, Juliet got out of her car and stood at the front of the boat, in the summer breeze. A woman standing there recognized her, and they began to talk. It is not unusual for people to take a second look at Juliet and wonder where they've seen her before, and sometimes, to remember.
(B) Jeff is 24, tall and fit, with shaggy brown hair and an easy smile. After graduating from Brown three years ago, with an honors degree in history and anthropology, he moved back home to the Boston suburbs and started looking for a job. After several months, he found one, as a sales representative for a small Internet provider. He stays in touch with friends from college by text message and email, and still heads downtown on weekends to hang out at Boston's "Brown bars." "It's kinda like I never left college," he says, with a mixture of resignation and pleasure. "Same friends, same aimlessness."
At first glance, these passages share a good deal in common. Both use single proper names (Jeff / Juliet) and markers of place (Boston / Denman Island). They each use a number of pronouns (her, they / he), an occasional adjective (short ferry ride, summer breeze / shaggy brown hair, easy smile), as well as the past and present tense (stood, began, is / is, started, moved). While the second passage uses dialogue, it is not unreasonable to assume that the first text might at some point, too. And both passages seem to offer some kind of psychological underpinning to the description, whether it is Jeff's malaise as a "sales rep" or the woman who recognizes Juliet in the first passage. Both make claims on our thinking about people and personality.
And yet few readers would have difficulty guessing that passage B is from a work of non-fiction (Michael Kimmel's Guyland) and passage A from a work of fiction (Alice Munro's "Silence"). What is it then that makes this so obvious? We could say, as many in our field do, that fictionality is simply ineffable, that it is a matter of a feeling we get when we read. Who could say what conjures imaginary worlds in readers' minds? Or, on the other hand, we could try to be more precise than my initial description above and quantify as many features as possible about these two passages in order to understand where their salient differences lie at the lexical and syntactic level (Table 1). Looking at these passages in this way, we see not only a broader range of differences between them, but also the strength, and presumably the significance, of these differences. Munro's text now looks more verb-ish than Kimmel's, with more present tense verbs relative to the overall number of words. Her sentences are also somewhat longer, and more pronoun heavy. She uses more articles and prepositions than Kimmel and also vocabularies of insight ("recognize," "remember," "wonder"), tentativeness ("sometimes," "wonder"), human-centeredness ("woman," "people"), and mobility ("car," "ride," "ferry," "boat"). Kimmel, on the other hand, uses many more six-letter words, slightly more commas and periods, more numbers, and a greater vocabulary of affect and work. We might think the literary text would be more affective, but part of Munro's art is submerging feelings so that they are more implicit than explicit ("and wonder where they've seen her before, and sometimes, to remember").
This article is about understanding the differences between fictional and nonfictional texts, the signs that signal to readers when a story is true or not-true.
Rather than look at a single example as I have done above, or even several of them, I will be using a collection of roughly twenty-eight thousand documents, both fictional and non-fictional, to better understand what distinguishes fictional writing from its non-fictional counterpart. Much of my emphasis will fall on the novel as one of the dominant forms of fictional writing from the nineteenth century to the present. Beginning around 1800, when we know the novel began its inexorable quantitative rise to prominence, what makes the novel unique as a form of fictional discourse? 4
Questions about the nature of fictional speech reached something of a high point in the 1970s and early 1980s, with numerous works in the philosophy of language reflecting on the linguistic cues that marked out a text's truth claims. 5 At stake in this endeavor was an attempt to define and thus potentially control for the reliability of language, the ability to distinguish between the truthful and untruthful content of speech. The work of John Searle became a landmark within this movement, providing a framework that was deeply indebted to a theory of speech acts inherited from J.L. Austin.
For Searle and the community of philosophers gathered around him, the differences between fictional and non-fictional discourse did not depend on the actual content of the speech. Instead, they depended on the combined intentionality of the speaker and receiver, what were known as illocutionary and perlocutionary acts. (We might think of these as frameworks for producing and receiving speech.) As Searle writes, "The utterance acts in fiction are indistinguishable from the utterance acts of serious discourse, and it is for that reason that there is no textual property that will identify a stretch of discourse as a work of fiction." 6 For the philosophers of language, fictionality was not a distinct use of language, but depended on the intentions of both writers and readers and the way those intentions were communicated beyond the boundaries of a text.
For literary theorists of about the same time, literature, as a subset of fictional discourse, similarly came to be defined as a textual entity indistinguishable from the larger category of "writing." Searle's and Austin's speech-act theory was used to generate a more general critique of literary essentialism, the belief that there were unique and potentially timeless qualities to works of literature. As Jacques Derrida wrote, explicitly invoking Searle's philosophy, "No exposition, no discursive form is intrinsically or essentially literary before and outside of the function it is assigned or recognized by a right, that is, a specific intentionality inscribed directly on the social body." Derrida would then continue, "This is the hypothesis I would like to test and submit to your discussion. There is no essence or substance of literature: literature is not. It does not exist." 7 For Derrida and much of the poststructural criticism that followed, literature was the product not of a definable set of features, but a social set of intentions, the frameworks of production and reception that underpinned Searle's speech acts. 8 Once Searle's position on discursive statements was translated into literary interpretation more generally, literature was seen as liberatory precisely because it was irreducible to any kind of pattern, habit or idiolect. 9
This article makes a very different claim, one that is based on observing a great many instances in which individuals have engaged in fictional or non-fictional writing over the past two centuries. Seen from this perspective, fictionality emerges as a highly legible category at the level of linguistic content ("lexis" in Aristotelian terminology). Such legibility is what allows us to build predictive models that can identify works of fiction with greater than 95% accuracy and, it should be added, what allows human readers to do the same (as in my initial experiment above).
Contrary to the beliefs of the philosophers of language or different schools of literary critics from poststructuralists to postclassical narratologists, truth claims in language (or their opposite, fictionality) are a highly recognizable linguistic aspect of texts. What appeared to be the case at the level of the sentence or "utterance" (what Searle rather vaguely called a "stretch of discourse"), no longer holds when we observe writing at a different level of scale. 10 Placing all of the emphasis on the reader's activity, whether as cognitive predisposition or interpretive freedom, overlooks the powerful and extensive ways that texts mark themselves for their readers according to their fictional nature.
9 Hardly a thing of the past, this position is now being replayed in the field of "postclassical" narratology, which argues that there are no inherent distinguishing features between fictional and nonfictional narratives. Driven largely by new work in the theory of mind, attention is paid not to the unique features of texts but the cognitive apparatus that is brought to bear on these texts and that is assumed to be common across all kinds of narration. As I will show, not only are fictional narratives highly distinct from non-fictional ones, but their differences are most strongly driven by an attention to sense perception, that is, to a sense of embodiment, making a strict reliance on cognition as narrative's primary framework problematic. For the postclassical narratological position, see J. Alber and M. Fludernik, eds., Postclassical Narratology (Columbus: Ohio State UP, 2010). This new work is largely a response to the "classical" narratological work of Dorrit Cohn, The Distinction of Fiction (Baltimore: JHU Press, 1999) and even earlier Käthe Hamburger, The Logic of Literature (Bloomington: Indiana UP, 1973).
Not only does the research here suggest that fictionality is a highly legible category, but it also appears to have been surprisingly stable for at least two hundred years. While there have undoubtedly been significant changes to the way we tell stories, when we use learning algorithms trained on nineteenth-century texts we can still recognize contemporary novels with an impressive degree of accuracy (about 91%), even if that performance does decrease (history still matters). Indeed, the very features that seem to indicate the uniqueness of novels in the nineteenth century, for example, appear either to be increasing over time or largely holding steady, even among a diverse range of genres into the twentieth and twenty-first century. While it remains an open question as to the extent to which different genres exhibit these features of fictionality to a similar degree, my initial research suggests that there is a surprising degree of commonality across very diverse types of fictional writing. Such continuity has important and still largely unaddressed implications for how we think about both genre and literary periodization. 11
Understanding the legibility of fictionality (the extent to which it marks itself off as a cultural practice) has important implications for understanding our own disciplinary practices. Recent emphasis on the historical embeddedness of creative writing, however valuable, in many ways misses the point of such coherence, the way one of the driving concerns of fiction is to differentiate itself from other kinds of writing. This is not to suggest that fiction is not in some basic sense about the real world, but it does suggest that its center of gravity, what Kleist called a work of art's Schwerpunkt in his essay on the Marionette Theater, is located somewhere else. What I ultimately hope to better understand here then is this center of gravity, the ways in which fiction distinguishes itself as a kind of writing.
As we will see, fiction's stability, and the novel's in particular, appears to be based on what we might call "phenomenological investment." 12 The particular nature of the novel's contribution to fictional discourse in the nineteenth century (and beyond) has been its concern not simply with the world around us, but our perceptual encounter with that world, one that includes a great deal of skepticism, prevarication, and negation. To experience the world in the novel is first and foremost to doubt.
To think about fictionality and the novel in these terms inevitably puts pressure on some of the more common scholarly refrains of the recent past. Longstanding questions about the novel's "realism," i.e. the extent to which and the means through which a novel reproduces a given environment, give way in this view to the novel's concerns with a "dramatization of encounter," both with others and with the world, indeed, with otherness more broadly speaking. Rather than a genre based on its relationship to a knowledge of things (thing-theory serving here as a translation of the rise of realism into new terms), the novel, judged by its quantitatively meaningful components, appears far more self-referential in nature, offering us access to the knowledge of knowing. It renders explicit an experience of the hypothetical, a testing relationship to the world. 13 While this may have traditionally been how we have thought about a small subset of modernist experiments, it is significant that this insight holds even across the most canonical "realist" novels of the nineteenth century.
This article thus suggests that certain canonical positions within the history of novel scholarship need to be rethought or at least subject to revision in light of an emerging computational understanding of the novel. Whether it is Catherine Gallagher's argument about the novel's ambiguous relationship to its own fictionality; the poststructural investment in literature's negativity, as when critics speak of the novel as the "genre of no genre"; thing-theory's emphasis on what Elaine Freedgood has called the "denotative, literal, and technical" language of the novel; or Ian Watt's still-influential position on the novel's referentiality, as when he writes, "It would appear then that the function of language is much more largely referential in the novel than in other literary forms"; computation presents a very different portrait of the novel's importance as a type of fictional discourse. 14 The point is not that these positions are unfounded: it is certain that for some novels these ways of being may indeed be predominant, just as it is certain that for many novels these ways of being may be operative some of the time.
But if we try to understand what makes the novel stand out from other types of ostensibly true writing or even other types of fictional texts, if we try to generalize about the novel as a genre, then at least since the turn of the nineteenth century we are seeing something altogether different at stake. According to the research I will present here, the novel's mattering is not primarily grounded in its positive representation of the world, that is, in its mimetic utility, its ability to simulate something (as in seventeenth-century debates about vraisemblance). Nor is it grounded in a kind of post-structural negativity, in which the novel is unrecognizable as a distinct and stable category, a reflection of literature's more general negative capability. Rather, the novel can best be described through its investment in the negation of the certainty of its own worldliness. It is grounded in an appeal to encounter rather than reality. In doing so, it is precisely the referentiality of language that is being bracketed in the novel, not ambiguously, but programmatically, even in novels that are widely considered to be the most realistic.

Prediction and Description
This article will be using a combination of what are called predictive and descriptive methods. Predictive models, such as those employed in the process of machine learning, are important because they allow us to engage in the process of classification, of what it means to define a group of texts as a coherent entity and to understand the degree of coherence according to certain predefined conditions. 15 Predictive models allow us to say with how much certainty we can identify texts that belong to a specific group and under what criteria. The more certainty there is, the more cohesive the category is thought to be.
Descriptive models, on the other hand, are useful because they allow us to qualify distinctive features of one group when compared with another without engaging in the act of classification. They can tell us which features are distinctive of one group versus another, but they do not do so in order to make claims about the overall uniqueness of that group. Instead of defining a text or group of texts (the novel is X or the novel is this predictable), these qualities help describe the behavior of a group according to more individualized criteria. This too is valuable because it allows us to understand the components that make one group different from another but that do not necessarily lead to categorical difference. Explaining predictions, that is, how a computer arrived at an estimate about which class a text belongs to, is quite challenging. Explaining individual differences is far more straightforward. It is their combination, I would argue, that allows us to think both categorically, about the relative coherence of writing under certain conditions, as well as qualitatively about the specific aspects of writing regardless of drawing definitive boundaries around things. Description is in many ways much closer to the traditional task of literary criticism than prediction.
The data that I will be using for this article has been selected to understand the nature of fictionality across different types of writing. 16 The aim is to see if the results here hold across time, different languages, and different sample sizes. Overall, the data consist of roughly 28,000 documents, dating from the late eighteenth century to the early twenty-first, written in both English and German. The collections contain different kinds of fictional and non-fictional writing, including novels, histories, philosophy, advice books, novellas, fairy tales, and classical epics translated into prose among other kinds of writing (though not including encyclopedias or cookbooks). Together, the texts can be grouped into four principal categories.
The first collection represents a canonical set of nineteenth-century writing of 600 documents in both German and English curated by my lab. These include the best-known novels from the period as well as well-known non-fiction, including philosophy, essays and histories. These texts have been sufficiently cleaned so that they are subject to minimal transcription errors and also broken out by point of view so that we can control, for example, for third person novels when comparing to historical narratives.
The second collection is a much larger sample of nineteenth-century writing in English consisting of 21,158 documents, both fictional and non-fictional, that are drawn from Ted Underwood's research using the Hathi Trust archives. 17 This group, whose contents are much less well understood, allows us to test our results between a canonical subset and a much broader group of writing from the same period.
The third component consists of a collection of roughly 6,500 novels drawn from both the Stanford Literary Lab's nineteenth-century novel collection and the Chicago Text Lab's collection of twentieth-century novels. Together, these collections allow us to examine diachronic shifts in the novel's vocabulary. Finally, the fourth component consists of a collection of 800 contemporary novels and non-fiction published within the past decade curated by my lab. 18 This gives us some traction on the extent to which the effects we are seeing in the past continue to hold in the present. Table 2 provides an overview of the different components that will be used and the respective number of documents.
17 The designations of fiction and non-fiction are derived from Ted Underwood's collection, which is located here: https://dx.doi.org/10.6084/m9.figshare.1281251.v1. All duplicate titles were removed, all documents with the word stems for "essays," "tales," "scenes" or "stories" in either the genre or title fields were removed, and only those works with an 89% or better chance of containing more than 80% pages of fiction were chosen.
18 Similar to the .txtLAB nineteenth-century collection, the documents here represent a canonical representation of fiction and non-fiction of the past ten years, meaning they have passed through some kind of filter, whether it is the New York Times Book Review, a literary prize competition short list, or appeared on various bestseller lists of platforms like Amazon.com or the New York Times. For a more thorough review of the data set and its insights into contemporary forms of social value surrounding the novel, see Andrew Piper and Eva Portelance, "How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading," Post-45 05.10.2016: http://post45.research.yale.edu/2016/05/how-cultural-capital-works-prizewinningnovels-bestsellers-and-the-time-of-reading/.
The features that I will be exploring will be drawn largely from the LIWC software designed by James Pennebaker. For those not familiar with LIWC, it is a tool developed in the social sciences for studying large text collections. It consists of eighty different features that range from the identification of syntactic and grammatical features like the use of punctuation, prepositions, verb tense, and pronouns to higher-level cognitive phenomena like social, perceptual, or emotional processes. 19 These are dictionaries that have been tested and validated on human subjects, the results of which are available for review. 20 As with all lexicon-based approaches, there are open questions as to the semantic coherence of a given category. Are all of the words in the "insight" dictionary really indicative of moments of cognitive insight in novels? Or do all instances of "I" mean the same thing? 21 The interpretation of these results thus needs to be handled with a good deal of caution. In particular, care needs to be taken in how the categories are understood as categories, with emphasis given to assessing the semantic coherence of these categories within the novels.
When we drill down into the individual features, what do we find? As we will see, much of my emphasis will be on the less semantically ambiguous categories, from punctuation and pronouns to verbs of sensory experience or cognitive prevarication.
At the same time, it is also important to emphasize the benefits of such lexicon-building approaches for the computational study of literature. Unlike the problems faced by topic modeling or word embeddings, where collections of words are discovered and labels applied after the fact, here we start with prior assumptions about linguistic categories and test their presence within given text collections.
The externality of the words from the collections they are meant to study allows one to test beliefs independently of the collections themselves. While neither approach is perfect, in both cases we are moving between individual words and the ideas those words are thought to embody. What is ultimately at stake is the confidence of a model to approximate some underlying textual phenomenon. 22 The key is making as explicit as possible how we move between these different levels of analytical scale, that is, how we connect the conceptual, the lexical, and the theoretical.
LIWC can thus give us an initial range of intuitively meaningful interpretive categories to build on as well as the lexica upon which those categories may in part be based. One of its principal advantages is that the feature sets can be shared even on proprietary data, as I have done here. Nevertheless, the categories should not be taken at face value, but looked into, as with all semantic fields. Because the dictionaries are transparent in LIWC, users can refine or alter the dictionaries as they see fit, as I have done on occasion here. 23 They can also be combined with other, more customized features, as I have also done here, for example by looking at word classes using a tool like WordNet. While future work will want to continue to expand and refine these kinds of feature sets, the LIWC collection can serve as a useful starting point for any supervised approach to understanding the quantitative dimensions of texts.
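To make the lexicon-based logic concrete, here is a minimal sketch in Python of how a dictionary feature can be computed as a percentage of all words in a text. It is an illustration only: the two-category lexicon, the tokenizer, and all function names are assumptions made for this example, not the LIWC dictionaries or the LIWC software. The one convention borrowed from LIWC is that an entry ending in "*" stands for a word stem.

```python
import re

# Hypothetical mini-lexicon for illustration; NOT the LIWC dictionaries.
# An entry ending in "*" is treated as a stem that matches any continuation.
TOY_LEXICON = {
    "insight": ["recogni*", "remember*", "wonder*"],
    "motion": ["car", "ride", "ferry", "boat"],
}

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def matches(token, entry):
    """True if the token equals the entry, or begins with it when the entry is a stem."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry

def category_percentages(text, lexicon):
    """Percentage of all tokens that fall into each category of the lexicon."""
    tokens = tokenize(text)
    if not tokens:
        return {category: 0.0 for category in lexicon}
    result = {}
    for category, entries in lexicon.items():
        hits = sum(1 for tok in tokens if any(matches(tok, e) for e in entries))
        result[category] = 100.0 * hits / len(tokens)
    return result

# An illustrative sample text of eighteen words.
sample = ("A woman standing there recognized her, and they began to talk. "
          "People wonder where they've seen her before.")
scores = category_percentages(sample, TOY_LEXICON)
```

Because the lexicon is an ordinary dictionary, refining a category amounts to editing its word list and re-running the count, which is what makes this kind of feature easy to inspect and to share without sharing the underlying texts.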

The Coherence of Fictionality
The question that I want to begin with is, How coherent is fiction as a type of writing? Are there indeed no syntactic or semantic properties, as Searle contends, that allow us to predict whether something is intended fictionally? Is fictionality exclusively a function of communicative context, the intentionality of the writer and the belief-system of the reader? Or are there features that appear with a high degree of regularity in fictional texts that do not appear in non-fiction such that even a computer can make accurate guesses as to the nature of the text?
22 On modeling and computational hermeneutics, see Andrew Piper, "Novel Devotions: Conversional Reading, Computational Modeling and the Modern Novel," New Literary History 46.1 (2015): 63-98.
23 All dictionaries and modifications are included in the dataset released with this article:
In order to answer these questions, I will use the process known as machine learning to see how accurately a computer can predict a text's given class (I will be using the learning algorithm known as a support vector machine (SVM), which is applied in many text classification scenarios). 24 For those not familiar with this process, a learning algorithm is "trained" on features found in a set of documents for which the classes are already known and then asked to predict the class of texts that it has not seen. In this case, I train the algorithm using the LIWC features discovered in a given set of documents and use a process of 10-fold cross-validation to predict whether a document is a work of fiction or non-fiction. What this means in practice is that I randomly divide the corpus into ten folds and rotate through them, so that in each of ten rounds 90% of the documents are used to train the algorithm and the unseen 10% are used to test its reliability. (The folds function in the kernlab package ensures that the folds are equally balanced between the two categories.) Doing this 10 times allows us to gain a full view of all of the documents in the collection as each document has an opportunity to be in the test set. Table 3 presents the results of this experiment, showing which two data sets were compared and the average accuracy score of the predictions on the unseen data.
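The cross-validation procedure can be sketched as follows. This is a hedged illustration in Python rather than the pipeline used for the article: to keep it runnable with the standard library alone, a simple nearest-centroid classifier stands in for the support vector machine, the two-feature data are synthetic and made up, and every function name is an assumption of this example. What it is meant to show is the fold logic: class-balanced folds, training on nine of them, testing on the tenth, and averaging accuracy over the ten rounds.

```python
import random
from statistics import mean

def stratified_folds(labels, k=10, seed=0):
    """Assign every document index to one of k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]

def predict(x, centroids):
    """Label of the nearest class centroid (squared Euclidean distance).

    Stand-in classifier only; the article reports using an SVM instead.
    """
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

def cross_validate(X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds; each document is tested exactly once."""
    accuracies = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        centroids = {
            cls: centroid([X[i] for i in train_idx if y[i] == cls])
            for cls in set(y[i] for i in train_idx)
        }
        correct = sum(predict(X[i], centroids) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return mean(accuracies)

# Synthetic, made-up data: two feature percentages per "document".
rng = random.Random(42)
X = [[rng.gauss(10, 1), rng.gauss(0.45, 0.1)] for _ in range(50)] + \
    [[rng.gauss(4.5, 1), rng.gauss(0.05, 0.1)] for _ in range(50)]
y = ["fiction"] * 50 + ["nonfiction"] * 50
folds = stratified_folds(y)
accuracy = cross_validate(X, y)
```

Stratifying the folds matters most when the two classes are unequal in size, since an unbalanced test fold would make the per-fold accuracies hard to compare.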
As we can see in Table 3, not only are the differences between fiction and non-fiction robust across time and languages, but we can use models built in one time period to strongly predict those of another. While there is a clear drop in performance when we use nineteenth-century models to predict twenty-first-century novels, we can still see a relatively high degree of performance at work here (around 91% accuracy). There appears to be a notable degree of diachronic stability to fictional discourse over the past two centuries. Indeed, as we will see below, when we examine features that are more indicative of novelistic writing in particular as a subset of fictional discourse, we generally see these features increase over time. The trans-temporal stability of the novel is complemented by an increase in certain types of novel-specific vocabulary that can be traced back to the nineteenth century.

The Phenomenology of the Novel
If fiction is so predictable, what are the features or combinations of features that make it so? While there are many ways one might want to approach this question, here I rank features by the increase of their median frequency in fiction versus non-fiction and use a non-parametric Wilcoxon Rank Sum Test to indicate statistical significance. This approach has two advantages. First, comparing medians preserves information about the overall distribution of a feature in a given sample. Rather than lump all works of fiction together into a single bin, where some works may have significantly higher amounts of a feature than others and thus skew the results in their favor, the median identifies the mid-point of a given category. Second, where some significance tests are designed to discount lower-frequency features, this test does not make that assumption. It looks exclusively at the ratio between the two populations to understand how much more frequent a given feature is relative to its overall occurrence. Low-occurring features that increase a great deal are thus considered more important than high-occurring features that increase by a smaller amount. 25
25 It is entirely possible to reverse this assumption and privilege features that are more prevalent overall, something that becomes valuable when assessing individual words. This is the approach I use in Table 11 for example. Individual words can have such low frequencies that favoring words with higher counts assures that one is finding more "important" or perhaps "relevant" vocabulary, i.e. less random vocabulary. The crucial point is that the outcomes are determined by the initial assumptions used in the model.
To begin, let me review the overall structure of the tables used here to better understand what they can and cannot tell us (Table 4). The leftmost column ("Feature") lists the features as defined by LIWC. Some are extremely straightforward ("exclamation" refers to the percentage of exclamation marks), while others are more nuanced. "Family," for example, refers to a dictionary of words all related to family members, while "social" relates to words having to do with social experience, which for example can include pronouns (a choice that effectively duplicates the pronouns categories because they are so much more common than other words). The former is arguably much more straightforward than the latter and thus we need to be cautious when we encounter a dictionary that is more semantically ambiguous (though even a single word like "you" may have different kinds of functions in novels). The second column ("Category") lists the category to which the feature belongs, a slightly more general framework for understanding the individual features. The next two columns (Fiction %, Non-Fiction %) present the median frequency of that feature in each corpus as a percentage of all words. This allows us to see which features are more prevalent relative to other features.
Because percentages are somewhat opaque in terms of a reader's experience, I will generally be translating these numbers into page and work equivalents in the discussion that follows. This allows us to imagine our way into a reader's experience and surmise which features occupy more of a reader's attention. Exclamation marks, for example, comprise on average about 0.45% of a given work of fiction in the nineteenth century. If we assume an average novel length of about 100,000 words (or 500 words per page across 200 pages), this means that there is one exclamation mark for about every 200 words, or 2-3 per page, or roughly 500 total per novel. Personal pronouns, on the other hand, occur about 10% of the time in fiction, which means once every 10 words, or 50 times per page (and 10,000 times per novel).
The fifth column, "Ratio," lets us see how much more prevalent the feature is in one collection than in the other. Exclamation marks appear almost ten times as often in fiction as in non-fiction. This is a massive difference, but we are still only talking about something that occurs relatively infrequently compared to other features. Personal pronouns, on the other hand, appear only a little more than twice as often in fiction (still a very large difference), but this increase is based on a much larger linguistic aspect of texts. Two times as many pronouns means roughly 5,000 more pronouns per work or about 25 more per page. While I privilege ratio here in my interpretation of the results, we will want to keep our eye on both of these aspects, from the overall prevalence of the feature to the relative increase from one population to another.
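The page and work equivalents used in the two preceding paragraphs follow from simple arithmetic. The sketch below reproduces it under the stated assumption of 500 words per page and 200 pages per novel; the function names are hypothetical and chosen only for this illustration.

```python
WORDS_PER_PAGE = 500    # assumption stated in the text
PAGES_PER_NOVEL = 200   # 500 * 200 = 100,000 words per novel


def equivalents(percent_of_words):
    """Translate a feature's percentage (must be > 0) into reader-facing counts."""
    share = percent_of_words / 100
    per_page = share * WORDS_PER_PAGE
    per_novel = per_page * PAGES_PER_NOVEL
    words_between = 1 / share            # average gap between occurrences
    return per_page, per_novel, words_between


def extra_from_ratio(fiction_percent, ratio):
    """Additional occurrences in fiction, given the fiction/non-fiction ratio."""
    nonfiction_percent = fiction_percent / ratio
    page_f, novel_f, _ = equivalents(fiction_percent)
    page_n, novel_n, _ = equivalents(nonfiction_percent)
    return page_f - page_n, novel_f - novel_n


for name, pct in [("exclamation marks", 0.45), ("personal pronouns", 10.0)]:
    per_page, per_novel, gap = equivalents(pct)
    print(f"{name}: {per_page:.2f} per page, {per_novel:.0f} per novel, "
          f"one every {gap:.0f} words")

extra_page, extra_novel = extra_from_ratio(10.0, 2.0)
print(f"pronouns at a 2x ratio: {extra_page:.0f} more per page, "
      f"{extra_novel:.0f} more per novel")
```

For exclamation marks the exact figures are 2.25 per page and 450 per novel, which the prose rounds to "2-3 per page" and "roughly 500."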
Beginning with the baseline comparison of fiction and non-fiction writing using both our canonical sample and the larger collection of Hathi Trust writings from the nineteenth century (Table 4), we see how the features that are most indicative of fictionality are driven by dialogue: exclamation marks, question marks, quotation marks, first and second person pronouns like "I" and "you," assent words like "yes," "okay," and "oh," and finally the word "said" (which is labeled as an auditory verb by LIWC). Importantly, we also see very strong alignment between the nineteenth-century sample and the larger population of Hathi Trust documents, with some notable exceptions around the "social" category and potentially "family," "home," and "ingestion." If we compare these groups directly, we see that only "family" and "ingestion" are somewhat inflated in the canonical sample (by about 10-15%). 26 In other words, while there are interesting variations that are worth exploring, on the whole, the smaller sample does a good job of capturing the same information as the larger collection. Taken together, these features suggest a relatively unambiguous way in which fictional writing has a uniquely dialogical construction when compared with non-fiction. While this may not be "news," it does help us build a taxonomy of the distinctions that make this kind of writing socially significant. Imagining people talking to each other appears to be one of fiction's primary cultural functions. Indeed, imagining people as people may be fiction's most important role. If we remove dialogue from the sets above, including the pronominal expressions that accompany them (she said, he cried, etc.), 27 we see how third-person pronouns emerge as one of the strongest indicators of fictionality along with references to family members and bodies (Table 5).
There is more than a three-fold increase in the average number of she/he pronouns in fiction versus non-fiction outside of dialogue, with just these two words alone accounting for more than five percent of all words in the text (or roughly 5,000 instances for a medium-length novel).
This is especially remarkable considering that on average works of history, for example, use considerably more proper names than works of fiction (by estimate, more than twice as many). 28 The lower number of people in fiction is compensated for by a more expanded durational existence on the page for which pronouns become key linguistic markers. People seem to have more extended identities in fiction, though this is not necessarily to be confused with a more "expansive" identity, i.e., one that is more semantically rich. The pronominal frequency of characters is not the same as the linguistic diversity surrounding these characters. Nevertheless, this gives us a first indication of the ways in which fiction performs the process of identification as a repetitive and extensive act of naming the same person.
The prevalence of family and friend vocabulary in fiction also suggests what type of people are more distinctive of the genre, just as the setting of home gives us an idea of where they are most active. Broadly speaking, when we read fiction in the nineteenth century, what is novel, i.e. different from other kinds of texts that purport to be about real things, is a focus on family and the familiar. Travel, adventure, work: these can be experienced elsewhere in ways that documentation of family life and the extended agency of individuals cannot. The stakes of this attention will become even clearer when we focus on a particular type of fiction (novels with external narrators) and a particular type of non-fiction (history writing) across both German and English text collections as well as across historical and contemporary data sets (Tables 6-8). What rises to the top here are a host of perceptual categories (seeing, hearing, feeling) that construct the phenomenological reality of an experiencing individual. And the greater prevalence of body words gives us an indication of where that attention most often lies. It is knowledge, not just of otherness, but of another embodied individual that most consistently frames the epistemological horizon of the novel from a quantitative point of view. This result poses an interesting challenge for "theory of mind" approaches that argue that fiction's primary purpose is the enactment of another human consciousness. 29 While we will see an area where this hypothesis does make sense in the next test, in terms of understanding the novel's distinctive fictional qualities the mind-body distinction that underlies theory of mind models does not hold up well in light of the novel's strong emphasis on sensorial input and embodied entities. The sensual experience of a sensing being: this is what appears to be uniquely reiterated in the imaginative work of novelistic writing when compared to its non-fictional counterpart.
This data set is meant to represent a range of prose fiction that would have been widely read and known to nineteenth-century anglophone readers but would not have been considered a "novel." While the material dates from different epochs, the publications (and translations) are all contemporaneous with the period as a whole.
Three interesting features initially stand out in this table. First, the ratios are much lower when compared with non-fiction. While these groups are similarly well-differentiated when compared to non-fiction, when compared to each other the overall distinctiveness drops considerably. If we run the same classifier as above, we can predict novels with about 68% accuracy, which is close to the threshold of statistical significance (p=0.018). If we use a slightly larger collection of novels from the Hathi Trust collection (428 to mirror "other fiction"), accuracy will increase slightly to 74% (p=7.23e-05). This is still considerably lower for example than the ability to predict novels from different genres. As Ted Underwood has shown, it is possible to predict detective fiction and science fiction across a 150-year span with between 88-90% accuracy. 30 The broad category of "other fiction" then is not highly differentiated from novels as a subset of fiction.
Second, while we see some of our more familiar linguistic fictional markers such as pronouns and dialogue, we also see a new feature in the category of verbs.
There are more verbs overall as well as more varied tenses (past, future, present, in addition to auxiliary verbs). In other words, there appears to be greater temporal complexity to novels than can be found in fiction more generally. While this deserves its own study, it suggests an initial insight into one of the key ways that novels differentiate themselves from other kinds of imaginary writing in the nineteenth century. 31 Finally, we also see a new category emerge here that we have not seen before, one that falls under the heading of "cognitive process." These are the dictionaries that LIWC labels "discrepancy," "negation," "tentativeness," and "insight." If we examine the words in those dictionaries that are most distinctive of novels (and here I rank by log-likelihood ratio), we can see the extent to which these are words that tend to mark out moments of self-reflection, doubt, and hesitation, a kind of testing-relationship to the world. 32 Modal verbs in particular are extremely prevalent here (could, would, must, might, and should, as well as their negative contractions), and so too is the act of negation more generally (don't, can't, didn't, not, never, nobody). As the presence of "if" suggests, these groups offer different ways of expressing conditionality or even impossibility. At the same time, indefinite words such as something, somebody, anything and anybody are more prevalent, along with a more specific vocabulary of hesitation (perhaps, chance, hope, possibly, guess, maybe, doubt, uncertain). In between the conditional and the impossible language of the novel, there lies a considerable amount of potentiality: chance, but also skepticism. 33
Finally, we see how novels are marked by a much stronger use of mental states, captured in major verbs such as know, feel, think, remember, and believe, along with a second layer of less frequent, but similarly distinctive complex cognitive verbs such as admit, ponder, imagine, and forgive (the latter not shown). This is the ground of the novel's reflectiveness, that which binds together doubt and conditionality into a consistent mental state. Indeed, the combination of seem and feel, both of which appear 30% more often in the novel, gives us a particular indication of what I am calling the novel's phenomenological orientation. Not the world itself, but a person's encounter with and reflection upon that world, the world's feltness, is what marks out the unique terrain of novelistic discourse when compared with other forms of classical fiction. It is this combination of sense perception plus cognitive skepticism that seems to bring out the novel's contribution to fictional discourse. The novel professes its uniqueness in the way it offers extended reading experiences of the human assessment of the world's givenness. 34
[...] cycles-of-genres/.
31 This fiction/novel distinction that is translated across the axis of time will recapitulate itself in the genre distinctions of the contemporary novel as they relate to social value. One of the strongest ways bestselling and prizewinning novels in the present differentiate is across the feature of nostalgia and retrospection. See Piper and Portelance, "How Cultural Capital Works."
32 Results are contained in the [...]
34 If we just condition on these four features, the classification results will outperform the average accuracy of four features chosen at random by a statistically significant margin, though to be sure the numbers are considerably lower than when we use all 80 features, and there are other combinations that will perform slightly better. Those feature combinations that do perform better most often contain present tense verbs and categories of sense perception, pointing to the other ways discussed here through which novels are unique. All Features = 76%, Prevarication Features = 63.4%, Random4 (100 trials) = 60.8% ± 3.8%.

The Question of the Novel's Realism
If we can agree that one of the ways the novel distinguishes itself as a genre is through a more intensive attention to a phenomenological vocabulary, the question arises as to whether such attention is also accompanied by a greater degree of world-focusedness, that is, more attention to reality or what I would more abstractly describe as "givenness" following Lukacs. Are the phenomenological and the realistic mutually exclusive of one another or mutually constitutive? Can we test, in other words, the longstanding hypothesis of the novel's heightened realism?
While these are questions that deserve their own study, I offer two tests here that attempt to gain some insight into the validity of the novel's realistic tendencies. In my first test, I explore the novel's relative attention to abstraction versus physical entities. To do so, I compare a subset of nineteenth-century novels from the Chadwyck Healey collection (about 700 novels) with my "other fiction" from the Hathi Trust and translate these collections into their respective hypernym trees using Wordnet. 35 Hypernyms provide higher-order classifications of nouns ("furniture" is a hypernym of "chair, " for example) and can allow us to see whether particular categories are more present than others in a given corpus, much in the way LIWC does for emotional and psychological processes. So for example, if the word "marsh" appears in a novel, this would be translated into "wetland, " "land, " "ground, " "soil, " "object, " "physical_object, " "physical_entity, " and "entity. " All nouns in this case are "entities, " while their first order of distinction is between those nouns that are "physical, " like a marsh, and those that are abstract, like "death. " The question that this model allows us to pose is whether novels exhibit significantly higher amounts of physical entities when compared with other kinds of classical fiction (or histories for that matter). By doing so, we can gain confidence as to whether the novel's uniqueness is tied to its physical objectivity, one potential way of understanding its degree of realism, or whether it hinges more on abstract mental or emotional states.
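The translation of nouns into hypernym chains, and the physical/abstract split that the first test relies on, can be illustrated with a small sketch. The taxonomy below is a hand-built toy that abbreviates the "marsh" example; it is not WordNet itself, and the function names are hypothetical. A real analysis would query WordNet for each noun's hypernym path instead.

```python
# Toy child -> parent map; an abbreviated stand-in for a real taxonomy.
TOY_PARENT = {
    "marsh": "wetland",
    "wetland": "land",
    "land": "object",
    "chair": "furniture",
    "furniture": "object",
    "object": "physical_entity",
    "death": "event",
    "event": "abstraction",
    "physical_entity": "entity",
    "abstraction": "entity",
}


def hypernym_chain(noun):
    """Walk upward from a noun to the root, collecting each hypernym."""
    chain = []
    while noun in TOY_PARENT:
        noun = TOY_PARENT[noun]
        chain.append(noun)
    return chain


def physical_share(nouns):
    """Fraction of known noun tokens whose chain passes through physical_entity."""
    known = [n for n in nouns if n in TOY_PARENT]
    if not known:
        return 0.0
    physical = sum("physical_entity" in hypernym_chain(n) for n in known)
    return physical / len(known)


print(hypernym_chain("marsh"))
print(physical_share(["marsh", "chair", "death", "marsh"]))
```

Comparing this share across two corpora is one way to operationalize the question of whether novels contain relatively more physical entities than other fiction; how ambiguous nouns with several senses are handled is a separate modelling choice that the sketch leaves out.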
The second test attempts to understand realism as a greater amount of specificity. The more specific a text is, the more focused it is on the world around it. To test this, I compare the percentage of words in one text that are hypernyms of another text's words (and vice versa). The more words from one group that are hypernyms of words from another, the more abstract that first group can be said to be and the more specific the second group. For example, if I use the word "marsh" and you use the word "land, " then my language can be said to be more specific than yours (and yours more abstract than mine). Unlike in my first test, where having more objects in a text is a way of thinking about an attention to realism, here the emphasis is on measuring a greater degree of specificity as a marker of the real.
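The specificity measure can be sketched in the same spirit. The code below assumes that the measure counts word tokens in one text that are hypernyms (at any depth) of some word in the other text; whether the original analysis counts tokens or distinct words is not stated, so this is an assumption. The taxonomy is again a hand-built toy and the names are hypothetical.

```python
# Toy child -> parent map used only for this illustration.
TOY_PARENT = {
    "marsh": "wetland",
    "wetland": "land",
    "heron": "bird",
    "bird": "animal",
    "reed": "plant",
}


def ancestors(word, parent):
    """All hypernyms above a word in the toy taxonomy."""
    seen = set()
    while word in parent and parent[word] not in seen:
        word = parent[word]
        seen.add(word)
    return seen


def hypernym_share(text_a, text_b, parent):
    """Share of tokens in text_a that are hypernyms of some word in text_b."""
    if not text_a:
        return 0.0
    above_b = set()
    for word in set(text_b):
        above_b |= ancestors(word, parent)
    return sum(word in above_b for word in text_a) / len(text_a)


mine = ["marsh", "heron", "marsh", "reed"]
yours = ["land", "bird", "land", "plant"]

# A higher share means the first text is the more abstract of the pair.
print(hypernym_share(yours, mine, TOY_PARENT))
print(hypernym_share(mine, yours, TOY_PARENT))
```

Running the measure in both directions, as the text describes, is what makes it relational: a text is abstract or specific only with respect to the text it is compared against.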
The results suggest that the novel's relationship to its concreteness, when measured in this way, has indeed been changing over the course of the nineteenth century (Table 11). 36 If we look at the first half of the nineteenth century, we see how there is a greater degree of abstraction relative to physical objects when compared to classical fiction and tales, but that this difference disappears by the second half of the century. As Ryan Heuser and Long Le Khac have argued, the British novel experiences a decline of valuation and a rise of concretization over the course of the century. 37 And yet an important caveat to that finding is that while abstraction appears to be declining in the novel, it never drops below other kinds of fictional discourse from the period. Far from the Victorian novel being uniquely concrete, it is the early-nineteenth-century novel that looks uniquely abstract when compared to other kinds of fictional discourse from the period. It suggests that we have been potentially telling this story in reverse: what matters in the nineteenth century is not the later rise of concreteness, which looks more like other types of fictional writing, but the earlier abstractness of the novel, which stands out relative to other types of fiction (not to mention abstraction remains considerably more important to these texts overall than their physicality).
The other part of the table, that which concerns the novel's specificity, tells this same story in the other direction. Where the first half of the century witnessed little significant difference in the degree of specificity between the novel and other kinds of fiction (about a 0.003% difference), by the second half of the century, the novel has about 0.5% more specific words than other kinds of fiction (or about 2 words/page). 38 In other words, the novel approaches other kinds of fiction in its degree of abstraction while it departs from other kinds of fiction in its degree of specificity. It begins the century uniquely abstract and finishes uniquely specific.
There are of course numerous other ways we might think about the novel's realism. But these initial results suggest that the Wattian thesis about the novel's realism, understood here as a greater degree of both physicality and specificity, appears, in the first instance, less as a story of exceptionality and more as a regression to the norms of fictional discourse more generally. In the second instance, where the novel does become exceptional in its specificity, this occurs much later than has been traditionally argued. The novel's earlier quantitative rise in the late eighteenth and early nineteenth century appears to be marked instead by a higher degree of conceptual sophistication and generality, the novel's love of abstraction, far more than its presentation of the world around it. 39 As Matthew Erlin has argued, there is a philosophical dimension to the novel that is an important part of its history that we have so far overlooked. 40 The scholarly emphasis on the realist novel's concretization has missed one of the primary ways through which novels have historically mattered as a form of writing, i.e. in their abstractness.
38 One of the reasons for the extremely low p-value here is the high number of observations. Given that we are comparing every novel to every other novel across groups, we get over 13,000 observations. Even small differences can look significant. The important point is the actual difference in this case, which changes from 0.003% to 0.5%.
39 This understanding of the novel's abstractness as key to its earlier distinctiveness offers a different way of thinking about a critique of the realist hypothesis than, say, Wayne C. Booth. For Booth, the novel's significance, or for him the "good" novel's significance, is its ability to focus on dramatic intensity rather than realism. "The interest in realism is not a 'theory' or even a combination of theories that can be proved right or wrong; it is an expression of what men [...]
This graph measures the percentage of words in one corpus that are hypernyms of words in the other corpus. A hypernym is a more abstract representation than its hyponym. Error bars have been removed because they are too narrow to visualize.

Conclusion
In trying to distinguish fiction from non-fiction, in locating what makes fiction and the novel unique as types of writing, I have in the process been attempting to gain insights into their larger social function, to answer that perennial question of "why literature matters." According to the results presented here, if we focus on the quantitatively distinct qualities of novels in particular, on what separates them off from non-fictional or "true" writing, we can say that the novel's mattering since the nineteenth century appears to be less a matter of social realism and more one of phenomenological encounter, a kind of social embedding in the world. It is not that this is the only way novels have been or could be meaningful to readers. This is the problem of predictive versus descriptive approaches that I discussed at the outset: predictability forecloses other possibilities, other qualities that may be important. Descriptive models simply identify which features differ and by how much, without presupposing a limit to the feature space. The value of the quantitative point of view is that it allows us to better understand the way a particular type of writing signals to readers a particular orientation, what we might call its social positionality (via Bourdieu). This does not foreclose the myriad ways readers can find their own version of the novel's mattering. But it does allow us to better understand the novel as a social category.
Seen in this way, the fictionality of novels is special because of the notion of encounter and the questioning that comes with it, the way they put us as readers in the world in a particular way. Things seem and feel a certain way, just as there is a great deal of doubt and chance and perhaps and maybe. These findings appear to be robust across two different languages, two very different time frames, and across both a larger and smaller, more canonical sample of writing. As we have already seen, we can use models built on the behavior of the classical nineteenth-century novel and still predict contemporary novels with a great deal of accuracy. If we look more closely at these features over time, we also see how they both increase and then eventually remain constant over a longer period of time (fig. 1). The values that are put in place surrounding fictional discourse in the nineteenth century remain largely intact over the course of time, with some features, like an emphasis on sense perception, rising considerably in importance. 41
41 Further analysis suggests that this increase is largely driven by a reliance on words for "sight." Most of the other senses remain flat. More research could uncover what this ocular bias suggests in terms of fiction.
Figure 3. Rates of "doubt" and sense "perception" in English novels published between 1800 and 2000. In the former case we see a very slight rise by the second half of the nineteenth century and in the latter a continual rise that levels off in the postwar era.
Of course not all novels are like this and not all novels that are like this are like this in equal ways. The extent to which these markers of fictionality are consistent across different genres of novelistic writing still remains an open question. One can also imagine another study in which we could explore those moments when novels become highly non-fictional, to understand the truth within the imaginary. How might we characterize the non-fictional within the fictional and what purpose do these passages perform? It will come as little surprise that of the handful of novels misclassified by the algorithm used at the beginning of this article, Melville's Typee and Omoo are in this group (but not Moby Dick, suggesting canonization is picking up at least in part on its fictionality). But so too is a novel like Behemoth: A Legend of the Mound-Builders by Cornelius Matthews, a novel about slaying an ancient Mastodon and argued to be a key influence on Melville. 42 Can we say with more specificity what function the informational use of language, Aristotle's legein, has within fiction, not only in these novels, but within novels more generally? Such a study would try to provide a mirror image to this article's insights, a diptych in which we see something about the novel reflected in relief.
Most important for me, though, is the way the methods used here are not reducible to the single passage or well-turned sentence. There are hundreds of thousands of lines of novels that contain instances of our fictional markers, of pronouns, sense perception, and the subjunctive mood and negation. Each one of them is slightly different from another. Individually they are nothing special. Taken together, however, they signal a powerful message to readers. Fictionality is a feeling we get as readers from the likelihood of seeing all of these words flash across the page, words like "feeling," "knowing," "seeing," "remembering," "almost," "possibly," "vaguely," and a variety of forms of negation, of the not. This is the space of fiction's apartness. "And where is this elsewhere?" Roland Barthes once asked. "In the paradise of words."
42 Curtis Dahl, "Moby Dick's Cousin Behemoth," American Literature 31.1 (1959), 21-29.