READING AND REREADING: INSIGHTS INTO LITERARY EVALUATION

In this paper, we argue that the long-term evaluation of literary texts privileges their formal and structural aspects and, to a lesser degree, their content. We distinguish external and internal reasons for literary evaluation and privilege the latter: the former will only last as long as the themes and the external reasons remain, which means they are transient and do not lead to long-term evaluation of literary products. At the same time, it is the second type of internal reasons (the form and structure of a literary work) that plays a role and contributes to canon formation. We present the results of two experiments in which two different groups of participants (university students and older people in a convenience sample) in two cultural settings read three texts (one popular fiction and two canonical ones) several times and answered the three questions derived from the "rereading paradigm" by Bortolussi & Dixon (2003). As an additional measure to check the participants' reading habits, in Study 1 the ART test from Stanovich et al. (1995), adapted for a Ukrainian audience, was employed. The results, which did not corroborate the predictions, are discussed and confronted with some insights into the theory of literary evaluation.


Introduction
Literary evaluation: a complex field
The evaluation of literary texts (or artworks in general) would seem a somewhat esoteric business, something that a handful of critics engage in with little or no relationship to either the production or the dissemination of literary texts. Nor would it seem to bother actual readers of literature. And indeed there are few academic studies that probe the mechanisms through which such evaluation operates; exceptions are von Heydebrand & Winko (1996), Albers (2007) and van Peer (2008).
One may object that there has been a flood of publications dedicated to the "canon wars", but as we will argue in a moment, these can be seen as having had only an ephemeral impact. In general, there is little consensus in literary studies on why particular texts are better than others. A concomitant shortcoming is that we have little insight into what it is that drives the evaluation of literary texts.
This near-total absence of studies in literary evaluation is remarkable, for evaluation takes place not only individually but also on a macro scale, through the unavoidable and inexorable pressure to select. Let us therefore, by way of preparation for further discussion, look at some empirical data. Although there may be differences, in most western countries the production of new books is a well-developed industry. Take Dutch, for instance, a language spoken by some 25 million people. The annual output of new titles, as indicated by Cultuurindex Nederland, runs to around 35,000. These are all new titles, so not including new editions, reprints and the like, and they include all kinds of books: biographies, cookbooks, comics, travel guides, etc. Of these, roughly 35 % fall in the "fiction" category, which for our present purposes we may deem roughly identical to "literature": novels, short stories, plays, scenarios, essays, poetry, etc. That means that every year about 10,000 new literary works are produced in Dutch. Yes, that is about 100,000 new works in just ten years. And bear in mind that Dutch is not a widely spoken language.
For English, the numbers are, of course, more staggering still. According to the International Publishers Association (IPA), we must reckon there with some 700,000 new titles per year. (These are books published by regular publishing houses, thus not including self-publishing by authors, which would add roughly another 700,000 titles a year.) So, at the same share of roughly 35 %, some 250,000 new literary works are generated by regular publishing houses in the English-speaking world: a quarter of a million, per year.
The relevance of these empirical data must be quite obvious: there is absolutely no way in which all these new literary works can be preserved for posterity. Even a fraction of them will put a heavy demand on space, finances, and time. Hence a selection will have to be made. And the numbers leave little doubt that this selection will be inexorable, even brutal, with more than 90 % of all new literary works being destroyed. Even preserving 10 % of the literary books annually published in English would be nearly impossible: it would amount to some 25,000 new books per year to be purchased by libraries, classified, stored, etc. Not even the richest countries could afford such an enormous effort at storage. So select they will, from the very beginning. So evaluation is inevitable.
But how does this evaluation of what to retain and what to discard take place? Nobody knows. Indeed, this is a staggering conclusion from the above observations: although a constant, massive evaluative operation takes place in the literary system, literary scholars make hardly any effort to study, let alone explain, the mechanisms by which these operations take place or the principles involved. Here we propose some basic ideas on the matter, without, however, being able to probe each of them in detail.
We will distinguish between external and internal "reasons". By reasons we mean the motives that people (or groups of people) have for evaluating literary works positively or negatively. Note that evaluation, like emotion, is always characterised by a valence: positive or negative. In the same way as emotions are experienced as pleasant or unpleasant, evaluations have a dimensional orientation: they are either agreeable or disagreeable to our own preferences, values, norm systems, etc.
To look at the external reasons first, we may evaluate a literary work for a variety of reasons that have little to do with the work itself. Maybe the author was on television, or her (attractive) picture is on the cover, or a friend of mine was highly positive about it. Or maybe it is prescribed by my religion or my ideological conviction. Or it is a work everyone should have read, they say, so it must be good. It will be obvious that for commercial success such external reasons are often of great importance. For instance, many people will positively evaluate a work if it is about something (dramatic) that really happened. At the same time, such external reasons cannot influence a work's evaluation over a long time, so that they will usually be ephemeral.
Usually, it is internal reasons that will promote a book (or forbid it). There are two kinds of internal reasons: one has to do with what is traditionally called its content, the other with its form, or structure. The former evidently plays a role when we read a literary text: is it about a topic that interests us? Travel, psychology, mystery, or crime? There is no doubt that such personal preferences play a predominant role in an individual's evaluation of a literary book. But is this also the case for groups of people, or for society as a whole? Clearly, there are group tendencies, witnessed by the existence of fan clubs (for Sherlock Holmes, Irving Wallace, or Karl May, for instance), by young girls' interest in "chick literature", or by avid readers of thrillers, detective stories, or horror or vampire stories. At the level of society, things become much more diffuse. Clear examples are national epics, such as the Portuguese Os Lusíadas by Luís de Camões, or the Divina Commedia in Italy. Several of Shakespeare's historical plays, especially Henry V, function as a "preference" at the level of the nation. Here we may also situate historical fashions. Examples are the chivalric romances that swept over Europe during the Middle Ages, or the hundreds of robinsonades, or the waves of Werther translations. The so-called "canon wars" of the 1980s and 1990s take their place in this domain: themes such as race, gender and sexual orientation became the focal point for evaluating literary works.
Such content-oriented evaluation, however, will only work as long as the themes themselves remain. As soon as interest in such topics wanes, so will their positive evaluation. In other words, long-term evaluation will operate on content that has a universal appeal over longer periods of history. The appeal generally works if the ideal state of affairs remains unfulfilled. Thus the conflict between law and conscience, as epitomised in Sophocles' Antigone, remains as stinging as in Antiquity. Similarly, the star-crossed lovers' downfall amid a society that elevates vengeance and materialism above love will continue to grab our attention and emotion. The psychological machinery of seduction, of guilt, of religious belief, or the social mechanisms of injustice and humiliation, of inequality and poverty, are themes that may occupy a central place in the evaluation of a work's content.
In what follows, we will leave these considerations of content and instead concentrate on the second type of internal reasons for evaluation: the form or structure of a literary work. There is much to be said in favour of concentrating on this aspect. Many stories are alike, be they about love or adventures, something that has given rise to the development of story grammars, beginning with the work of Propp (1928). Often it is not so much the events narrated in a literary work that fascinate us as the way they are told. There is a general pattern underlying most love stories, which could be paraphrased as: A loves B, but B does not love A, with some complications on the way leading to either a union or a fatal separation (for instance, through death) of the lovers. There are literally thousands of stories based on this pattern. But there is only one Eugene Onegin, which is still singled out for its quality, not because of the story events, but precisely because of the form in which they are presented to the reader. Thus there may be good reasons to privilege formal and structural aspects of literary works in their (long-term) evaluation. But how are such aspects to be conceptualised in the concrete operations of the evaluative process?

Rereading as an evaluative tool
The preceding section has outlined the complexity (and necessity) of literary evaluation. In this complex situation, there have been few ideas on how to come to terms with the processes of selection that the literary field is constantly engaged in. Fortunately, there has been an innovative and workable proposal recently, which will be the focus of our further deliberations here: the "rereading" test, as proposed by Marisa Bortolussi and Peter Dixon, both of the University of Alberta. In Psychonarratology (2003), the authors claim that "how readers process narrative is essentially an empirical question that can only be answered by systematic observation of actual readers reading actual texts; it cannot be answered solely on the basis of intuition, anecdotal evidence, or even sophisticated models of human experience" (idem, p. 13). Bortolussi and Dixon focus their research on the depth of appreciation of literary works and report a series of "rereading paradigm" experiments. One could argue that when a literary text is high in complexity, being rich in foregrounding elements and various artistic devices (thus presenting an example of "good" literature), these take time and effort to process. Consequently, such texts will induce more pleasure in readers on a second reading, unlike popular texts, where the pleasure will drop as readers confront the text for the second time. Similarly, Lüdtke, Jacobs & Xue (2018) apply the rereading approach to a poetic text and, using both indirect online (eye tracking) and direct offline (questionnaires, marking tasks, and ratings) methods, test whether readers' reactions to sonnets by Shakespeare are different on a first and second reading.
Other investigations, notably Hakemulder (2004) (2017), report similar results, namely that evaluative responses to quality literary texts rise on a second reading.
To gain further certainty concerning the method, we undertook two studies with different samples of readers and a slightly different experimental design. As we believe that appreciation of literature is connected to the amount of reading a person does, an additional measure we used in one of the studies is the so-called ART test, which we will detail in the following section.
Assessing print exposure: the ART test
Stanovich et al. (1995) developed the ART test (Author Recognition Test) to assess respondents' general reading habits. The test involves a list of names, both real authors' names (or pseudonyms) and fake names (foils) that sound like those of prominent people, but are definitely not authors. The latter are used to see whether participants fake knowledge of authors. The test is easy to administer, as it only takes about ten minutes. Participants are instructed to indicate the authors' names they recognise, but are also told that there are false names on the list. The test thereby limits socially desirable answering: a participant's overall print exposure score is the number of correctly identified authors minus the number of ticked foils, so indiscriminate ticking does not pay. This score gives a very good indication of the amount of reading that participants actually do. It also predicts the size of participants' vocabulary. Moreover, Stanovich et al. (1995) used a statistical technique (called hierarchical regression analysis) to examine whether other factors, such as age, gender, IQ, level of education, etc., might influence people's scores. The technique allows one to remove the influence of each of these factors in turn. The net result was that after all these factors had been removed, the amount of reading still predicted very well the size of people's vocabulary, as well as their general knowledge and verbal ability. In another study (Cunningham & Stanovich, 1993), measures of exposure to print predicted individual differences in general scholastic ability, abstract reasoning, reading comprehension, and even mathematics ability.
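The scoring rule described above (correctly identified authors minus ticked foils) can be sketched in a few lines of Python. This is only a minimal illustration, not the scoring procedure of any published version of the test: the function name art_score and the small placeholder checklist are assumptions introduced here, and the items are not those of the list used in Study 1.

```python
def art_score(ticked, real_authors, foils):
    """Print exposure score: recognised real authors minus ticked foils."""
    ticked = set(ticked)
    hits = len(ticked & set(real_authors))    # real authors the participant ticked
    false_alarms = len(ticked & set(foils))   # fake names the participant ticked
    return hits - false_alarms

# Hypothetical checklist (placeholder items, not the actual test list)
real_authors = {"Jane Austen", "Virginia Woolf", "Taras Shevchenko"}
foils = {"Placeholder Foil A", "Placeholder Foil B"}

# One invented participant: two real authors and one foil ticked
ticked = ["Jane Austen", "Taras Shevchenko", "Placeholder Foil A"]
score = art_score(ticked, real_authors, foils)
print(score)  # 2 hits - 1 foil = 1
```

Because every ticked foil is subtracted, a participant who ticks all names on such a list gains nothing over one who ticks none, which is the property the test relies on.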
A very thorough review of the value of the ART has been provided by Mol & Bus (2011), who carried out a meta-analysis. They pooled the results of 99 empirical studies (with a total of 7,669 participants) to probe the relationship between print exposure and reading skills, spread over three groups: (1) pre-schoolers and kindergartners, (2) children attending Grades 1-12, and (3) college and university students. For all measures of reading comprehension, technical reading and spelling, moderate to strong correlations with print exposure were found. The authors explain the results as the outcome of an upward spiral of causality: children who are good at reading read more, and because they read more, their reading skills improve still further. Mol & Bus (idem) point to the amount of variance in language skills explained by print exposure, which they found to be 12 % at pre-school age, 13 % in primary education, 19 % in high school, and 30 % at college and university level. It turned out that frequent readers are more successful students. Mar & Rain (2015) replicated the results, but additionally distinguished between reading fiction and reading non-fictional expository texts: "Exposure was measured both with self-report and with recognition tests of print exposure. Verbal ability was measured in the form of synonym knowledge, analogies, sentence completion, and reading comprehension in 4 different studies. Across all studies, narrative fiction was a better predictor of verbal abilities relative to expository nonfiction. When examining unique associations, controlling for demographic variables and the other genre, fiction remained a robust predictor, whereas nonfiction became a null or weak negative predictor" (p. 419). Moore & Gordon (2015) confirm previous research on the ART, but found that recognition of particular authors varied substantially over relatively short periods of time, indicating cultural specificity. Hence they advise updating the list of authors regularly and adapting it to local situations.
In general, the ART is a powerful tool for predicting a range of cognitive aptitudes.

Methods
Study 1. The experiment was conducted in October-November 2017 at Borys Grinchenko Kyiv University, Ukraine. The sample included 55 participants: mostly MA students majoring in English Philology or Translation, but also several PhD students and staff members (almost exclusively female). Given this background, the respondents' proficiency in English as a foreign language (EFL) was assumed. The experiment was conducted in a conventional academic setting during a lecture course given by one of the authors.
The participants were asked to read three texts (their order counterbalanced) and evaluate them on a 7-point Likert scale, using the three questions derived from the rereading paradigm by Bortolussi and Dixon (2003):
(1) Is this text an example of good literature?
(2) Did you enjoy reading it?
(3) Would you recommend it to a friend?
The texts we used were short paragraphs (from 103 to 124 words) from three novels, one popular and two canonical ones: When Love Awakes by Lee Stafford, Pride and Prejudice by Jane Austen and Mrs. Dalloway by Virginia Woolf (for full texts of the paragraphs, see Appendix 1). All three focused on love relations of the protagonists, as we thought this theme could be engaging for the participants and stimulate their attention and interest in the experiment. In the initial part of the questionnaire, the respondents were informed that the experiment was about their personal reactions and the differences they noticed between the texts. The questionnaires were prepared in Ukrainian to avoid any possible language misunderstanding, while the texts were presented in their original form in English.
The participants read the texts and answered the three questions, then reread the passages and filled in the questionnaires two more times: several hours after the first reading (to check the short-term effect) and two days after it (to verify the long-term one). For each participant, the order of the three texts was the same on every reading.
Additionally, to see whether frequency of reading plays a role, after the second reading the respondents completed the ART test, specifically adapted for the Ukrainian audience (see Appendix 2), derived from Stanovich & West (1989), Mol & Bus (2011) and Hakemulder & Kuijpers (2017). The list of names (42 items in total) included 6 authors of Ukrainian / Russian literary fiction, 6 authors of Ukrainian / Russian popular fiction, 9 authors of international literary fiction, 9 authors of international popular fiction and 12 foils.
At the end of the questionnaire, the participants indicated their gender and age and gave their formal consent to participate in the experiment.
Recall what, according to the theory, the rereading test would predict: canonical texts (in our case Texts 2 and 3) should be evaluated higher on a second reading, while the popular text (Text 1 in the experiment) should be evaluated lower on a second reading. The results, however, turned out quite differently, showing no significant difference between the three texts on the three readings. The results of the comparison between the first and second readings are presented in Graphs 1 and 2 below.
Graph 1. Responses to three questions across the three texts. Study 1.

Graph 2. Responses in a first and second reading across the three texts. Study 1.
The ART test scores proved to be rather low for a cohort of this kind, which could be explained by the general decline in reading in the country as well as by the fact that the list contained a considerable number of names of popular authors from the West, some of whom may be little known in Ukraine.
Study 2. Respondents in this study were 39 volunteers in a convenience sample: participants in a course ("The Use of Art") organised by a Foundation for Art, Music and Philosophy in Flanders, Belgium. Their average age was 60+ with a predominance (80 %) of female participants. They had a more than usual interest in art, philosophy and literature, as testified by their participation in this course.
As in Study 1, the same three texts (translated into Dutch by the first author) were read by the participants at three points in time: (1) first reading; (2) two hours after the first reading; (3) one week after the first two readings. After having read the texts, readers gave their evaluative response to the three questions of Study 1, on a 10-point scale.
A one-way ANOVA revealed no significant differences between responses to the questions compared over the three reading moments. Nor were there significant differences in reactions to the three individual texts at the various reading times. There was only one exception: on reading the texts for the second time, the more complex texts (Jane Austen and Virginia Woolf) scored significantly higher in readers' evaluation, with p-values of .001, .010, and .005 for the three evaluative questions. The differences can be seen in Graph 3 below:
Graph 3. Responses to three questions across the three texts on the second reading. Study 2.
No significant differences between the texts were noted on either the first or the third reading. This may sound like a confirmation of Bortolussi and Dixon's theory that more canonical texts will be evaluated higher on a second reading. Yet this does not turn out to be the case: for none of the three texts does the evaluation differ significantly between a first, second or third reading. In other words, it is only during the second reading that the canonical texts are evaluated higher than the trivial text.
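For readers less familiar with the procedure, the F statistic at the heart of a one-way ANOVA can be sketched as follows. This is an illustration under stated assumptions, not the analysis script of the study: the function name one_way_anova_f is hypothetical, the ratings are invented, the three reading moments are treated as independent groups (as a one-way design does), and the sketch stops at the F value without deriving a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of rating lists."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total number of ratings
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual ratings around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented ratings (10-point scale) of one text at three reading moments
first = [4, 5, 6, 5, 4]
second = [5, 6, 6, 5, 5]
third = [4, 4, 5, 5, 4]
f_stat, df_b, df_w = one_way_anova_f([first, second, third])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means that the group means differ more than the spread within the groups would lead one to expect; whether it counts as significant depends on comparing it with the F distribution for the two degrees of freedom, a step that a statistics package would normally handle.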
To get a more general picture of readers' reactions, we decided to conflate their responses to the three questions. We may do so if the replies given to the three questions correlate systematically. A reliability analysis yielded a Cronbach's alpha of .95, allowing us to join reactions to the three individual questions. We did so by creating a new variable, called Evaluation, which can now be used to compare readers' overall evaluative response to the three texts at three different points in time. The results can be seen in Graph 4 below:

Graph 4. Conflated responses to three questions across the three texts. Study 2.
As we can see from the graph, the general evaluation of the three texts occupies a middle ground, between 3 and 6 on a 10-point scale, so none of the texts is deemed very poor, but none very valuable either. More important, however, are the differences between the average evaluations of the texts themselves. We see that Text 1 (the trivial romance) is indeed lowest of the three, with Text 2 (Jane Austen) scoring highest, while Text 3 (by Virginia Woolf) falls in between. The crucial thing for the theory, however, lies in the different evaluations at the rereading times. We notice that the trivial text is given almost equal values on a first and second reading, but its value drops markedly upon the third reading (after a week). Text 2 (Jane Austen), while being evaluated higher than Text 1, scores a little higher on a second, but then lower on the third reading. The most complex text, by Virginia Woolf, shows a highly similar pattern: evaluated higher on a second, but lower on a third reading.
However, as said before, hardly any of the differences noted are statistically significant, so that in any case the experiment does not provide strong evidence for the view promoted by Bortolussi and Dixon. Note that the sample in Study 2 consisted of readers who were much older than the participants of Study 1, and not enrolled in university courses. But again their reactions did not correspond to the predictions.
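The reliability check and the conflated Evaluation variable used in Study 2 can be sketched as follows. This is a minimal illustration and not the authors' analysis: the function name cronbach_alpha and the ratings are assumptions introduced here, the formula is the standard one based on item variances and the variance of the summed scores, and averaging the three answers is only one plausible way of building the composite.

```python
from statistics import mean, variance

def cronbach_alpha(items):
    """items: k lists, each holding one question's ratings for the same respondents."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]        # summed score per respondent
    item_variance = sum(variance(q) for q in items)   # sum of the k item variances
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Invented ratings of five respondents on the three questions (10-point scale)
good_literature = [3, 5, 6, 8, 4]
enjoyed = [4, 5, 7, 8, 3]
recommend = [3, 6, 6, 9, 4]
questions = [good_literature, enjoyed, recommend]

alpha = cronbach_alpha(questions)
# Composite variable: each respondent's mean rating over the three questions
evaluation = [mean(row) for row in zip(*questions)]
print(round(alpha, 2), evaluation)
```

An alpha close to 1 indicates that the three questions rise and fall together across respondents, which is what justifies treating their average as a single Evaluation score.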

Discussion and conclusions
Sadly, we know too little about what happens to readers (real or potential) when they come to (un)conscious decisions about which books to read or reread, though amid the massive overflow of current book production this is crucial, both economically and culturally. Some of the millions of titles published annually will stay for centuries while others will be forgotten even before the next generation of readers is born.
While continuous evaluative procedures are taking place in the literary system, literary scholars are fairly slow in looking at the mechanisms and evaluative operations that take place as well as the basic principles employed.
The distinction between external and internal reasons for evaluation we proposed above led us to clearly privilege the latter, while the former, we believe, will only last as long as the themes and the external reasons remain, which means they are transient and do not lead to long-term evaluation of literary products. At the same time, it is the second type of internal reasons (the form and structure of a literary work) that plays a role and contributes to canon formation.
The complexity of this highly needed process of selection and literary evaluation that the field is constantly engaged in requires, as we tried to demonstrate, an accountable and objective methodology aimed at conceptualising such aspects in the concrete operations of the evaluative process. The rereading paradigm offered by Bortolussi and Dixon is an example of such a workable test.
At the same time, the two experiments we have reported above present us with a problem: a range of previous studies on rereading had found support for the theory, yet our efforts to replicate the results have been unsuccessful and did not corroborate the predictions. We have to bear in mind that the experiments were conducted with two fairly different population samples and, more than that, in two different cultural settings. Surely, it is too early to question the theory, but replication failure may lead us to reflect on some existing problems of literary evaluation theory.
In light of our failure to replicate the predictions, we find it necessary to continue the studies with different groups of respondents to check whether the cultural background of participants plays a role. From a more general perspective, what this research has pointed out is that certain aspects of the field of literary studies need to be revisited.

Appendix 1. Experimental materials.
Text 1. From When Love Awakes by Lee Stafford
By morning, she had made all kinds of excuses for his brusquely unfeeling behaviour. Clearly, he had been worried. Soon, she told herself, he would be back, and in the light of the newly found trust which had grown between them he would surely give her some explanation of why he had had to leave with such alacrity. Then he would take her in his arms and kiss her, long and thoroughly, as he had before, and, at last, make love to her, as she was longing for him to do. She melted inwardly at the mere thought of it, and there was no longer any use trying to ignore the torment of her body and her emotions his absence was putting her through.
Text 2. From Pride and Prejudice by Jane Austen
The evening passed quietly, unmarked by anything extraordinary. The acknowledged lovers talked and laughed, the unacknowledged were silent. Robert was not of a position in which happiness overflowed in mirth. And Julia, agitated and confused, rather knew that she was happy, then felt herself to be so; for, besides the immediate embarrassment, there were other evils before her. She anticipated what would be felt in the family when her situation became known; she was aware that no one liked him but Jane; and even feared that with the others it was a dislike which not all his fortunes and consequences might do away.
Text 3. From Mrs. Dalloway by Virginia Woolf
For only a moment; but it was enough. It was a sudden revelation, a tinge like a blush which one tried to check and then, as it spread, one yielded to its expansion, and rushed to the farthest verge and there quivered and felt the world come closer, swollen with some astonishing significance, some pressure of rapture, which split its thin skin and gushed and poured with an extraordinary alleviation over the cracks and sores. Then, for that moment, she had seen an illumination; a match burning in a crocus; an inner meaning almost expressed. But the close withdrew; the hard softened. It was over – the moment.