Iconicity affects children’s comprehension of complex sentences: The role of semantics, clause order, input and individual differences

Complex sentences involving adverbial clauses appear in children's speech at about three years of age, yet children have difficulty comprehending these sentences well into the school years. To date, the reasons for these difficulties are unclear, largely because previous studies have tended to focus on only sub-types of adverbial clauses, or have tested only limited theoretical models. In this paper, we provide the most comprehensive experimental study to date. We tested four-year-olds, five-year-olds and adults on four different adverbial clauses (before, after, because, if) to evaluate four different theoretical models (semantic, syntactic, frequency-based and capacity-constrained). Seventy-one children and ten adults (as controls) completed a forced-choice, picture-selection comprehension test, providing accuracy and response time data. Children also completed a battery of tests to assess their linguistic and general cognitive abilities. We found that children's comprehension was strongly influenced by semantic factors (the iconicity of the event-to-language mappings) and that their response times were influenced by the type of relation expressed by the connective (temporal vs. causal). Neither input frequency (frequency-based account), nor clause order (syntax account), nor working memory (capacity-constrained account) provided a good fit to the data. Our findings thus contribute to the development of more sophisticated models of sentence processing. We conclude that such models must also take into account how children's emerging linguistic understanding interacts with developments in other cognitive domains such as their ability to construct mental models and reason flexibly about them.


Introduction
In order to construct a coherent mental representation of the events described in complex sentences, listeners must be able to interpret connectives to establish the semantic relationship (e.g., temporality: after, when, etc.; causality: because, since; concession: although, even if, etc.) between the main and the subordinate clause. An additional challenge for listeners is that in English (and in some, but not all, other languages) the two clauses can occur in two orders. Compare "She had a cup of coffee before she submitted the paper" and "Before she submitted the paper, she had a cup of coffee". In the first sentence, the clause order reflects the order of events in the real world: it is 'iconic'. In the second sentence, the clause order is reversed.
Although complex sentences involving adverbial clauses appear in children's speech at about three years of age (Diessel, 2004), experimental studies have found that children have difficulty comprehending these sentences even at the age of six, nine, or even twelve years (e.g., Emerson & Gekoski, 1980; Johnson & Chapman, 1980; Pyykönen, Niemi, & Järvikivi, 2003). They misinterpret the temporal order, or reverse cause and effect in causal sentences. Researchers have suggested different explanations to account for these (often conflicting) findings. But because individual studies have typically looked at only one type of adverbial clause, and used varying methodologies, it is difficult to determine possible differences and commonalities in the precise influences of different factors on children's performance across sentence types. The present study investigates the comprehension of four different sentence types (after, before, because, if), to test the predictions of four different theoretical accounts.
We first provide a brief characterisation of the four sentence types under investigation, together with a short discussion of causality, which is central for the understanding of because- and if-clauses. We then present four different theoretical accounts of complex sentence processing in children that we have identified in the literature: (1) the semantic account, which assumes that iconicity is the main factor; (2)
(1) The cup broke because it fell off the table.
(2) She must be a queen, because she is wearing a crown.
(3) Can you tell me what time it is, because I have this meeting at one.
In (1), there is a clear causal relation between the two events, and the two events take place in the world independent of the speaker. This type of causality has been called physical or content-level causality. In (2), in contrast, the speaker is using the because-clause as evidence for her (subjective) belief. This type of causality is said to take place on the epistemic level (epistemic causality). Finally, in (3), the because-clause functions as a reason for the speaker's request: it takes place on the level of the speech act (speech act causality). Other scholars have suggested dichotomous distinctions such as objective (content) vs. subjective (epistemic and speech-act) causality (Lois Bloom & Capatides, 1987).
Like because-sentences, if-sentences can be used to express content relations, epistemic relations, and speech act relations between clauses. In the content domain, if-sentences typically express causal relations via predictions (Dancygier & Sweetser, 2000: 121), as in "If you take this, you'll feel better".
Our study investigates children's comprehension of sentences expressing content-level or physical causality. Note that in this case, there is also a clear temporal element in the semantic relationship between the two events: The cause precedes the effect. However, it is worth pointing out that in conversation, describing causally linked events is not the primary function of because- and if-sentences. In spoken discourse, because-clauses typically provide a reason for a statement made (speech-act causality), rather than a cause for an effect (Diessel & Hetterle, 2011). And if-clauses often provide a conceptual framework for the interpretation of the following discourse, not just the main clause within the complex sentence (e.g., Ford & Thompson, 1986). For example, a speaker may say: "If the weather is good tomorrow, we could go for a hike", before providing more details for that proposal. We will return to this distinction between the semantics of because- and if-clauses and their communicative function at various points in this article.
As noted above, in English, complex sentences can occur in two clause orders: main-subordinate and subordinate-main. (Note that this is true only for adverbial sentences, not for other types of complex sentences.) For each sentence type (after, before, because, if) one clause order reflects the order of events in the real world, while the other reverses it. Table 1 illustrates the interaction of connective and clause order yielding (non-)iconicity. For after-, because-, and if-sentences, subordinate-main clause orders are iconic. For before-sentences, however, main-subordinate clause orders are iconic.
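The interaction of connective and clause order can also be written out as a small lookup. The following Python sketch is purely illustrative and not part of the study materials; the names ICONIC_ORDER and is_iconic are hypothetical, and the mapping simply restates the description in the preceding paragraph.

```python
# Illustrative sketch only; names are hypothetical and not from the study.
# For each connective, the clause order that mirrors the order of events.
ICONIC_ORDER = {
    "after": "sub-main",    # After X, Y: X happens first and is mentioned first
    "because": "sub-main",  # Because X, Y: cause X is mentioned first
    "if": "sub-main",       # If X, Y: condition X is mentioned first
    "before": "main-sub",   # Y before X: Y happens first and is mentioned first
}


def is_iconic(connective: str, clause_order: str) -> bool:
    """Return True when the clause order matches the order of events."""
    return ICONIC_ORDER[connective] == clause_order


if __name__ == "__main__":
    for connective in ICONIC_ORDER:
        for clause_order in ("main-sub", "sub-main"):
            label = "iconic" if is_iconic(connective, clause_order) else "non-iconic"
            print(f"{connective:8s} {clause_order:9s} {label}")
```

Under this encoding, iconicity is a property of the connective and clause order combination, not of either factor alone.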
Iconicity is central to the semantic account of children's comprehension of complex sentences, the first of the four accounts to which we now turn.
1.2. Theoretical accounts
1.2.1. Semantic account
Clark (1971) conducted the first experimental study on the acquisition of the temporal connectives before and after, looking at both production and comprehension in three- to five-year-olds. In the comprehension task, children were asked to act out sentences like "He patted the dog after he jumped the gate" with toys. Not surprisingly, younger children made more errors than older children. In addition, children of all age groups made more errors with those sentences that were non-iconic, and more errors with sentences containing after than with sentences containing before. These findings led her to suggest that children's comprehension of complex sentences is driven primarily by a semantic principle. Children initially employ an "order-of-mention" strategy: They assume that what they hear first, happens first. In other words, a sentence is being interpreted by assuming a direct mapping (analogy) between the sequence of events in the linguistic form (clause order) and the sequence of events in the real world. As a consequence, children interpret iconic sentences correctly, but misinterpret non-iconic sentences. A correct understanding of both orders emerged in her sample at around age five. It should be pointed out that Clark based her account on an experiment that included only temporal clauses, and did not specify to what extent it should also apply to other complex sentence types. However, it seems reasonable to assume that if children operate with an order-of-mention strategy on the incoming speech stream, they would do so also with causal and conditional sentences, where these describe a causal relationship between two events.
Clark furthermore suggested that before and after differ in terms of their semantic features. The underlying assumption is that words are made up of a number of semantic features, which can have positive or negative values, such as [±Prior]. In this framework, it is assumed that after is more complex than before (see Clark, 1971, for details), which results in an asymmetric acquisition of the two sentence types. Children would start out by wrongly interpreting after as before.
Subsequent studies that went on to test Clark's hypotheses used a variety of different methods and investigated different age groups (see Table 2), and produced contradictory results. Regarding the comprehension of iconic vs. non-iconic sentences, several studies, including recent ones, have observed better performance with iconic sentences (Blything & Cain, 2016; Blything, Davies, & Cain, 2015; Feagans, 1980; French & Brown, 1977; Stevenson & Pollitt, 1987; Trosborg, 1982, for Danish), although the strength of the evidence is limited for some studies by the fact that they did not manipulate clause order (i.e., order of main and subordinate clause) (Feagans, 1980), or confounded clause order with plausibility (French & Brown, 1977). Other studies, however, failed to find an advantage for iconic sentences (Amidon, 1976; Gorrell, Crain, & Fodor, 1989; Keller-Cohen, 1987).
Regarding the difference between the two connectives before and after, previous research has, again, produced divergent results. In line with Clark's original findings, several studies have found moderate to strong advantages for before (Blything & Cain, 2016; Blything et al., 2015; Feagans, 1980; Johnson, 1975), including faster response times in a picture-selection task to sentences containing before (Blything & Cain, 2016), while others either did not observe a significant difference between the two (Amidon, 1976; Amidon & Carey, 1972; French & Brown, 1977; Gorrell et al., 1989; Johnson, 1975), or found the opposite, that is, after being acquired earlier/being easier than before (Carni & French, 1984).
For because- and if-sentences, the evidence supporting the semantic account is even less clear. This is in part due to methodological issues. On the one hand, many of the studies had relatively high task demands such as requiring meta-linguistic judgments (Corrigan, 1975; Emerson, 1980; Johnson & Chapman, 1980). On the other hand, many used sentences that were constrained by world-knowledge and plausibility (e.g., Kuhn & Phelps, 1976). In order to gauge children's purely linguistic understanding of the meaning of a connective, it is necessary to remove any cues that could guide their interpretation other than the sentence itself. Emerson (1979) addressed this by using so-called reversible sentences, that is, sentences whose reversed meaning is also plausible. She presented children between 5;8 and 10;11 with two different three-frame picture sequences, one corresponding to the order of events in the test sentence, and one showing the opposite order. The children's task was to select which of the two sequences went with the test sentence. Emerson found that children performed better with iconic sentences in which the cause preceded the effect (e.g., "Because he could hear the loud noises and the laughing he went outside"). Only the eight-year-olds were able to make correct selections with non-iconic sentences (e.g., "He went outside because he could hear the loud noises and the laughing"). Emerson and Gekoski (1980) used the same methodology to test the comprehension of because- and if-sentences in children between 2;8 and 11;11 years, complemented by additional tasks such as asking children to judge the equivalence of meaning in sentences with different connectives (because/so, if/then) or clause orders. Again, above-chance performance was found only at around eight years, but unlike Emerson's (1979) study, they did not find any effect of iconicity.
Amidon (1976), who used a command-task ("If the light comes on, you move the car"), similarly found no evidence for an iconicity preference with if-sentences in five- to nine-year-olds, but she found above-chance performance already in the youngest age-group.
To summarise, there is some, albeit not unequivocal, evidence in support of the semantic account in children: Children seem better at comprehending iconic temporal sentences, and there is some evidence that before-sentences may be acquired earlier/be easier to process than after-sentences. The role of iconicity for because- and if-sentences is, however, less clear.
We are aware of only three studies that explicitly studied adult processing of isolated sentences containing after and before, one study that looked at because, and as yet no study using if. Clark and Clark (1968) gave participants sentences like "After he tooted the horn, he swiped the cabbages" to memorise, together with a noun cue ("the boy"). Participants were then presented with only the noun cues and asked to recall the corresponding sentence. They found that recall was better with iconic sentences. Smith and McMahon (1970) replicated these findings. Münte, Schiltz, and Kutas (1998) used event-related brain potentials (ERPs) to investigate listeners' processing of sentences with before and after. Critically, they only compared two types of sentences with each other: iconic after-sentences and non-iconic before-sentences. They observed that the before-sentences elicited greater negativity, and that the size of the effect was correlated with individual working-memory spans, with individuals with higher spans showing larger negative effects. Münte et al. suggested that this reflects the differential involvement of working memory during the processing of iconic and non-iconic sentences. However, given that clause order and connective type were confounded with iconicity, it is unclear if the observed effect can be attributed to iconicity alone. Finally, in a study on reading comprehension, Irwin (1980) found that college students'

Table 2. Overview of previous studies on children's comprehension of complex sentences, indicating the connectives studied (only those relevant for the present study), ages covered (rounded), and tasks used.

Syntactic account
A competing hypothesis is that the comprehension of complex sentences is mainly affected by syntactic form. Specifically, Diessel (2005) suggested that not only children, but listeners in general, find main-subordinate orders easier to process. For this proposal, he adapted Hawkins' "performance theory of order and constituency" (Hawkins, 1990, 1992, 1994). In a nutshell, Hawkins assumes that certain syntactic configurations make it easier for the parser to recognise the structure it is currently parsing and to build a hierarchical syntactic representation. In the case of complex sentences, initial connectives like after, as in 4(a), signal that the structure is a complex sentence. According to Diessel (2005), this requires the parser to keep the subordinate clause in memory until the main clause can be parsed and the complex sentence fully constructed. In 4(b), in contrast, the main clause can be fully processed first. When the subordinate clause is encountered, it can be parsed and attached directly to the representation. Main-subordinate orders are thus easier to process, because they have a shorter "recognition domain": Fewer words must be parsed in order to recognise the syntactic structure of the sentence (see Hawkins, 1992, p. 48 for a formal definition). Diessel (2005, 2008) acknowledged that in production, factors other than syntactic structure play a role in determining the clause order, namely discourse-pragmatic forces and semantics (iconicity). From a pure processing perspective, however, listeners should find isolated complex sentences easier to process if they occur in main-subordinate order.

4(a) [[
To our knowledge there has been no language acquisition study that found support for this hypothesis. Some of the earlier studies cited above, which did not produce corroborative evidence for the semantic account, reported that children appear to understand main clauses better than subordinate clauses (Amidon, 1976; Amidon & Carey, 1972; Gorrell et al., 1989; Johnson, 1975; Stevenson & Pollitt, 1987), but not that main-subordinate orders were comprehended better. While these findings do not support Diessel's hypothesis, they could be taken to indicate that syntax, more specifically, syntactic constituency (main vs. subordinate) plays a role in children's sentence comprehension. It is notable, however, that all the studies that reported a "main clause effect" used a version of the command-task mentioned before (e.g., "Before you move the blue plane, move the red plane"). It was observed that the majority of errors in the children's responses were errors of omission, rather than reversal errors, as observed by Clark and others who used the act-out paradigm. Specifically, children tended to omit the command given in the subordinate clause. Researchers have pointed out that the results may be due to the infelicitous use of a sentence like "Before you move the blue plane, move the red plane" in the experimental set-up. Sentences like these could be "used only when the hearer has established the intent to perform the action mentioned in the subordinate clause" (Gorrell et al., 1989: 625). If this presupposition condition were not met (i.e., if the action in the subordinate clause is not part of the common ground), children would simply ignore this part of the complex sentence. What appears at first sight to be a syntactic effect is thus more likely a pragmatic one.
For adults, Clark and Clark's (1968) study on recall of before- and after-sentences found, in addition to iconic orders being recalled better than non-iconic ones, that participants performed better with main-subordinate orders. Unlike the iconicity effect, this facilitative effect of clause order was, however, not replicated by Smith and McMahon (1970).
Overall, the evidence for the syntactic account as put forward by Diessel (2005) is not very strong.

Frequency-based account
Usage-based approaches to language acquisition posit that children's acquisition of grammatical structures is influenced by the frequency of these structures in the children's language input (for an overview, see De Ruiter & Theakston, 2017). Frequency effects have been observed for a range of syntactic constructions. A frequency-based account would predict that the frequency of order combinations in connective clauses in the input affects children's comprehension of adverbial clauses. Specifically, one would expect that children find those connectives and order combinations which are more frequent easier to understand than those that are less frequent. Both analyses of general language corpora (Diessel, 2001, 2008) and corpora of child-directed speech (De Ruiter, Theakston, Brandt, & Lieven, 2017) have found that (a) because- and if-sentences are much more frequent than after- and before-sentences, and (b) there are clear clause order preferences for three of the four sentence types:
• if-sentences occur primarily in subordinate-main order;
• before- and because-sentences occur primarily in main-subordinate order.
For after-sentences, the picture is less clear. Some studies found that they occur more often in main-subordinate order (Diessel, 2008), others found a preference for subordinate-main orders (Diessel, 2005).
If input-frequency influences processing, children (and possibly adults) should find because- and if-sentences easier to process, and should show facilitative effects for the preferred clause orders of each sentence type, all else being equal. Note that with respect to clause order, the semantic account and the frequency-based account make the same predictions for before- and if-clauses, because for these sentences, the iconic clause order is also (and probably not accidentally, e.g., Diessel, 2005) the most frequent one. Different predictions emerge for because-sentences, however: While the semantic account predicts that sentences beginning with because are easier to process (subordinate-main), the frequency-based account would predict that sentences in which the because-clause follows the main clause are easier to process/acquired earlier.
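The clause-order predictions of the two accounts can be laid side by side in a short Python sketch. This is an illustration under stated assumptions, not an analysis from the study: the dictionary and function names are hypothetical, the preferred orders restate the corpus findings summarised above, and after is coded as None because those findings are mixed.

```python
# Illustrative sketch only; names are hypothetical and not from the study.
# Clause order predicted to be easier under the semantic (iconicity) account.
ICONIC_ORDER = {
    "after": "sub-main",
    "before": "main-sub",
    "because": "sub-main",
    "if": "sub-main",
}

# Clause order reported as more frequent in the input (see text above).
# None marks "after", for which the corpus findings are mixed.
FREQUENT_ORDER = {
    "after": None,
    "before": "main-sub",
    "because": "main-sub",
    "if": "sub-main",
}


def compare_predictions(connective: str) -> str:
    """Say whether the two accounts favour the same clause order."""
    frequent = FREQUENT_ORDER[connective]
    if frequent is None:
        return "unclear"
    return "same" if frequent == ICONIC_ORDER[connective] else "different"


if __name__ == "__main__":
    for connective in ICONIC_ORDER:
        print(connective, compare_predictions(connective))
```

On this coding, only because-sentences separate the two accounts with respect to clause order, which is why they are the critical test case.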
However, frequency effects can occur on different levels of abstraction. Children and adults are also sensitive to discourse-based and semantic features of lexical items that are most frequently used in specific constructions. For example, Kidd and colleagues (Brandt, Kidd, Lieven, & Tomasello, 2009; Kidd, Brandt, Lieven, & Tomasello, 2007) found that children most often hear object relative clauses with inanimate head nouns and pronominal subjects, and they also understand these complex sentence types best when the sentences are formed according to these constraints. A prototypical feature of complex sentences is that they contain transitive verbs. One would thus expect that complex sentences with transitive verbs pose fewer difficulties for children than sentences with intransitive verbs.
There have, to our knowledge, not been any investigations of the links between input frequencies of complex sentence forms with adverbial clauses and children's comprehension of these sentences. However, with the corpus findings regarding the different frequencies of connectives and clause orders in mind (see above), we can evaluate the results of previous studies. The only study that covered three of the four connectives (after, before, and if) found that five- to nine-year-old children showed overall lower error-rates with if-sentences than with after- and before-sentences (Amidon, 1976), in support of a frequency-based account. Moreover, to the extent that children have shown a tendency to perform better with iconic sentences (see semantic account above), these sentences also reflect the more frequent clause orders for before- and if-sentences (and possibly after-sentences) in spoken English. But the evidence for because-sentences, which occur more often in (non-iconic) main-subordinate orders, is rather limited. On the other hand, the approximate ages at which children have been reported to perform above-chance in their comprehension of complex sentences in the various studies indicate that because- and if-sentences may show a more protracted development than after- and before-sentences. This would seem to contrast with what would be predicted on the basis of a pure form-frequency-based account.

Memory capacity-constrained account
Theories of capacity constraints in memory (e.g., Just & Carpenter, 1992) assume that short-term memory¹ plays a central role in sentence processing, and, crucially, that there are individual differences in the resources that a listener (or reader) has at their disposal. As a consequence, individuals with lower memory capacity will find it more difficult to keep more information in active storage during parsing.
Note that the capacity-constrained account is not compatible with the semantic account, because children's use of the iconicity principle (and the semantic features account) is not assumed to be linked to memory in any way. The capacity-constrained account is, however, in theory compatible with both the syntactic and the frequency-based account. The syntactic account makes explicit predictions about the processing difficulty associated with the two clause orders. It is possible that difficulties with subordinate-main orders are exacerbated by low short-term memory capabilities. The frequency-based account does not say anything about the influence of memory, but there is no a priori reason why frequency effects could not be modulated by working memory. For the syntactic and the frequency-based account, then, the capacity-constrained account provides an additional hypothesis, rather than an alternative: Children with better working memory should perform better in complex sentence comprehension tasks than children with lower working memory capabilities. Blything and Cain (2016), who investigated three- to seven-year-old children's comprehension of sentences with before and after, found some support for the capacity-constrained account. Performance in terms of accuracy and speed (response time) was predicted better by children's scores on a memory task (digit span) than by age or vocabulary (Blything & Cain, 2016). To our knowledge there have been no studies that examined the link between memory and comprehension of because- and if-sentences. Studies that investigated the role of working memory in the processing of other types of complex sentences (e.g., passives, relative clauses) have found that memory significantly predicted sentence comprehension over and above the influence of age (e.g., Magimairaj & Montgomery, 2012; Montgomery, Magimairaj, & O'Malley, 2008).
For adults, Münte et al. (1998) found that participants with higher working memory spans showed a more pronounced difference between before-and after-sentences in terms of ERP negativity. They took this to indicate that these participants were probably better comprehenders, although the study did not directly measure comprehension.
Taken together, there is some evidence that individual memory capacities influence complex sentence processing in general, but up to this point there is only limited support for this hypothesis for adverbial clause processing specifically.
To sum up: There are four different theoretical accounts for the comprehension of complex sentences: the semantic account, the syntactic account, the frequency-based account, and the capacity-constrained account. More than four decades of research have produced some support for each of the four accounts, but because researchers have typically focussed on certain types of sentences, and used a plethora of different methods, it is difficult to decide between them. Our study evaluates and compares the predictive adequacy of these different accounts. We also consider how they may interact in the Discussion.

The present study
Our study tests the predictions made by different theoretical accounts across four different sentence types (after, before, because, if) by using the same methodology (forced-choice, picture-sequence selection) for all types and testing the same children (within-subjects design), as well as including measures of short-term memory. Because it is unclear what the role of individual differences in general language ability and executive function may be in complex sentence comprehension, and in order to control for potential confounding factors, we furthermore collected measures of general language ability and executive function (inhibition). We also tested children's understanding of the temporal priority principle (causality). If the children in our sample generally understand (event) causality, then a failure to comprehend the causal sentences must be due to a lack of linguistic rather than conceptual knowledge. In addition, we tested an adult control group to provide a baseline.

Participants
Seventy-one children and ten adults participated. The children were recruited through nurseries and primary schools in the Greater Manchester area. Prior informed consent was obtained from caregivers/parents. All children were monolingual, native speakers of English without any known history of speech or language problems or developmental delays. Of the 71 child participants, 37 were between 3;6 and 4;5 years old (M = 47 months, SD = 3.8, 20 girls), and 34 were between 4;6 and 5;5 years old (M = 60 months, SD = 3.1, 25 girls). We will refer to the first group as the four-year-olds, and the second group as the five-year-olds. Eight additional children were tested, but their data had to be excluded because they turned out to be bilingual (three participants), too old (two participants), too young (one participant), or because they did not understand the task (two participants). One child refused to do the second session, while the second session with another child had to be aborted shortly before completion due to concentration problems, resulting in the loss of two responses. A technical failure caused the loss of three responses with another participant. Half of the data set of one child was lost due to experimenter error. The adult participants (N = 10, M = 33 years, seven women) were students or staff at the University of Manchester, and native speakers of English.

Materials and procedure
The children were tested in a quiet area in their nurseries and primary schools. In addition to the sentence comprehension test, children completed five tasks on general language ability, short-term memory, executive control, and understanding of causality (all detailed below), spread over two sessions on two days. Each session lasted between 25 and 40 min. Children completed half of all items of the sentence comprehension task in session one, and the other half in session two. The language ability tasks and the executive control tasks were administered in session one. The memory test and the causality test were administered in session two. In both sessions, children always first completed the sentence comprehension task before doing the other tasks.
The allocation of trials across sessions and the experimental lists are described in Experimental lists below. Adult participants did only the sentence comprehension task and completed all items in one session, with a short break between the two blocks.

¹ While Just & Carpenter use the term "working memory", we prefer to describe the capacity involved as "short-term memory", because the task doesn't involve manipulation of the stored information. But the two terms are often used interchangeably, and researchers have difficulties separating the two constructs (see Aben, Stapert, & Blokland, 2012 for a discussion).

Sentence comprehension
Participants' comprehension of complex sentences was tested using a forced-choice picture-sequence selection task on a touch-screen. The task was to select out of two picture sequences the one that matched an aurally presented sentence. This allowed us to collect both response accuracy and response time measures.
2.2.1.2. Audio stimuli. Twenty-four complex sentences were constructed, each containing a main and subordinate clause representing two actions performed by a single actor (a boy in half of the sentences, and a girl in the other half). There were six sentences per connective (after, before, because, and if). The because- and if-sentences always expressed a physical causal relationship between the two events (i.e., not epistemic or speech act relations). The stimuli clearly emphasised the causal interpretation of these sentences (there was always only one person in each scene, making the use of speech act causality implausible). Within each set of six sentences, half (three) contained only intransitive verbs, the other half contained only transitive verbs. The objects of the transitive verbs were always inanimate objects. Each sentence occurred in both clause orders (main-subordinate and subordinate-main), resulting in 48 sentences overall. The subject of the sentence was always expressed as a pronoun (i.e., he or she), and all verbs were in present tense. All sentences were between 11 and 13 syllables long. (All experimental sentences can be found in Table A1 in Appendix A.) The sentences were spoken by a female native speaker of British English, and recorded in a quiet room using a digital voice recorder. The stimuli were processed using the software Praat (Boersma & Weenink, 2016), version 6.0.13. Each sentence was first cut into two clauses, and then spliced together again with a pause of 250 ms. The overall intensity of all stimuli was set to 60 dB.
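The arithmetic of the stimulus design (4 connectives × 2 verb types × 3 items = 24 base sentences, each in 2 clause orders = 48 sentences) can be checked with a minimal Python sketch. The sketch only enumerates design cells; the variable names are hypothetical, and the actual sentences are those listed in the appendix, not generated here.

```python
# Illustrative sketch only; names are hypothetical and not from the study.
from itertools import product

CONNECTIVES = ("after", "before", "because", "if")
VERB_TYPES = ("transitive", "intransitive")
CLAUSE_ORDERS = ("main-sub", "sub-main")
ITEMS_PER_CELL = 3  # three sentences per connective and verb type

# One entry per base sentence: (connective, verb type, item number).
base_sentences = [
    (connective, verb_type, item)
    for connective, verb_type in product(CONNECTIVES, VERB_TYPES)
    for item in range(1, ITEMS_PER_CELL + 1)
]

# Each base sentence is presented in both clause orders.
trials = [
    (connective, verb_type, item, clause_order)
    for (connective, verb_type, item), clause_order in product(base_sentences, CLAUSE_ORDERS)
]

if __name__ == "__main__":
    print(len(base_sentences), len(trials))  # prints: 24 48
```

Crossing the factors in this way gives six base sentences per connective and twelve trials per connective once both clause orders are included.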
2.2.1.3. Visual stimuli. For each audio stimulus (complex sentence), two picture sequences were created (for an example, see Table 4), showing the two actions expressed by the sentence in both orders (in left-to-right orientation, which is the convention in English picture books). For the sentences containing before and after, the second picture sequence was the reversal of the pictures of the first picture sequence. This was not possible for the sentences containing because and if, since the semantics of these sentences requires there be some change of state involved. For example in the sequence matching the sentence "Because he opens the door he sees the snowman", the actor first opens the front door and then finds a snowman outside his house. The other sequence has to offer a plausible scenario for the opposite order of events (i.e., first seeing, then opening) in order to be an acceptable distractor. In this case, the actor was depicted as looking out of the window and seeing a snowman, and then opening the door (to have a better look at the snowman). The stimuli were created using the software Anime Pro (version 9.1).

Presentation.
The stimuli were presented using the software E-Prime (version 1.2) on a laptop with a 14-inch resistive touch-screen. The sound was presented via loudspeakers.

Table 3
Conditions of the experiment, 4 connectives × 2 clause orders (main = main clause, sub = subordinate clause) × 2 verb types (transitive, intransitive).

| Connective | Clause order | Transitive verbs | Intransitive verbs |
|---|---|---|---|
| after | main-sub | She hoovers the house after she paints the old fence. | He drives away fast after he shouts out loudly. |
| after | sub-main | After she paints the old fence, she hoovers the house. | After he shouts out loudly, he drives away fast. |
| before | main-sub | He plays his big drum, before he reads his new book. | She hops up and down before she crawls on the floor. |
| before | sub-main | Before he reads his new book, he plays his big drum. | Before she crawls on the floor, she hops up and down. |
| because | main-sub | He opens the door, because he sees the snowman. | She slips to the ground, because she looks at the sky. |
| because | sub-main | Because he sees the snowman, he opens the door. | Because she looks at the sky, she slips to the ground. |
| if | main-sub | She hears the doorbell, if she presses the button. | He falls in the field, if he sneezes lots of times. |
| if | sub-main | If she presses the button, she hears the doorbell. | If he sneezes lots of times, he falls in the field. |

Table 4
Structure of the experimental trials.

| Visual presentation | Auditory presentation |
|---|---|
| blank screen | "Look and listen carefully! Touch the matching story after the beep!" |
| blank screen | "After she paints the old fence, she hoovers the house." |
| two picture sequences | 1000 ms pause |
| two picture sequences | "After she paints the old fence, she hoovers the house." |
| two picture sequences | beep |

L.E. de Ruiter et al. Cognition 171 (2018) 202-224

2.2.1.5.1. Children. The children sat at a table in front of the laptop. In front of the laptop there were two pieces of red cardboard in hand shape fixed to the table. The children were asked to keep their hands on these markers throughout the experiment when they were not selecting a sequence. The children were told that they were going to play a game, in which a lady was telling them stories about two characters, Sue and Tom, and about some animals, and that they had to select from two picture stories the one that matched the sequence that they had heard. The children were instructed to listen carefully and touch the matching sequence after they heard a beep.
Before the start of the actual experiment, there was a warm-up phase to familiarise the children with the task and the left-to-right reading of the picture sequences. In the warm-up, the second presentation of the sentence (see below for details of the set-up) was not automatic, but manually controlled by the experimenter, which allowed the experimenter to explain the layout of the screen before playing the sentence again (e.g., "Here we see that Tom is doing two things in this story. First he is watering his plants. And then he switches the light on", while pointing to the appropriate picture). The first two warm-up trials were like the filler trials (i.e., simple sentences with only two pictures; see below). The other warm-up trials were like the experimental trials, except that the sentences were of the structure "First, …, then…". If a child did not choose the correct picture in any of the warm-up trials, feedback was given and the trial was repeated up to two times. If the child still made the wrong selection, the experimenter proceeded to the experimental trials, but noted that the child had failed to complete the warm-up successfully.
The structure of the experimental trials is shown in Table 4. Before each trial, there was a picture of the character that the next "story" was about (i.e., a picture of Sue or Tom). The experimenter would say something like "Ah, here's another story about Sue. Let's see what she's doing!" to focus the child's attention on the next trial. When the experimenter was sure that the child was paying attention, she started the next trial. The child would first hear the instruction "Look and listen carefully! Touch the matching story after the beep!" (footnote 2), while seeing a blank screen. Then the sentence was played, with the screen still blank. Directly after the presentation of the sentence, the two picture sequences were displayed on the screen. After a pause of 1000 ms, the sentence was repeated, followed immediately by a beep. Once the child had selected a sequence, the screen showed a blue circle to indicate that the trial had been successfully completed. Response time was measured from the offset of the beep. If the child was distracted during a trial, the experimenter repeated the trial.
After every three trials there was a filler trial to give children a small break with relatively easier items. The structure of the filler trials was the same as that of the experimental trials, the difference being that children were presented with a simple sentence (e.g., "Lion is drying his hair.") and only two pictures to select from (e.g., a lion drying his hair and a lion buttoning his coat).
The entire experiment took between 15 and 20 min.

2.2.1.5.2. Adults. The adult participants were tested in a quiet room, using the same set-up as with the children. Instead of using the hand-shaped markers, adults were simply instructed to keep their hands in front of the laptop unless they were selecting a picture sequence. Participants were instructed to listen to the sentence and select the matching sequence after the beep. The warm-up was the same as with the child participants, but no elaborate explanations were provided. After the participants had successfully completed the warm-up, they went through half of the trials, followed by a short break, and then completed the other half of the trials. Overall the experiment took about 10 to 15 min.
2.2.1.6. Experimental lists. Four different experimental lists were constructed. Each list consisted of two sessions. Each sentence (N = 24) occurred once in each session (recall that each sentence occurred in two clause orders), with half of the sentences in each session being in main-subordinate clause order and the other half in subordinate-main clause order. There were three items in each condition. List 2 was created by swapping session 1 and session 2 of List 1. Lists 3 and 4 were the same as Lists 1 and 2, with the difference that all after-sentences were turned into before-sentences and vice versa, and all if-sentences were changed into because-sentences and vice versa (see Table A1 in Appendix A).
The order of the trials within each session was pseudo-randomised. There was a maximum of two consecutive trials in the same condition. The position of the correct picture sequence in each session was counterbalanced, so that in half of the trials the correct picture sequence was at the top and in the other half of the trials at the bottom. In addition, the position of the correct picture sequence across sessions was counterbalanced, so that for any given scene, when the correct picture was at the top in session 1, it was at the bottom in session 2, and vice versa.
Participants were randomly assigned to one of the four experimental lists.
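The ordering constraint described above (at most two consecutive trials in the same condition) can be illustrated with a small rejection-sampling sketch. This is not the authors' procedure, which is not specified here; the function names, the seed, and the reading of "condition" as connective × clause order (8 conditions with 3 items each per session) are assumptions made for illustration.

```python
import random
from itertools import product

def max_run_length(conditions):
    """Length of the longest run of identical consecutive items."""
    longest = run = 0
    previous = object()  # sentinel that equals no condition
    for cond in conditions:
        run = run + 1 if cond == previous else 1
        longest = max(longest, run)
        previous = cond
    return longest

def pseudo_randomise(trials, max_run=2, seed=None, max_tries=10_000):
    """Reshuffle until no condition occurs more than max_run times in a row."""
    rng = random.Random(seed)
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if max_run_length([cond for cond, _ in order]) <= max_run:
            return order
    raise RuntimeError("no valid order found within max_tries")

# Assumed session layout: 4 connectives x 2 clause orders, 3 items each = 24 trials
conditions = list(product(["after", "before", "because", "if"],
                          ["main-sub", "sub-main"]))
session = [(cond, item) for cond in conditions for item in range(3)]
order = pseudo_randomise(session, seed=1)
```

Rejection sampling is enough for a list of this size because most random shuffles already satisfy the constraint; a constructive algorithm would only be needed for much tighter constraints.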

Language ability
Measures of children's receptive language ability were collected using two sub-tests of the Clinical Evaluation of Language Fundamentals®-Preschool-2 (CELF-Preschool-2; Wiig, Secord, & Semel, 2004): "Linguistic Concepts" and "Sentence Structure". The sub-test "Linguistic Concepts" requires the child to follow directions of increasing length and complexity (e.g., "Point to either of the monkeys and all of the tigers."). The sub-test "Sentence Structure" is a forced-choice picture selection task that tests the child's comprehension of sentences of increasing length and complexity (e.g., "The man who sits under the tree is wearing a hat."). Each sub-test lasted approximately 5 min.

Executive control
Children's executive control was tested using two tasks: the "Day/Night task" (Gerstadt, Hong, & Diamond, 1994), and the dimensional change card sort (DCCS) task (Zelazo, 2006). In the Day/Night task, children are instructed to say "day" when they are shown a card with a picture of a moon on it, and to say "night" when shown a card with a picture of a sun on it. The task taps into children's ability to inhibit the intuitive response (e.g., to say "night" when they see a picture of a moon). In the DCCS task, children are required to sort a series of bivalent test cards, first (pre-switch phase) according to one dimension (colour), and then (post-switch phase) according to the other (shape). The task taps into children's flexibility to switch their attention to a different dimension. Both tasks together took about 5 min (16 trials in the Day/Night task, 12 trials in the DCCS task).

Memory
Phonological and verbal short-term memory was tested using three tasks, taken from the Early Repetition Battery® (ERB; Seeff-Gabriel, Chiat, & Roy, 2008): word repetition and non-word repetition (which are combined into the "Preschool Repetition Test", PSRep), and "Sentence Imitation Test" (SIT). All three tasks together took between 5 and 10 min.

Causality
Children's understanding of the temporal priority principle (i.e., the principle that causes must precede their effects) was tested using a modified version of the set-up used by Rankin and McCormack (2013). Children have to decide which one of two events (A, B) causes an effect (E). In the task, children observe one event (A), an effect (E), and then another event (B). The events A and B are marbles rolling down runways, and the effect E is the ringing of a bell. There were four experimental trials. The task took about 5 min.

Footnote 2: One reviewer remarked that, while it is rather unlikely, using the word "after" in the instructions might have positively impacted the children's performance. The results suggest that this was not the case, as the children's performance on after was worse than with before.

Predictions and analyses
Based on the four accounts outlined in the introduction, we list a number of different hypotheses regarding children's performance accuracy in the sentence comprehension task:
1. Iconic clause orders are comprehended better/acquired earlier than non-iconic clause orders. (semantic account)
2. Before-sentences are comprehended better/acquired earlier than after-sentences. (semantic account)
3. Main-subordinate orders are comprehended better/acquired earlier than subordinate-main orders. (syntactic account)
4. Because- and if-sentences are comprehended better/acquired earlier than after- and before-sentences. (frequency-based account)
5. Frequent connective-clause order combinations are comprehended better/acquired earlier than infrequent ones. (frequency-based account)
6. Sentences with transitive verbs are comprehended better/acquired earlier than sentences with intransitive verbs. (frequency-based account)
7. Memory should make an independent contribution to performance, in that children with higher memory scores perform better than children with lower memory scores. (capacity-constrained account)

The accounts do not make explicit predictions about the speed of processing (response times), but it seems reasonable to assume that those structures that are easier to comprehend would also be processed faster.

Results
A total of 3907 responses were recorded. After screening of the data for deviations, the data of one child participant was removed, because he had consistently touched the top right-hand corner of the touchscreen, and also confirmed this when asked about it after the experiment. As a result, 48 responses (1% of the data) were excluded.

Analysis strategy
We first present the results for the sentence comprehension task (accuracy and response times). We then present the results (raw scores and standardised scores, where applicable) for the other tasks, and test if the individual difference scores in those tasks explain performance over and above the effects of our experimental manipulations.
For accuracy and response times (RTs), a series of (Generalized) linear mixed effect models (GLMMs; Baayen, Davidson, & Bates, 2008) was fitted to the data using R (R Core Team, 2016), version 3.3.1. We used glmer for the binomial accuracy dependent variable, and lmer for the continuous response times (RT) dependent variable, both from the R package lme4 (Bates, Maechler, Bolker, & Walker, 2015). We used the R packages lmerTest (Kuznetsova, Brockhoff, & Bojesen Christensen, 2016) and pbkrtest (Halekoh & Højsgaard, 2014) for the calculation of p-values for lmer models. (G)LMMs allow incorporating both fixed effects (experimental manipulations) and random effects (variation specific to individual participants and individual items). Following Bates, Kliegl, Vasishth, and Baayen's (2015) recommendations, we added fixed and random effects incrementally to a minimal model, and tested if the inclusion of an additional term was justified using the likelihood ratio test for model comparisons (Pinheiro & Bates, 2000), and pruned non-significant effects, unless they were part of a significant interaction. All final models contained random intercepts for participants and items. In addition, we ran t-tests to test if performance for the subgroups was above chance. For all other tasks (with the exception of the causality task) we ran simple correlations between (centred) test scores and mean accuracy and RT, respectively.
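The model-building step above rests on the likelihood ratio test for nested models. As a rough illustration of the arithmetic only (the analyses themselves were run in R with lme4, where anova() on two fitted models performs this comparison), the sketch below handles the special case in which the larger model adds exactly one parameter. The function name and the log-likelihood values are hypothetical and are not taken from the paper.

```python
import math

def likelihood_ratio_test_df1(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p_value). For 1 degree of freedom the chi-square
    survival function has the closed form erfc(sqrt(x / 2)).
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods, for illustration only
stat, p = likelihood_ratio_test_df1(loglik_reduced=-1250.0, loglik_full=-1246.5)
keep_term = p < 0.05  # retain the added term only if the fit improves significantly
```

Terms that add more than one parameter (for example a factor with several levels) need the chi-square distribution with the matching degrees of freedom, which this one-parameter shortcut does not cover.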
In addition, we performed Bayesian analyses. The reason for this is that conventional significance tests are designed to reject the null hypothesis. However, if the null hypothesis is true, p-values do not converge to any limit value, and all p-values are all equally likely (Rouder, Speckman, Sun, Morey, & Iverson, 2009). Non-significant results therefore do not allow for inference of the truth of the null hypothesis (see e.g., Dienes, 2014). Bayesian analyses, in contrast, provide information about the strength of statistical evidence in favour of either the alternative hypothesis or the null hypothesis. Bayes factors provide the relative probability of the data under the two hypotheses. For example, a Bayes factor of 2 means that the data are two times more likely under the alternative hypothesis (H A ) than they are under the null hypothesis. Similarly, two statistical models can be compared directly with each other, and the strength of the evidence for one model (that includes a given main effect or interaction) over the other (that does not contain this effect or interaction) can be determined. An overview of a common textual interpretation of Bayes factor values is presented in Table 5.
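To make the verbal labels concrete, the following sketch maps a Bayes factor onto a Jeffreys-style scale. The cut-offs (3, 10, 30, 100) and the label wording are assumptions: Table 5 is not reproduced in this section, so the thresholds are inferred from the labels attached to the values reported in the text rather than copied from it. The function name is hypothetical.

```python
def describe_bayes_factor(bf10):
    """Return (favoured hypothesis, verbal label) for a Bayes factor BF10.

    BF10 > 1 favours the alternative hypothesis, BF10 < 1 favours the null.
    A value of exactly 1 is reported as anecdotal evidence for the alternative.
    Thresholds are an assumed Jeffreys-style scale, not copied from Table 5.
    """
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    favoured = "alternative" if bf10 >= 1 else "null"
    strength = bf10 if bf10 >= 1 else 1.0 / bf10
    if strength < 3:
        label = "anecdotal"
    elif strength < 10:
        label = "substantial"
    elif strength < 30:
        label = "strong"
    elif strength < 100:
        label = "very strong"
    else:
        label = "decisive"
    return favoured, label
```

Under these assumed cut-offs, a Bayes factor of 2 (the example in the text) counts as anecdotal evidence for the alternative hypothesis, and a Bayes factor of 1/23 counts as strong evidence for the null.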
We used Bayesian linear regression from the BayesFactor package (Morey, Rouder, & Jamil, 2015). This type of analysis allows comparing a number of different models and determining the model under which the data are most likely (that is, the model with the highest Bayes factor), and the incorporation of random factors (participant, item). In line with recommendations by Morey and Rouder (2011) we used a Cauchy prior with scale parameter 1/√2 for the standardised effect size. Cauchy priors are relatively wide and symmetric around zero, which means that the data quickly overwhelms the prior (Morey & Wagenmakers, 2014: 123). In addition, we used Bayesian t-tests (from the BayesFactor package) and Bayesian correlations from the BayesMed package (Nuijten, Wetzels, Matzke, Dolan, & Wagenmakers, 2014) to complement the traditional analysis outlined above.

Accuracy
Summary data, together with the adult comparison data, are shown in Fig. 1.
The mean accuracy in the four-year-old group was 58.3%. The five-year-olds' mean accuracy was higher, at 63.2%. Adults responded correctly in 97.7% of all trials. The summary of the final (traditional) mixed-effects model is shown in Table 6. It shows that there were no significant main effects of AgeGroup, Type, or ClauseOrder, but there were significant interactions of AgeGroup and Type, AgeGroup and ClauseOrder, as well as a three-way interaction of AgeGroup, Type, and ClauseOrder. VerbType was not a significant factor. The significant interactions can be interpreted as follows. The five-year-olds performed significantly better than the four-year-olds with before-sentences (71.3% vs. 61.7%), and also with sentences in subordinate-main orders overall (69.4% vs. 61.6%). However, for before-sentences, the five-year-olds' performance with sentences in subordinate-main order was significantly worse than in main-subordinate order (66.7% vs. 76%). This means that the five-year-olds were generally better with sentences in iconic clause order (subordinate-main for after, because, if, and main-subordinate for before). Adults performed at ceiling. (Note, however, that there were a few errors with if- and because-sentences, which were due to one particular item. We return to this in the discussion.)

The results of the GLMM were corroborated by the Bayesian analysis. The model under which the data were most likely included the same main effects and interactions (Bayes factor: > 60 million, "decisive evidence", compared to only the intercept). In fact, the data were 72 times more likely under the model that included the three-way interaction ("very strong evidence") than under the model that did not include this three-way interaction.
While the five-year-olds' performance with before-sentences in general and with all other types in subordinate-main order was clearly above chance, it is possible that the four-year-olds overall, and the five-year-olds in the main-subordinate conditions of the other connectives (after, because, if), were at chance level, which would explain the absence of a main effect of AgeGroup. We tested each age group's performance in the eight conditions using one-tailed t-tests and Bayesian t-tests. As statistical significance in null hypothesis testing depends on the number of intended analyses, it is necessary to correct for multiple comparisons. Using Bonferroni correction, adjusting for 18 comparisons (one for each condition, plus two overall) yielded a significance level of 0.05/18 = 0.0028. A correction for multiple comparisons is not necessary for Bayesian t-tests (Dienes, 2011). The results are presented in Table 7 (four-year-olds) and Table 8 (five-year-olds).
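The corrected significance level can be reproduced with a few lines of arithmetic. The breakdown of the 18 comparisons into 8 conditions for each of the two child groups plus two overall tests is an assumed reading of the description above, and the function and variable names are hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / n_comparisons

n_conditions = 8   # 4 connectives x 2 clause orders
n_age_groups = 2   # assumed: four-year-olds and five-year-olds
n_comparisons = n_conditions * n_age_groups + 2  # plus two overall tests
threshold = bonferroni_alpha(0.05, n_comparisons)

def is_significant(p_value, threshold=threshold):
    """A test counts as significant only if p falls below the corrected level."""
    return p_value < threshold
```

Rounded to four decimal places, the threshold matches the 0.0028 reported in the text, so a p-value of 0.01, for instance, would not count as significant after correction.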
The t-tests show that the four-year-olds' performance overall was above chance, but this emerges only when all conditions are combined; none of the individual sentence types were above chance after controlling for multiple comparisons. While the p-values are not statistically significant after correcting for multiple comparisons, the Bayes factors provide more information: They show that there is "anecdotal evidence" for above-chance performance with because-sentences in the four-year-olds, which is likely to be the reason for their above-chance performance overall. In addition, the Bayes factors show that there is "substantial evidence" for an at-chance performance of the four-year-olds in all after-sentences, in before-sentences in main-sub order, and in if-sentences in sub-main order, and "anecdotal evidence" for at-chance performance in before-sentences in sub-main order, and if-sentences in main-sub order. In addition, there is evidence that the five-year-olds' performance in main-sub ordered sentences was at chance for after-, because- and if-sentences.
In summary, four-year-olds showed only a very fragile understanding of complex sentences on this task. Five-year-olds showed a better understanding of sentences that were in iconic clause order, and for before-sentences overall.

Response times
For the analyses of RTs, only correct responses were analysed (N = 2441). After inspection of the data, we removed outliers using the following criteria: For children, we excluded all responses that were shorter than 300 ms or longer than 20,000 ms (99 responses, 5.9% of the data), as it is unlikely that shorter or longer RTs reflect processing of the target stimuli. For adults, we excluded all responses that were shorter than 150 ms or longer than 6000 ms (17 responses, 3.6% of the data). Overall, 68% of the data from the full data set were included (50% of the 4-year-olds' data, 59% of the 5-year-olds' data, and 94% of the adult data).
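The inclusion criteria for the RT analysis amount to a simple filter, sketched below. The cut-offs are those stated above; the data layout, the names, and the example trials are hypothetical, and treating a response exactly at a cut-off as included is an assumption.

```python
# Lower and upper RT cut-offs in ms, as stated in the text
RT_BOUNDS_MS = {"child": (300, 20_000), "adult": (150, 6_000)}

def keep_response(group, rt_ms, correct):
    """Keep only correct responses whose RT lies within the group's bounds."""
    low, high = RT_BOUNDS_MS[group]
    return correct and low <= rt_ms <= high

# Hypothetical trials: (group, RT in ms, response correct?)
trials = [
    ("child", 250, True),     # too fast for a child
    ("child", 5177, True),    # kept
    ("child", 3278, False),   # incorrect, so excluded from the RT analysis
    ("adult", 1038, True),    # kept
    ("adult", 7200, True),    # too slow for an adult
]
included = [t for t in trials if keep_response(*t)]
```

Keeping the bounds in one lookup table makes it easy to report or change the criteria per group without touching the filtering logic.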
The RT data of all age groups are visualised in Fig. 2. The four-year-olds' mean response time was 5177 ms, the five-year-olds' was 3278 ms, and the adults' 1038 ms.
The summary of the final model for the child groups is shown in Table 9. In addition to random intercepts for participants and items, the model also contained by-participant slopes for Type. ClauseOrder and VerbType were not significant factors, but AgeGroup and Type were. There were no significant interactions.
The model was corroborated by the Bayesian analysis: The model under which the data were most likely was the one that contained only AgeGroup and Type as factors (Bayes factor for the data under this model: 5.7, "substantial evidence"). The data were about 19 times more likely under this model than under a model that also included ClauseOrder. This provides strong evidence that clause order was not a factor that affected children's response times.
Looking at the effects in the model in Table 9, it can be seen that the five-year-olds responded significantly faster than the four-year-olds. Furthermore, responses to because- and if-sentences were significantly slower than responses to after- and before-sentences.
The summary for the model for the adult control group is presented in Table 10. The only significant factor was Type. Adults responded to before-sentences significantly faster than to any other sentence-type. However, the Bayesian analysis indicated that the data is about four times more likely under a model with only Participant and Item as random factors than under a model that also contains Type as factor.
In summary, while neither VerbType nor ClauseOrder had an effect on participants' reaction times, Type did: Children had significantly slower responses with because- and if-sentences. For adults, it may be the case that before-sentences are responded to more quickly, but the results of the two analyses (traditional and Bayesian) are ambiguous.

Interim discussion
In the introduction, we presented four different theoretical accounts that have been put forward to explain and predict the processing of complex sentences. The semantic account predicts that children will perform better with iconic sentences, and that before-sentences will be acquired earlier. The syntactic account predicts that sentences in main-subordinate orders are easier to process. The frequency-based account predicts that because- and if-clauses should be acquired earlier/more easily processed, and that for a given connective, performance should be better with the more frequently occurring clause order. In addition, sentences with transitive verbs should be easier than sentences with intransitive verbs. Finally, the capacity-constrained account predicts that individuals with better short-term memory skills should perform better generally.
In terms of accuracy, the results showed that while the four-year-olds performed above chance overall, they had only a fragile understanding of the complex sentences. The five-year-olds, in contrast, showed a much better understanding of sentences in iconic clause order, and of before-sentences overall. These findings thus support hypotheses 1 and 2 from the semantic account (see Section 2.3 above), but not hypotheses 3, 4, 5, and 6 from the syntactic and the frequency-based accounts, respectively. In the next section, we turn to the possible role of memory to test the prediction made by the capacity-constrained account (hypothesis 7). In addition, we investigate whether individual variation in general language ability and/or executive function is related to complex sentence comprehension, and, if so, whether it can explain any additional variance in the children's performance.

Other tasks
We first present descriptive statistics for all other tests that were administered. We then test if any of the scores in the memory, language, and executive function tasks are significantly (and with at least substantial evidence) correlated with mean accuracy and/or mean RTs. Those scores that are significantly, and with substantial evidence, correlated with these overall measures are then entered into the optimal statistical models obtained in the analyses above (see Section 3.2).
3.4.1. Descriptive statistics
3.4.1.1. Standardised language and memory tasks. The means and standard deviations of the standardised scores for the CELF and ERB sub-tasks for both age groups are presented in Table 11.
The means and standard deviations indicate that each group was performing at an age-appropriate level in all of the tasks.

Executive function tasks.
On the Day/Night task, out of a maximum of 16 correct trials, the mean in the four-year-old group was 11.3 correct responses (SD = 4.3), and 12 (SD = 4.1) in the five-year-old group. In the post-switch phase of the DCCS task, where a maximum of six correct trials are possible, four-year-olds achieved on average 3.6 correct (SD = 2.7), and five-year-olds 4.4 (SD = 2.4). It should be noted, however, that the means are not necessarily informative, because the distribution tends to be bi-modal (children get all trials either wrong or right), which was also the case here. While the four-year-olds were approximately split between 0 and 6 correct responses, the majority of the five-year-olds got all trials correct (see Fig. D1 in Appendix D).
3.4.1.3. Causality task. In both age groups, the mode for correct trials was four (the maximum number of correct trials) indicating that the children showed an understanding of the temporal priority principle.

Correlations with mean accuracy and mean RT
We tested correlations between the z-scores of the language, memory, and executive function tasks and mean accuracy and mean RT scores using standard correlations and Bayesian correlations. The results (tables and corresponding scatterplots) can be found in Appendix D.
Of the six tasks, five were significantly positively correlated with mean accuracy: the CELF Linguistic Concepts score, the CELF Sentence Structure score, the ERB Preschool Repetition test score, the ERB Sentence Imitation test score, and the DCCS post-switch test score. Only the Day/Night score was not significantly correlated with mean accuracy. The Bayes factors obtained through the Bayesian correlation indicate that there was extreme evidence for a correlation with the CELF Linguistic Concepts score, substantial evidence for a correlation with the CELF Sentence Structure test score, and strong evidence for a correlation with the ERB Sentence Imitation. For the DCCS post-switch score, there was only anecdotal evidence for a positive correlation, while there was anecdotal evidence for no correlation between mean accuracy and the ERB Preschool Repetition score, and strong evidence for no correlation between mean accuracy and the Day/Night task score. Overall then, children who scored higher on one of the memory tasks (ERB Sentence Imitation) and the standardised language tests (CELF Linguistic Concepts, CELF Sentence Structure) showed better comprehension in the connective comprehension task than children who scored lower.
Three test scores were significantly negatively correlated with response times: the DCCS post-switch phase score, the CELF Linguistic Concepts test score, and the CELF Sentence Structure test score. However, there was strong evidence only for the correlation with the Linguistic Concepts score. The evidence for the correlation with the CELF Sentence Structure test score and the DCCS post-switch phase score was only anecdotal. In addition, there was substantial evidence for the lack of a correlation between mean RTs and the Day/Night test scores, and the ERB Preschool Repetition test score. Thus, overall, only the CELF Linguistic Concepts score was strongly negatively correlated with the speed of responses, that is, higher CELF scores were correlated with faster response times.

Influence on accuracy and response times
On the basis of the results of the correlation tests, the CELF Linguistic Concepts score, the CELF Sentence Structure score, and the two ERB scores (Preschool Repetition and Sentence Imitation), which serve as indicators for short-term memory, were entered into the optimal model for the prediction of accuracy in the connective comprehension task (see Section 3.2.1). Recall that the capacity-constrained account predicts that memory capacity should make an independent contribution to children's performance in the comprehension experiment. Similarly, the CELF Linguistic Concepts score was added to the optimal model for the prediction of response times in the connective comprehension task (see Section 3.2.2).
Of the four predictors added to the Accuracy model, only one remained significant and was kept in the model: the CELF Linguistic Concepts score (see Table 12). However, the more complex models that included these additional factors did not converge, a problem that has been noted for mixed-effect models that have multi-level factors (Eager & Roy, 2017). The Bayesian analysis, which did not suffer from non-convergence problems, suggested that the data were 1.5 times more likely under the original model than under the model that included the CELF Linguistic Concepts score ("anecdotal evidence"), and about 23 times more likely under the original model than under the one that included the two memory-related scores, ERB PSRep and ERB Sentence Imitation ("strong evidence"). (For a visualisation, see Fig. D4 in Appendix D.) Standardised memory or language ability scores thus did not explain any additional variation in the accuracy data, over and above the variation that was explained by the interaction of the experimental factors AgeGroup, Type, and ClauseOrder.
For response times, the CELF Linguistic Concepts score was a significant predictor. Children who scored higher on the language test had significantly shorter response times than children who scored lower (see Table 13), suggesting that there may be an independent contribution of general language ability to response times. However, the data were about 1.8 times more likely under the Bayesian model without this additional predictor than under the one that included it, which suggests that the contribution of the CELF scores to variation in reaction times may be relatively small. In summary, although several test scores were correlated with task performance (positively with mean accuracy, negatively with mean response times), none of those predicted any additional variance after accounting for the influence of the experimental factors. In particular, we did not find any evidence for an independent contribution of memory to performance in the connective comprehension task, disconfirming hypothesis 7.

Discussion
The aim of this study was to test hypotheses predicted by four different accounts regarding children's processing of complex sentences with the connectives after, before, because, and if. In what follows, we first argue that the data support the semantic account best. In the light of the results, we then go on to consider in more detail the role of semantic complexity on the one hand and input frequency on the other. Next we address the production-comprehension asymmetry suggested by our data, before discussing what the results say about the role of individual differences generally, and short-term memory in particular, in language comprehension. In the final part of the discussion, we lay out what it takes to construct a coherent mental model from complex sentences, relating the present research to the wider context of temporal-causal reasoning and the relationship between language and cognitive development.

Iconicity as the key factor in complex sentence comprehension
The children's performance in terms of accuracy is mostly consistent with Clark's (1971) semantic account. The five-year-old children showed a better understanding of sentences in which the order of events in the sentence matched the order of events in the real world (iconic sentences). In addition, they showed better comprehension of before-sentences compared to after-sentences, and in fact also compared to because- and if-sentences. Four-year-olds, in contrast, while being above chance overall, showed only a very limited understanding of complex sentences. Our results add to the growing body of evidence that children expect that language directly maps onto the events in the real world, and experience comprehension problems when this is not the case (Blything & Cain, 2016; Blything et al., 2015; Emerson, 1979; Feagans, 1980; French & Brown, 1977; Stevenson & Pollitt, 1987; Trosborg, 1982). Importantly, our study is the first one to extend this finding to both because- and if-sentences, suggesting that this is a general principle in children's processing of complex sentences, rather than one that is only employed with temporal clauses. It should be noted, however, that while the error rates for non-iconic sentences were higher than those for iconic sentences, children did not consistently misinterpret non-iconic sentences as iconic; with the exception of before-sentences (which we discuss next), performance was at chance. This may indicate that children find non-iconic sentences uninterpretable, which leads them to choose randomly between two options, rather than imposing an iconic interpretation on every sentence.
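As a reminder of how iconicity follows from connective and clause order in the test sentences, the classification can be sketched as follows. This is a minimal illustration, not part of the study's materials or analysis: the function name `is_iconic` is hypothetical, and the sketch assumes that each sentence contains exactly one of the four connectives and that a sentence-initial connective signals subordinate-main order.

```python
import re

CONNECTIVES = ("after", "before", "because", "if")
_CONNECTIVE_RE = re.compile(r"\b(after|before|because|if)\b", re.IGNORECASE)


def is_iconic(sentence: str) -> bool:
    """Return True if the clause order mirrors the order of events.

    For after, because, and if, the subordinate clause describes the
    earlier event (or cause/condition), so subordinate-main is iconic.
    For before, the subordinate clause describes the later event,
    so main-subordinate is iconic.
    """
    match = _CONNECTIVE_RE.search(sentence)
    if match is None:
        raise ValueError("No supported connective found.")
    connective = match.group(1).lower()
    # Assumption: a sentence that starts with the connective is subordinate-main.
    subordinate_first = match.start() == 0
    return subordinate_first != (connective == "before")


examples = [
    "After she paints the old fence, she hoovers the house.",
    "Before she paints the old fence, she hoovers the house.",
    "He drinks some water, before he eats a green pear.",
    "He misses the bus, because he rides his old bike.",
]
for s in examples:
    print(is_iconic(s), s)
```

The example sentences are taken from Table A1; the first and third are iconic, the second and fourth are not.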

Semantic complexity vs. input frequency
Also in support of Clark's semantic account, we found a clear facilitative effect for before-sentences, in both clause orders. However, we suggest that this is not due to differences in semantic features, but rather due to a confluence of factors, including frequency and syntactic form. Were it the case that children initially interpreted after-sentences as before-sentences, as suggested by Clark, they should have performed much worse on after-sentences than they did. Instead, these results could suggest that before has advantages over after in terms of both its semantic transparency and how often it is used as a connective. Although both before and after are used more often in other constructions than as temporal connectives, the meaning of before is always either spatial ("to appear before the court") or temporal, with clear similarities between the two. The meaning of after, however, is often more opaque, as for example in phrasal verbs ("to look after", "to inquire after"). In addition, before is used in other constructions only about 1.5 times more often than as a temporal connective in complex sentences, whereas after occurs more than four times more often in other constructions in both adult written and spoken language (Leech, Rayson, & Wilson, 2014), and in child-directed speech. In other words, before has a more consistent form-meaning mapping. For the parser, this means that there is more uncertainty attached to after with respect to the construction that is currently being processed, and as a consequence a higher chance of misanalysis. Children's superior performance with iconic before-sentences can then be explained by the fact that these combine a lower-uncertainty word (before) with an iconic clause order that is main-subordinate, unlike the other three connectives.
Our results show clearly that syntactic form in terms of the distance between the subordinator and its resolution is not the determining factor in children's processing of complex sentences, contrary to the syntactic account's prediction. However, in combination with a more consistent form-meaning mapping and iconicity, the shorter recognition domain of the main-subordinate clause order may give iconic before-sentences an "edge" over the other sentence types. Iconic before-sentences are the only sentences that can be processed incrementally, without re-analysis. We are currently testing the hypothesis that a more consistent form-meaning mapping makes before-sentences easier for English children by conducting the same experiment in a language that is similar syntactically, but has different relative frequencies for using the different words as connectives: German. If the hypothesis is correct, the advantage of (non-iconic) before-type sentences should then disappear. If, on the other hand, the effect persists, this would support a semantic explanation along the lines of Clark (1971).
If (relative) frequency does have some role to play in complex sentence comprehension after all, then the question is: Why were children in our experiment not better at comprehending because- and if-clauses, which are much more frequent in English than after- and before-sentences? In the present study, the children showed in the causality task that they did understand that causes must precede effects, and the older age group showed an understanding of because- and if-sentences in iconic order. But despite understanding some aspects of causality, performance was relatively low. Furthermore, children of both age groups were significantly slower in responding to because- and if-sentences compared to after- and before-sentences.
One possible explanation is the sentences' higher semantic complexity. Understanding isolated because- and if-sentences requires an understanding of both temporality and causality, purely through language, whereas before- and after-sentences rely on temporality only. Furthermore, causality may be semantically more complex than temporality: It has been observed that in production, children use the connective and to express semantic relations in the order of additive < temporal < causal < adversative (L. Bloom, Lahey, Hood, Lifter, & Fiess, 1980), which have been said to be of increasing semantic complexity, following the notion of cumulative complexity introduced by Brown (1973). But if the cumulative complexity assumption holds also for comprehension, it remains unclear why there was no difference in accuracy between the semantically simpler after-sentences on the one hand, and the semantically more complex because- and if-sentences on the other. Interestingly, the response time data are in line with the assumption of cumulative complexity: Responses to because- and if-sentences were slower than to after- and before-sentences. This suggests that processing two clauses that are causally linked takes longer than processing clauses that are only temporally linked. There is thus an interesting disconnect between the accuracy data, which showed an advantage for iconic sentences, and for before-sentences in general, and the RT data, which showed an advantage for temporal clauses. It is possible that children perceive temporal sentences to be easier (and thus react more quickly), even if their actual levels of accuracy indicate comprehension difficulties, at least for (non-iconic) after-sentences. Processing causal sentences may take more time, but it does not necessarily lead to more errors.

Production-comprehension asymmetry
An argument against the cumulative complexity account as an explanation is that children also start producing because- and if-sentences before they start producing after- and before-sentences (e.g., Diessel, 2004), suggesting that they find because- and if-sentences easier. Production-comprehension asymmetries raise interesting questions in language acquisition research, and different accounts have been put forward (see e.g., Grimm, Müller, Hamann, & Ruigendijk, 2011). Here we suggest two possible explanations for this mismatch. First, it may be that producing because- and if-sentences in natural interaction puts different demands on children than comprehending them in an experiment. In spontaneous production, children go from intended meaning to form, all within a supporting linguistic and non-linguistic context (usually the here-and-now). They already know what the relation is between two events they want to express. They can also avoid more complex forms and use alternative strategies (e.g., stringing clauses together using "and then" to express temporal order, instead of an after-/before-sentence). In comprehension, and in particular in experiments that do not provide any additional context, children need to rely purely on form to understand the meaning (we discuss the requirements for constructing meaning below). Second, it may be that children are less familiar with because- and if-sentences being used to express physical causality. Recall that in everyday conversation, speakers use because-clauses primarily to give reasons for a preceding speech act ("You can't have sweets now because we're having dinner soon"), and if-clauses often provide a conceptual framework for a larger chunk of discourse ("If I ever win the lottery, I have plenty ideas of what to do with the money.").
On the other hand, both experimental and observational studies have found that at least Dutch children are able to express content-type causality from three years onwards, suggesting this domain is not uncommon for young children (Evers-Vermeul & Sanders, 2011). Future studies should investigate how providing more context or using other types of causality affects children's comprehension of causal sentences.

Individual differences and memory
Turning now to the role of individual differences, we found that the accuracy data and the RT data showed similar patterns with respect to their relationship with individual measures of language ability, memory, and executive function (inhibition). Children with higher scores on these tasks achieved higher accuracy in the comprehension task, and responded more quickly. However, these factors did not explain any variation in performance after effects of age, type of sentence, and clause order were accounted for. In particular, we did not find any evidence for an independent contribution of memory, contrary to the predictions made by the capacity-constrained account. Note that not only did we not find any significant effect of memory; using a Bayesian approach, we found strong evidence against the role of memory and other measures in the models. It is possible that our measures (word and non-word repetition and sentence imitation) did not capture the type of memory that is central to complex sentence comprehension. Blything et al. (2015) and Blything and Cain (2016), who observed a memory effect, used a digit-span task. However, in view of the fact that the researchers who originally proposed the memory capacity-constrained account measured memory capacity using reading span (Just & Carpenter, 1992), we believe that with children, sentence imitation (with sentences of increasing length) is a comparable measure. Against this background, our results do not provide evidence for a significant role of individual differences in memory, executive function, and general language ability in complex sentence comprehension.
This contrasts with other studies that have found that variability in aspects such as working memory or executive function is associated with different language outcomes, even after controlling for age (e.g., Blything & Cain, 2016;White, Alexander, & Greenfield, 2017), but our findings are far from uncommon, as the picture is rather mixed (see Kidd, 2013 for a critical review of the role of working memory). Overall, our findings suggest that the ability to construct a coherent mental model from isolated complex sentences is not just a competence emerging from a combination of general language ability, memory, and executive function, but a distinct construct that cannot be captured with standardised tests.
What is this construct and how does it develop over time? We first discuss our results in relation to previous studies, before connecting them to the wider context of the development of temporal-causal reasoning, and the relationship between language and cognitive development.

Temporal-causal reasoning and the construction of mental event representations
In our data, four-year-olds showed only a rudimentary ability to process complex sentences in isolation, whereas the five-year-olds showed a more robust, albeit still incomplete, understanding. For before and after, this contrasts with some previous studies, which found above-chance performance at a slightly younger age, between three and four years (e.g., Blything et al., 2015). We attribute this difference to the fact that the task required that the children consider two explicit alternatives ("story" A and "story" B) before making a selection. As we discuss below, this requires that the listener have a stable mental representation of the events, which she can handle flexibly to reason about temporal and causal relations between them. For because and if, our findings are more in line with those of Amidon (1976), who found above-chance performance in her youngest age group (five years), and not with those of Emerson (1979) and Emerson and Gekoski (1980), who found children to comprehend because- and if-sentences only around the age of eight years.
Research on children's capacity to reason (non-linguistically) about temporal and causal relations between events using search and planning tasks has found that flexible temporal-causal reasoning develops around the age of five or six years (e.g., Lohse, Kalitschke, Ruthmann, & Rakoczy, 2015; McCormack & Hanley, 2011). The basic logic of the tasks is that participants need to mentally reconstruct or pre-construct a sequence of causally linked events in order to correctly infer a present or anticipated future state of the world (e.g., an object's location). While four-year-olds usually do not have problems understanding the temporal priority principle (Rankin & McCormack, 2013), as was also the case in the present study, it appears that they cannot perform in these search and planning tasks except under specific conditions, indicating that they lack the capacity to reason flexibly about temporal-causal relations. Specifically, younger children seem to be able to perform this task only when it refers to past events, but not when they have to mentally construct a sequence of events themselves to make inferences (McCormack & Hanley, 2011). Furthermore, younger children appear to require visible, positive evidence (e.g., a clear sign that an object had been used in a particular location) to infer a state of events (e.g., that the object must have been lost after it was used in that location). Older children, in contrast, can also use the absence of evidence to perform inferences (i.e., use counterfactual reasoning; Lohse et al., 2015).
How could this background help explain the difference between the current findings and those of Blything et al. (2015), who found that their youngest age group (three-to-four-year-olds) performed better with before- and after-sentences than the four-year-olds in the present study? In Blything et al.'s study, children watched short animated clips of the actions of both clauses of the complex sentence (e.g., eating a hotdog, putting shoes on) successively next to each other, which ended in a freeze frame. They then heard the prompt "Listen carefully and touch the thing Tom/Sue did first", followed by the sentence (e.g., "Before he ate the burger, he put on the sandals"). In contrast, in the present study, children first heard the prompt, followed by the sentence (e.g., "After she paints the old fence, she hoovers the house"), and then saw the two picture stories. The children in Blything et al.'s study were aware that they had to pay attention only to what happened first, and they knew what the two possible actions were before even hearing the sentence. The children in the present study had to first construct a mental representation of the chain of events from language only, without any initial visual support, and then needed to check this model against two possible laid-out sequences. The research on temporal-causal reasoning outlined above suggests that creating a mental sequence "from scratch" may be challenging for four-year-olds, so we would expect those representations to be more fragile than those that are supported visually from the start, and they may not yet be stable enough to reason about in order to make a selection on the screen (e.g., "if this is what happens, then the story at the top must be the right one").
We suggest that the task used in the present study is actually a closer match to what listeners typically have to do: construct a mental model from the speech input alone, and use that model subsequently, for example to make a decision (e.g., "Before you do your homework, put your clothes in the laundry basket": what needs to happen now?).

The relationship between language and cognitive development
An important question arising from these different strands of research concerns the mutual influence of language and cognition. Is it the development of temporal-causal reasoning capacities that allows children to understand complex sentences describing chains of events in different ways (iconic and non-iconic)? Or is it children's situated language experience that leads them to develop more flexible representations of events? For example, a child may encounter a non-iconic sentence in a situation where the real-world context makes it clear what the order of events is ("Before you go to bed you need to brush your teeth"), which enables her to understand that language can describe events in non-iconic ways, which in turn leads to a more abstract and flexible understanding of how two (or more) events are linked. It seems likely that, as in other areas of language and cognitive development (e.g., complex complement clauses and theory of mind, De Villiers, 2007), a bidirectional relationship exists, with developments in each domain supporting the other.
In the context of causal reasoning, it is interesting to note that the few errors that the adults made in the present study occurred almost exclusively (eight out of eleven) with one item. The test sentence was "If/because she dives in the pool, she feels really warm", and the correct story was one showing the protagonist diving into a heated pool in a wintery landscape outside and enjoying the warmth, whereas the foil sequence shows her standing in the sun in the summer and then diving into a (cold) pool. It appears that several adult participants interpreted the sentence in an epistemic way, in the sense of "If she dives in the pool then that must mean that she's feeling warm", which makes the foil the better match. This item did not stand out from the other items in the children's data, which suggests that this epistemic interpretation may not yet have been open to them. This would be in line with corpus studies of English, French, and Dutch child language, which have found that subjective causal relations appear later than objective relations (e.g., Evers-Vermeul & Sanders, 2011; Zufferey, Mak, & Sanders, 2015).
The five-year-olds in the present study were still far from adult-like in their performance. It is clear that complex sentence comprehension must undergo substantial development throughout the school years.
School education, and literacy training in particular, is likely to contribute to this development. Children are exposed to written texts and taught to pay attention to elements that link clauses and sentences with each other in order to understand the meaning of a text. This will also impact their spoken language comprehension. Furthermore, children will develop their understanding (and production) of other forms of causal language, in particular epistemic language. At this point it is still unclear what the role of the input (either spoken or written) may be in children's development of different forms of causal language.
This study investigated the role of syntax, semantics, frequency, and working memory in the comprehension of complex sentences involving adverbial clauses. To limit the availability of additional cues to meaning and therefore provide a relatively pure test, sentences were deliberately presented with minimal contextual support. Of course, in reality, complex sentences are typically used in discourse, and thus another question concerns how their processing is affected by information structure, or discourse pragmatics. It has been found that adult listeners find sentences in which given information precedes new information easier to process (Haviland & Clark, 1974) and there is an indication that young children (three to five years) prefer a given-before-new order in when-sentences containing a main and subordinate clause (Junge, Theakston, & Lieven, 2015). An interesting avenue for future studies would be to explore how information structure affects children's comprehension of different types of complex sentences, and to what extent such an effect may interact with the effect of iconicity that we found in our study.

Summary
In this paper, we provide the most comprehensive experimental study to date to evaluate four theoretical models of the factors underpinning children's abilities to comprehend complex sentences containing adverbial clauses. We found that children's comprehension was strongly influenced by semantic factors - the iconicity of the event-to-language mappings - and that their response times were influenced by the type of relation expressed (temporal vs. causal). We found that neither input frequency (frequency-based account), nor clause order (syntax account), nor working memory (capacity-constrained account) provided a good fit to the data. Our findings thus contribute to the development of more sophisticated models of sentence processing to apply through acquisition and into adulthood. Although the stimuli used in the present study were deliberately designed to be challenging, we would argue that they reflect the demands placed on children in everyday life, especially in academic contexts. We conclude that models of linguistic processing and representation must take into account how children's emerging linguistic understanding interacts with developments in other cognitive domains such as their ability to construct mental models and reason flexibly about them.

Appendix A
Table A1
Experimental sentences for the experimental Lists 1 and 3. Note that in List 3, all after-sentences from List 1 have been changed to before-sentences, and vice versa. In the same way, all because-sentences from List 1 were changed to if-sentences in List 3, and vice versa. Experimental lists 2 and 4 were created by swapping session 1 and 2 of List 1 and List 3, respectively.

Session | Sentence No. | Sentence List 1 | Sentence List 3
1 | 1 | After she paints the old fence, she hoovers the house. | Before she paints the old fence, she hoovers the house.
1 | 2 | After he sweeps the new floor, he watches TV. | Before he sweeps the new floor, he watches TV.
1 | 3 | He drinks some water, after he eats a green pear. | He drinks some water, before he eats a green pear.
1 | 4 | He laughs really hard, after he coughs a few times. | He laughs really hard, before he coughs a few times.
1 | 5 | She hides over there, after she runs over here. | She hides over there, before she runs over here.
1 | 6 | After she dances around, she bounces away. | Before she dances around, she bounces away.
1 | 7 | Before he reads his new book, he plays his big drum. | After he reads his new book, he plays his big drum.
1 | 8 | She takes a hot bath, before she draws a picture. | She takes a hot bath, after she draws a picture.
1 | 9 | She breaks her small train, before she builds a tower. | She breaks her small train, after she builds a tower.
1 | 10 | She hops up and down, before she crawls on the floor. | She hops up and down, after she crawls on the floor.
1 | 11 | Before he shouts out loudly, he drives away fast. | After he shouts out loudly, he drives away fast.
1 | 12 | Before he waves happily, he swims on his back. | After he waves happily, he swims on his back.
1 | 13 | Because she bangs her head hard, she closes her eyes. | If she bangs her head hard, she closes her eyes.
1 | 14 | Because he opens the door, he sees the snowman. | If he opens the door, he sees the snowman.
1 | 15 | He misses the bus, because he rides his old bike. | He misses the bus, if he rides his old bike.
1 | 16 | He cries really hard, because he trips suddenly. | He cries really hard, if he trips suddenly.
1 | 17 | She feels really warm, because she dives in the pool. | She feels really warm, if she dives in the pool.
1 | 18 | Because she looks at the sky, she slips to the ground. | If she looks at the sky, she slips to the ground.
1 | 19 | If he sings a happy song, he wins a nice cup. | Because he sings a happy song, he wins a nice cup.
1 | 20 | She finds her other shoe, if she cuts the long grass. | She finds her other shoe, because she cuts the long grass.
1 | 21 | She hears the doorbell, if she presses the button. | She hears the doorbell, because she presses the button.
1 | 22 | She wakes up in the night, if she talks to herself. | She wakes up in the night, because she talks to herself.
1 | 23 | If he sits down in his chair, he gets very bored. | Because he sits down in his chair, he gets very bored.
1 | 24 | If he sneezes lots of times, he falls in the field. | Because he sneezes lots of times, he falls in the field.
2 | 1 | She hoovers the house, after she paints the old fence. | She hoovers the house, before she paints the old fence.
2 | 2 | He watches TV, after he sweeps the new floor. | He watches TV, before he sweeps the new floor.
2 | 3 | After he eats a green pear, he drinks some water. | Before he eats a green pear, he drinks some water.
2 | 4 | After he coughs a few times, he laughs really hard. | Before he coughs a few times, he laughs really hard.
2 | 5 | After she runs over here, she hides over there. | Before she runs over here, she hides over there.
2 | 6 | She bounces away, after she dances around. | She bounces away, before she dances around.
2 | 7 | He plays his big drum, before he reads his new book. | He plays his big drum, after he reads his new book.
2 | 8 | Before she draws a picture, she takes a hot bath. | After she draws a picture, she takes a hot bath.
2 | 9 | Before she builds a tower, she breaks her small train. | After she builds a tower, she breaks her small train.
2 | 10 | Before she crawls on the floor, she hops up and down. | After she crawls on the floor, she hops up and down.
2 | 11 | He drives away fast, before he shouts out loudly. | He drives away fast, after he shouts out loudly.
2 | 12 | He swims on his back, before he waves happily. | He swims on his back, after he waves happily.
2 | 13 | She closes her eyes, because she bangs her head hard. | She closes her eyes, if she bangs her head hard.
2 | 14 | He sees the snowman, because he opens the door. | He sees the snowman, if he opens the door.
2 | 15 | Because he rides his old bike, he misses the bus. | If he rides his old bike, he misses the bus.
2 | 16 | Because he trips suddenly, he cries really hard. | If he trips suddenly, he cries really hard.
2 | 17 | Because she dives in the pool, she feels really warm. | If she dives in the pool, she feels really warm.
2 | 18 | She slips to the ground, because she looks at the sky. | She slips to the ground, if she looks at the sky.
2 | 19 | He wins a nice cup, if he sings a happy song. | He wins a nice cup, because he sings a happy song.
2 | 20 | If she cuts the long grass, she finds her other shoe. | Because she cuts the long grass, she finds her other shoe.
2 | 21 | If she presses the button, she hears the doorbell. | Because she presses the button, she hears the doorbell.
2 | 22 | If she talks to herself, she wakes up in the night. | Because she talks to herself, she wakes up in the night.
2 | 23 | He gets very bored, if he sits down in his chair. | He gets very bored, because he sits down in his chair.
2 | 24 | He falls in the field, if he sneezes lots of times. | He falls in the field, because he sneezes lots of times.
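
The list-construction rule described in the note to Table A1 (List 3 swaps after/before and because/if relative to List 1; Lists 2 and 4 swap the two sessions of Lists 1 and 3) can be sketched as follows. This is an illustrative reconstruction only: the function names `swap_connective` and `derive_lists` and the dictionary layout are hypothetical, and the sketch assumes that each sentence contains exactly one connective.

```python
import re

# Connective pairs swapped between List 1 and List 3 (from the table note).
SWAP = {"after": "before", "before": "after", "because": "if", "if": "because"}
_CONNECTIVE_RE = re.compile(r"\b(after|before|because|if)\b", re.IGNORECASE)


def swap_connective(sentence: str) -> str:
    """Replace the connective with its counterpart, keeping capitalisation."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        new = SWAP[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    # count=1: only the first (assumed only) connective is replaced.
    return _CONNECTIVE_RE.sub(repl, sentence, count=1)


def derive_lists(list1: dict) -> dict:
    """Build Lists 1-4 from List 1, given as {session number: [sentences]}."""
    list3 = {s: [swap_connective(x) for x in sents] for s, sents in list1.items()}
    list2 = {1: list1[2], 2: list1[1]}  # sessions of List 1 swapped
    list4 = {1: list3[2], 2: list3[1]}  # sessions of List 3 swapped
    return {1: list1, 2: list2, 3: list3, 4: list4}


# Two items per session, taken from Table A1, as a small usage example.
list1 = {
    1: ["After she paints the old fence, she hoovers the house.",
        "He misses the bus, because he rides his old bike."],
    2: ["She hoovers the house, after she paints the old fence.",
        "Because he rides his old bike, he misses the bus."],
}
lists = derive_lists(list1)
print(lists[3][1])
print(lists[4][1])
```

Because the swap preserves clause order, each item keeps its syntactic form across lists while its connective, and for after/before also its iconicity, changes.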