An experimental reassessment of complex NP islands with NP-scrambling in Japanese

There is little consensus in the Japanese syntax literature on the question of whether complex NPs with a noun complement headed by toyuu ‘that.say’ are islands for NP-scrambling dependencies. To explore this question, we conducted two acceptability judgment experiments using the factorial definition of islands to test the island status of noun complements, relative clauses (which are complex NPs, but uniformly considered islands in the literature), and coordinated NP structures (which are also uniformly considered islands in the literature). Our first experiment yielded clear evidence that relative clauses and coordinated NPs are islands, and that noun complements are not. Our second experiment replicated the relative clause and coordinated NP results, but yielded an inconclusive null result for noun complements. Taken together, our results suggest either that noun complements are not islands, or that they exhibit island effects that are subliminal and extremely small. We also investigated between- and within-participant variability in our results. We observe no evidence of increased between-participant variability for noun complements relative to other islands, and no increase of within-participant variability for noun complements relative to grammatical NP-scrambling, thus corroborating our conclusions. Our results have consequences for a number of issues that have been encoded in current syntactic theories of island effects, including the correlation between syntactic constituent complexity and island status (e.g., number of bounding nodes or phase heads), and the correlation between complementizer deletion and island status (e.g., the complement/adjunct distinction).


Introduction
There is considerable debate in the literature about the island status of complex NPs in Japanese NP-scrambling constructions. For example, Haig (1976) claims that complex NPs with a complement clause headed by toyuu 'that.say' (henceforth, noun complements) are not islands for NP-scrambling, whereas relative clauses are islands for NP-scrambling. In contrast, Saito (1985) claims that both noun complements and relative clauses are islands for NP-scrambling, but that the island effect of noun complements is smaller than the island effect of relative clauses. Recent experimental work has only seemed to add to this debate. Yano (2019), as part of a broader study of the effect of D-linking on islands in Japanese, tested noun complements (but not relative clauses) in two acceptability judgment experiments using the factorial definition of island effects. However, the two experiments investigating noun complements with non-D-linked NPs produced contradictory results: the first experiment revealed a (relatively small) island effect, whereas the second experiment revealed no island effect (see section 2 for additional discussion). This suggests a need for additional systematic data collection with both noun complements and relative clauses. Therefore, in this study, we present two additional judgment studies specifically designed to explore complex NP islands with NP-scrambling in Japanese (Experiments 1 and 2). We use the factorial definition of island effects to explore the status of NP-scrambling out of both noun complements and relative clauses, and for additional comparison of the size of the island effects (following Saito's 1985 suggestion), we also include NP-scrambling out of a coordinate structure, which is uncontroversially considered an island for NP-scrambling in Japanese (Harada 1977).
Our results provide support for previous studies in that relative clauses as well as coordinate structures are shown to yield large island effects with NP-scrambling in Japanese. However, our results for noun complements are less straightforward: while Experiment 1 yielded conclusive evidence that they are not islands, Experiment 2 yielded inconclusive results. These findings suggest either that noun complements in Japanese are not islands or that noun complements in Japanese yield subliminal island effects that are extremely small, too small to detect reliably even with large sample sizes. Under the latter possibility, island effects with noun complements are qualitatively distinct from relative clause island effects, which exhibit characteristics of typical island effects, i.e., a negative mean value for the island condition sentences and effect sizes large enough to be detectable at smaller sample sizes. Thus, the results of our experiments show that there is a clear difference between relative clauses and noun complements, at least in Japanese, and more broadly, between noun complements in Japanese and noun complements in other languages that have been tested using the factorial definition, such as English, Italian, and Norwegian (cf. Sprouse et al. 2011; Sprouse et al. 2012; Sprouse et al. 2016; Kush et al. 2018). We argue that our findings have direct consequences for most existing theories of island effects, for theories of the relationship between complementizer deletion and island status, and for theories of the relationship between the complexity of syntactic structure and island status, and we suggest that future studies should probe the properties of relative clauses and noun complements (cross-linguistically) along these dimensions.

Complex NPs and NP-scrambling in Japanese
In this section we provide a brief description of two types of complex NPs in Japanese that are the topic of this study and review previous claims about complex NP islands with NP-scrambling in Japanese.

Noun complements and relative clauses in Japanese
As the main empirical goal of the current study is to compare scrambling out of noun complements and relative clauses in Japanese, some discussion of the syntactic properties of noun complements and relative clauses in Japanese is in order.
Noun complements in this study are complex NPs headed by toyuu 'that.say' and nouns such as uwasa 'rumor' and shooko 'evidence' as in (1a-b) below.
(1) a. Taro
There are at least two other types of complex NPs in Japanese: those that are headed by no (2a) and those that are headed by koto (2b).
(2) a. Taro
This study focuses on toyuu complex NPs as in (1) for the following reasons. First, as discussed in the introduction, previous studies on scrambling out of complex NPs in Japanese focused on toyuu complex NPs, presumably because they represent the Japanese equivalents of noun complements in English (e.g., Nakau 1973). Second, although all three types of complex NPs in (1) and (2) exhibit the basic syntactic properties of NPs, such as being marked by a case marker and functioning as subjects and direct objects, there is a clear sense that nouns like uwasa 'rumor' and shooko 'evidence' in (1) are lexical while no and koto in (2) are functional: nouns like uwasa 'rumor' have a clear and identifiable meaning, whereas no and koto do not. According to Kuno (1973), the semantic/pragmatic contribution of no and koto is that the complex NPs headed by them represent propositions that the speaker presupposes to be true, suggesting that no and koto have come to assume specific functions. Relatedly, lexical nouns like uwasa 'rumor' can stand alone as full NPs, whereas no and koto cannot. Under the assumption that noun complements involve the head noun taking a clausal complement, the head noun in a noun complement must be a lexical item with the ability to thematically license its complement. Lexical nouns like uwasa 'rumor' and shooko 'evidence' fit this description, while no and koto do not.
Relative clauses are differentiated from noun complements in the following ways. First, unlike noun complements that involve nouns selecting a clausal complement, relative clauses are modifiers that delimit the reference of the modified NP (e.g., Andrews 2007). Thus, there is no thematic relation between a relative clause and the NP it modifies. Second, unlike noun complements whose embedded clauses represent complete propositions, as can be seen in (1a-b), relative clauses typically involve a gap, which is interpreted as co-referential with the modified NP.
(3) a. Taro
Third, evidence suggests that the structure of relative clauses in Japanese is less complex along certain dimensions than the structure of the embedded clauses inside noun complements. Tomioka (2015) points out that noun complements can embed a topic NP marked by -wa (4a) while relative clauses cannot (4b).
(4) a.
It has also been noted that relative clauses allow a non-episodic interpretation of verbs that normally entail a change-of-state, as in (5a) (e.g., Teramura 1982; Ogihara 2004).
(5a) is ambiguous between two interpretations. One interpretation is that what Taro asked for is a dry towel, while the alternative interpretation is that he asked for a dried towel, that is, a towel that was wet at some point in the past but has since dried. In (5b), in contrast, the embedded sentence can only have the latter interpretation, i.e., a towel underwent a change of state from 'not dry' to 'dry.' In this study, to clearly differentiate noun complements and relative clauses from each other, examples of noun complements always involve a noun preceded by an embedded clause marked by toyuu 'that.say,' while examples of relative clauses are never marked by toyuu 'that.say' and always involve a gap that is identified with the modified NP. 1
1 Relative clauses with a gap can also be followed by toyuu 'that.say'
Haig (1976) was one of the first theoretical studies to investigate complex NPs in Japanese, reporting that NP-scrambling out of a noun complement is acceptable (6a), while NP-scrambling out of a relative clause is not (6b). The judgments in (6) are from Haig (1976).
Saito (1985; 1987) made the more nuanced claim that both noun complements and relative clauses are islands, with noun complements being relatively more acceptable (7a) than relative clauses (7b). The judgments in (7) are from Saito (1985).
(Saito 1985: 246; (146a))
To the best of our knowledge, Yano (2019) is the first and only study to examine the acceptability of NP-scrambling out of complex NPs (specifically, noun complements) with formal acceptability judgment experiments. The goal of Yano (2019) was to examine whether D-linked NPs like sono shoosetsu 'that novel' undergo syntactic movement when they appear in a fronted position. Yano (2019) uses island effects as a diagnostic of movement.
To that end, Yano tested two island types: adjunct islands and noun complement islands. Yano (2019) tested both D-linked NPs (with sono 'that') as the target of investigation, and non-D-linked NPs (without sono 'that') as a baseline comparison. Here we focus exclusively on non-D-linked NPs as the effect of D-linked phrases, or lack thereof, is typically considered a separate topic of investigation (see Szabolcsi & Lohndal 2017 for a review of selective islands).
In the first experiment of Yano (2019), the sentences were presented in isolation. In the second experiment, the sentences were presented with a context sentence such as "The novel received the Naoki prize." to establish the fronted object in the discourse. Yano (2019) used the factorial definition of island effects, in which the presence of an island effect appears as a superadditive interaction of two (or more) factors that are themselves independent of the island effects (Sprouse 2007; Sprouse et al. 2011; Sprouse et al. 2012, a.o.). For the Yano (2019) experiments these factors were STRUCTURE, manipulating the structure of the embedded clause (either an island or a non-island), and WORD ORDER, manipulating the presence or absence of scrambling out of the embedded clause. An example set of the four conditions is given in (8).
(8) believe-NPST
'The commentator believes the news that the ghost-writer wrote the novel last year.' (Yano 2009: 5; (9))
The results of the two Yano (2019) experiments are equivocal. In the first experiment (no context), Yano found a small superadditive interaction indicative of a noun complement island effect. In the second experiment (with context), Yano found no superadditive interaction indicative of a noun complement island effect. Similar to the disagreement between Haig (1976) and Saito (1985; 1987), this leaves the status of noun complements under debate. One complicating issue is that the Yano (2019) results also showed very low acceptability even for grammatical scrambling out of non-island embedded clauses (a declarative CP). Yano notes that this could be due to a preference in Japanese that scrambled NPs be longer than the NPs that they are scrambled over, i.e., the long-before-short preference (Dryer 1980; Hawkins 1994; Yamashita & Chang 2001; Yamashita 2002; Omaki et al. 2020). The scrambled NPs in the Yano (2019) experiments are single-word NPs, which could have pushed the acceptability down.
This in turn could have reduced the size of the superadditive interactions (if the long-before-short preference is not additive with island effects, which is itself a potentially interesting observation that might merit future study).
The contradictory results for noun complements between Haig (1976) and Saito (1985; 1987), and between the two experiments in Yano (2019), suggest that additional systematic data collection is needed. To that end, here we report the results of two formal acceptability judgment experiments testing whether NP-scrambling out of complex NPs invokes island effects. Scrambling dependencies are conventionally used to test island effects in Japanese due to the lack of overt wh-movement in these languages (for a discussion of island effects involving wh-in-situ, see Sprouse et al. 2011; Kim & Goodall 2016; Tanaka & Schwartz 2017; Lu et al. 2020). But before we move on to discuss our experiments, some caveats are in order concerning some characteristics of scrambling and the design of our experiments. First, it has been argued that some instances of scrambling exhibit properties of A-bar-movement while others exhibit properties of A-movement. In our study, all instances of scrambling are long distance, i.e., they always cross a clausal boundary. Since the consensus in the literature is that long distance scrambling is A-bar-movement (e.g., Saito 1992; Yoshimura 1992; Nemoto 1993; Tada 1993; see Nemoto 1999 for a comprehensive review of the relevant literature), we assume that all the instances of scrambling examined in this study are instances of A-bar-movement. Second, unlike wh-movement and relativization, scrambling is optional and has no semantic consequences (e.g., Saito 1989; but see Miyagawa 2001 for a claim that local scrambling can be triggered by the EPP and is therefore obligatory). The fact that scrambling is an optional process raises questions about its motivations. Factors such as constituent weight (Yamashita & Chang 2001; Omaki et al. 2020) and information-structure status (Koizumi & Imamura 2017) have been shown to affect the production, acceptability, and processing of scrambling sentences.
The optional nature of scrambling raises the possibility that the effect of scrambling on acceptability judgments in non-island environments might be different from the effects of wh-movement and relativization on acceptability judgments in similar environments. However, examining whether or not the effect of scrambling is qualitatively and quantitatively different from that of the other types of A-bar dependencies requires a larger experiment that compares different types of A-bar dependencies. Therefore, it is beyond the scope of this study, which was designed to detect island effects with NP-scrambling. Finally, there is one important difference between our experiments with NP-scrambling and previous studies that investigated wh-questions. The designs of previous studies on wh-questions such as Sprouse et al. (2011, 2012, 2016) and Kush et al. (2018, 2019) manipulated the distance of wh-movement dependencies, with wh-movement originating in either the matrix clause (short) or embedded clause (long). As discussed in detail below, in our experiments, as well as in Yano (2019), what was manipulated is the presence or absence of long distance scrambling, not the distance of the scrambling. This is because an instance of NP-scrambling is unambiguously A-bar-movement only if it is long distance; thus, one cannot compare instances of short and long scrambling sentences without introducing yet another factor that might affect their acceptability. We therefore note that the presence of scrambling may incur a larger main effect, as the mere presence of a long distance dependency alone has been shown to cause a significant decrease in acceptability compared to sentences without such a dependency (e.g., Kluender & Kutas 1993).
A potentially larger main effect in turn raises the possibility that the superadditive interaction might cause a floor effect; we discuss this possibility as part of the description of the logic of the design in Section 3.1 below.

The logic of the design
Experiment 1 tested three island types: noun complements, relative clauses, and coordinated NP structures. By including both types of complex NPs together in the same experiment, we can investigate Saito's (1985; 1987) claim that noun complements yield smaller island effects than relative clauses. We included coordinated NP structures because they are uncontroversially considered islands in the literature (Harada 1977), and therefore could serve as a type of baseline comparison for the complex NPs.
We employed the factorial definition of island effects, both because we believe it matches the logic that has historically been used by syntacticians to define island effects, and because it allows us to eventually integrate our results with the growing cross-linguistic experimental literature using the factorial definition (a.o., Christensen et al. 2013; Almeida 2014; Kim & Goodall 2016; Sprouse et al. 2016; Kush et al. 2018; 2019; Stepanov et al. 2018; Tanaka & Schwartz 2018; Ko et al. 2019; Lu et al. 2019; Tucker et al. 2019; Omaki et al. 2020). As described below, we implement the factorial design completely within participants, allowing us to quantify to what extent each participant reports an island effect, so that we can investigate a conjecture motivated by discussion in Yano (2019) that noun complements may show a high degree of variability across participants. Finally, we use relatively long scrambled NPs to satisfy the long-before-short preference.
The factorial design has two factors: SCRAMBLING manipulates the presence or absence of NP-scrambling (no-scrambling/scrambling), and STRUCTURE manipulates the structure of the embedded clause (non-island/island). Fully crossing these two factors in a 2×2 design leads to four conditions. In (9), we illustrate all four conditions for noun complements. Note that the NPs that are the target of scrambling are outlined with a box.
(9) Example conditions for noun complements
a. non-island / no-scrambling
Shinjin-no kisha-wa [CP IT-gaisha-no shachoo-ga
novice-GEN reporter-TOP [CP IT-company-GEN CEO-NOM
chuumoku-no kakkitekina sofutowea-o daigaku.zaigakuchuu-ni kaihatsushi-ta-to]
popular-GEN innovative software-ACC college.days-in develop-PST-COMP]
kiji-ni kai-ta.
article-as write-PST
'That novice reporter wrote (as an article) that the CEO of the IT company developed the popular, innovative software while s/he was in college.'
kai-ta.
college.days-in develop-PST-that.say article-ACC] write-PST
'That novice reporter wrote an article that the CEO of the IT company developed the popular, innovative software while s/he was in college.'
kai-ta.
article-ACC] write-PST
(10) and (11) below show all four conditions for relative clauses and coordinated NP structures, respectively. The non-island structure that we chose for relative clauses was a declarative CP. The non-island structure that we chose for coordinated NP structures was an NP-PP sequence.
(10) Example conditions for relative clauses
a. fune-de hikkoshisaki-ni okut-ta-to setsumeeshi-ta.
together ship-by new.address-to send-PST-COMP explain-PST
The value of the factorial definition is that it isolates the island effect in the interaction between SCRAMBLING and STRUCTURE (while subtracting out the main effects of those factors). If there is no island effect, we expect to see no interaction, as illustrated in the left panel of Figure 1, where the two lines that connect the two means for the island condition sentences and the non-island condition sentences are parallel. If there is an island effect, we expect to see a superadditive interaction, as illustrated in the center and right panels, where the two lines are not parallel because the mean for the scrambling/island condition sentence is lower than would be expected if the two main effects were the only effects present. Crucially, we can also look at the size of the interaction as a measure of the size of the island effect (e.g., to test the claim by Saito 1985; 1987); the center panel illustrates a smaller effect, and the right panel illustrates a larger effect.
As discussed in Section 2.2, one factor that makes our experiments different from previous studies that examined other types of A-bar dependencies is that our second factor is about the absence versus presence of an A-bar dependency, i.e., NP-scrambling, while previous studies manipulated the distance of the dependency (e.g., wh-movement that originated in the matrix versus embedded clause). Because of this difference, our experiments might show a larger main effect of scrambling than is observed in previous studies. Figure 2 demonstrates this. The main effect, larger than in Figure 1, appears as a steeper downward slope in the non-island structure line. One concern that arises with large main effects is that they make a floor effect more likely when the interaction is superadditive.
A floor effect arises when the superadditive interaction is so large that it should push the island/scrambling condition beneath the bottom edge of the scale, but since the scale does not go any lower, the island/scrambling condition becomes metaphorically pinned to the floor of the scale. The end result is an underestimation of the island effect size. Though we cannot eliminate this possibility, we can check for the possibility of floor effects by plotting the rating of the least acceptable filler in each plot as a solid gray line, as an estimate of the functional floor of the scale. If the island/scrambling condition is lying on this line, then a floor effect is possible (though not certain). If the island/scrambling condition is above this line, then a floor effect is not possible.
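The additive logic and the floor check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the condition means are invented values on a z-score scale, `filler_floor` stands in for the mean of the least acceptable filler, and the tolerance `tol` for deciding that a condition "lies on" the floor line is a hypothetical parameter that is not part of the design described here.

```python
def additive_prediction(means):
    """Predicted island/scrambling mean if only the two main effects exist."""
    base = means[("non-island", "no-scrambling")]
    scrambling_cost = means[("non-island", "scrambling")] - base
    structure_cost = means[("island", "no-scrambling")] - base
    return base + scrambling_cost + structure_cost


def check_island_condition(means, filler_floor, tol=0.0):
    """Return (gap, floor_possible) for the island/scrambling condition.

    gap < 0 means the observed mean is lower than the additive prediction,
    i.e., a superadditive pattern. floor_possible is True when the observed
    mean is at or below the estimated floor (plus a hypothetical tolerance).
    """
    observed = means[("island", "scrambling")]
    gap = observed - additive_prediction(means)
    floor_possible = observed <= filler_floor + tol
    return gap, floor_possible


# Invented condition means, for illustration only.
no_effect = {
    ("non-island", "no-scrambling"): 1.0,
    ("non-island", "scrambling"): 0.5,
    ("island", "no-scrambling"): 0.5,
    ("island", "scrambling"): 0.0,   # exactly what the main effects predict
}
island = dict(no_effect)
island[("island", "scrambling")] = -1.0  # lower than the additive prediction

print(check_island_condition(no_effect, filler_floor=-1.2))
print(check_island_condition(island, filler_floor=-1.2))
```

With parallel lines the gap is zero; with a superadditive pattern it is negative, and its magnitude serves as the effect size as long as the condition stays above the estimated floor.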

Materials and survey construction
Each participant completed a survey that consisted of 58 items: 6 practice items, 12 experimental items and 40 filler items, pseudorandomized to avoid related experimental items appearing in succession. The 12 experimental items consisted of 1 token of each of the 4 conditions for each of the three islands. We chose one judgment per condition per participant to keep the total number of experimental items low, minimizing the chance that participants would notice the goal of the experiment. We compensated for the increased risk of noise with one judgment per condition by testing a sample size (n = 89) that is likely to yield very high statistical power for medium and large effect sizes, and moderate statistical power for small effect sizes. We created 8 lexically matched sets of items per island. The items were then distributed among 8 experimental lists using a Latin square procedure so that participants saw a unique lexical item in each condition. We identified 4 errors in the item codes (out of 96 items across lists) after the experiment. We corrected these errors during analysis, but it meant that the total number of observations per condition was mildly uneven.
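One way to picture the Latin square distribution for a single island type is the rotation sketched below. The text does not specify the exact rotation scheme, so the `(list + condition) mod 8` rule and all names here are assumptions for illustration; the sketch only aims to satisfy the stated constraint that each participant sees a different lexical set in each of the four conditions.

```python
CONDITIONS = [
    ("non-island", "no-scrambling"),
    ("non-island", "scrambling"),
    ("island", "no-scrambling"),
    ("island", "scrambling"),
]
N_SETS = 8  # lexically matched sets per island, as stated in the text


def latin_square_lists(n_sets=N_SETS, conditions=CONDITIONS):
    """Build one list per rotation; each maps a condition to a lexical set index.

    Assumed scheme: list l shows condition number c using set (l + c) mod n_sets.
    """
    lists = []
    for l in range(n_sets):
        assignment = {cond: (l + c) % n_sets for c, cond in enumerate(conditions)}
        lists.append(assignment)
    return lists


lists = latin_square_lists()
for number, assignment in enumerate(lists, start=1):
    print(number, assignment)
```

Under this assumed rotation, every lexical set appears in every condition exactly once across the 8 lists, and no list repeats a lexical set.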

Participants and presentation
Ninety-one participants from two universities in Tokyo, Japan, participated in the experiment. We excluded two participants from analysis because their answers to our language background questionnaire revealed that they had significant exposure to a language other than Japanese before they were 10 years old. Eighty-nine self-reported native speakers of Japanese remained in the analysis. Participants received either course credit or 500 yen for their participation. The experiment was administered online using IBEX (Drummond 2013). Each sentence was presented one at a time on its own presentation screen with a scale from 1 (mattaku fushizen 'completely unnatural') to 7 (mattaku shizen 'completely natural'). Participants indicated their rating by clicking on the appropriate number. Because complex NP islands may show variability across participants, we did not exclude any other participants from analysis.

Results
In this section we describe the results of Experiment 1, with a particular focus on (i) the presence or absence of the superadditive interaction indicative of island effects (and, relatedly, the relative size of the effect) and (ii) the variability of island effects across participants.

The presence or absence of island effects
To determine the presence or absence of island effects, we will look for two properties: (i) a visual pattern indicating a superadditive interaction among the four conditions in the factorial design (as illustrated in Figure 1), and (ii) statistical corroboration of the superadditive interaction. To assess the visual patterns in the results, Figure 3 reports the means and estimated standard errors (±1) for each condition, arranged in an interaction plot. We present the results in two ways. The first is as z-score transformed scores (by participant), which reduces the impact of common forms of scale bias among the participants. The second is as raw results from the 7-point scale. Though we believe that the z-score transformed scores are likely the best option for analyzing acceptability judgments, as one anonymous reviewer points out, the raw scores allow us to evaluate the effect of the z-score transformation. The one caveat is that we must be sure not to exercise researcher degrees of freedom by selecting the results that we prefer. For this project, there is no risk: the z-score transformed scores and raw scores yield the same results.
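As a rough illustration of the by-participant z-score transformation, the sketch below standardizes each participant's ratings against that participant's own mean and standard deviation. The ratings are invented, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, stdev


def zscore_by_participant(ratings):
    """Standardize each participant's ratings (targets and fillers together).

    ratings: dict mapping a participant id to a list of raw 1-7 ratings.
    Uses the sample standard deviation, which is an assumption of this sketch.
    """
    transformed = {}
    for participant, raw in ratings.items():
        m = mean(raw)
        s = stdev(raw)
        transformed[participant] = [(r - m) / s for r in raw]
    return transformed


# Invented ratings: same ordering of items, different use of the scale.
raw_ratings = {
    "low_end_user": [1, 2, 3],
    "high_end_user": [5, 6, 7],
    "wide_range_user": [2, 4, 6],
}
z = zscore_by_participant(raw_ratings)
print(z)
```

The three hypothetical participants use different regions and ranges of the 7-point scale, yet their transformed scores coincide, which is the kind of scale bias the transformation is meant to remove.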
For statistical corroboration, we conducted two types of analyses: one in a null hypothesis testing framework and one in a Bayesian framework. For the null hypothesis test, we constructed linear mixed effects models with SCRAMBLING and STRUCTURE as fixed effects and participant and item as random effects (intercepts only) for each island type using the lme4 package in R (Bates et al. 2015). We calculated p-values using the lmerTest package, which uses the Satterthwaite approximation for degrees of freedom (Kuznetsova et al. 2017). The full set of results is reported in the appendix. We will interpret p-values below the conventional threshold of .05 as evidence against the null hypothesis, and therefore by implication, corroboration of the presence of an island effect. We will interpret p-values above the conventional threshold of .05 as a failure to reject the null hypothesis. Because the failure to reject the null hypothesis cannot be interpreted as evidence in support of the null hypothesis (because the null hypothesis is assumed to be true in the calculation of p-values), we include a Bayesian analysis to directly evaluate the null hypothesis.
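For readers who prefer to see the model written out, one way to express a random-intercepts model of this kind is given below. The notation is ours rather than the paper's, and the 0/1 coding of the two predictors is an assumption, since the coding scheme is not stated in the text: $y_{pi}$ is the rating given by participant $p$ to item $i$, $u_p$ and $w_i$ are the participant and item random intercepts, and $\varepsilon_{pi}$ is the residual.

```latex
y_{pi} = \beta_0
       + \beta_1\,\mathrm{SCRAMBLING}_{pi}
       + \beta_2\,\mathrm{STRUCTURE}_{pi}
       + \beta_3\,(\mathrm{SCRAMBLING}_{pi} \times \mathrm{STRUCTURE}_{pi})
       + u_p + w_i + \varepsilon_{pi},
\qquad
u_p \sim \mathcal{N}(0, \sigma_u^2),\quad
w_i \sim \mathcal{N}(0, \sigma_w^2),\quad
\varepsilon_{pi} \sim \mathcal{N}(0, \sigma^2)
```

Under this assumed coding (scrambling = 1, island = 1), the interaction coefficient $\beta_3$ is the term that carries the island effect, and a superadditive interaction corresponds to $\beta_3 < 0$.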
For this Bayesian analysis, we calculated Bayes factors for the interaction term for the fixed effects in the linear models using the BayesFactor package (Morey & Rouder 2018). The Bayes factors reported here are of the BF10 type: they report the ratio of the likelihood of the data under the experimental hypothesis (H1) that an interaction is present to the likelihood of the data under the null hypothesis (H0) that there is no interaction present. Following Jeffreys (1939/1961), we will interpret a BF10 greater than 3 as strong evidence that an interaction is present, as this indicates that the data is at least 3x more likely under a theory in which the interaction is present than one in which the interaction is absent. Similarly, we will interpret a BF10 less than 0.33 as strong evidence that there is no interaction, as this indicates that the data is at least 3x more likely under the null hypothesis that the interaction is absent. We will also interpret Bayes factors between 0.33 and 3 as inconclusive (as the data does not clearly favor either theory). In Figure 3, we have added the interaction term p-value and interaction BF10 to each cell of the plot so that the visual patterns and statistical results can be evaluated simultaneously. As Figure 3 makes clear, in Experiment 1, we see clear evidence of island effects with both relative clauses and coordinated NP structures: the visual pattern suggests a superadditive interaction, and both statistical analyses corroborate the interaction. However, for noun complements, we see no visual pattern of an interaction. The p-value is substantially above the conventional threshold, suggesting a failure to reject the null hypothesis. The Bayes factor is 0.43, which is close to the conventional threshold of 0.33, and suggests that the data is about 2.3x more likely under the null hypothesis than under the experimental hypothesis.
(The Bayes factor for the raw judgments is at 0.33, but we believe z-scores are the more principled analysis, and therefore focus on the statistical results for z-scores to avoid the appearance of leveraging researcher degrees of freedom to our benefit.) We also note that the mean rating of the island violating condition for noun complements is above the middle of the raw scale (above 4) and right at the middle of the z-score scale, which represents the mean judgment of all the items in the experiment (target conditions and fillers). This is noticeably different than the mean ratings for the island violating conditions for relative clauses and coordinated NPs, as both are substantially below the midpoint of both the raw and z-score scales. We thus conclude that Experiment 1 suggests that noun complements are not islands for NP-scrambling, while relative clauses and coordinated NP structures are. (We also note that there is no evidence of floor or ceiling effects as the mean ratings of all conditions lie below the means of the highest rated fillers and above the means of the lowest rated fillers.)
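The interpretation thresholds stated above can be summarized in a small helper. This is a sketch of the decision rule as described in the text (3 and 0.33 as cutoffs), not of how the BayesFactor package computes its values; the function names and category labels are our own.

```python
def interpret_bf10(bf10, upper=3.0, lower=0.33):
    """Classify a BF10 value using the thresholds stated in the text."""
    if bf10 > upper:
        return "evidence for an interaction"
    if bf10 < lower:
        return "evidence for no interaction"
    return "inconclusive"


def odds_for_null(bf10):
    """How many times more likely the data is under H0 than under H1."""
    return 1.0 / bf10


# The z-score Bayes factor reported for noun complements in Experiment 1.
print(interpret_bf10(0.43), round(odds_for_null(0.43), 2))
```

The reciprocal of BF10 expresses the same evidence from the point of view of the null hypothesis, which is how a BF10 of 0.43 translates into the data being a little over twice as likely under H0.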

Variability in island effects between participants
One possibility raised by Yano (2019) to explain the contradictory results for noun complements across the two experiments that he reports is that the island status of noun complements may show more between-participant variability than other island types. To investigate this, we calculated the size of the island effect reported by each participant as a differences-in-differences (DD) score (Maxwell & Delaney 2003): (island/scrambling − non-island/scrambling) − (island/no-scrambling − non-island/no-scrambling). These DD scores will be negative when the participant shows a superadditive interaction indicative of an island effect, with the magnitude indicating the size of the effect; these DD scores will be 0 when the participant shows no interaction, and positive if the participant shows a pattern in which the island-violating condition is more acceptable than the main effects of structure and scrambling would predict (this latter case is not predicted by any theory, so may be indicative of noise in that participant's responses). Figure 4 reports the distribution of island effect sizes by participant as measured using DD scores for both z-scores and raw scores using histograms overlaid with probability density estimates.
Figure 4: Experiment 1. The distribution of island effect sizes by participant, calculated as differences-in-differences scores for both z-scores and raw scores. The solid line is an estimate of probability density.
One clear sign that noun complements are more variable than the other islands would be for the distribution for noun complements to be wider than the distributions for the other islands. However, this is not what we see in Figure 4. If anything, noun complements show a narrower distribution. What we see instead is that noun complements show a relatively normal distribution centered exceedingly close to 0, as expected if there is no island effect, while the other two islands show distributions that are substantially shifted toward the negative range, as expected if there is an island effect. We therefore conclude that it is unlikely that noun complements show more betweenparticipant variability than the other island types. Instead, we see further corroboration from the relatively normal, and relatively narrow, distribution for noun complements that there is no island effect.
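The differences-in-differences score defined above can be sketched as follows. This is only an illustration of the arithmetic; the function name and the ratings are hypothetical and do not come from our data.

```python
def dd_score(island_scr, nonisland_scr, island_noscr, nonisland_noscr):
    """Differences-in-differences score for one participant and one island type.

    Negative values indicate a superadditive interaction (an island effect).
    """
    return (island_scr - nonisland_scr) - (island_noscr - nonisland_noscr)


# Hypothetical z-scored ratings for two invented participants.
with_effect = dd_score(island_scr=-1.0, nonisland_scr=0.25,
                       island_noscr=0.75, nonisland_noscr=1.0)
without_effect = dd_score(island_scr=0.0, nonisland_scr=0.5,
                          island_noscr=0.5, nonisland_noscr=1.0)

print(with_effect)     # negative: island/scrambling is worse than the two main effects predict
print(without_effect)  # zero: the two main effects account for the ratings
```

Computing one such score per participant and per island type yields the values whose distributions are plotted in the histograms.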

The logic of the design
For Experiment 2, we modified the design in two ways. First, we increased the number of tokens that participants rated per condition to two. This allows us to investigate not only the presence of island effects and the variability between participants, as in Experiment 1, but also the variability of each condition within participants across the two ratings. We will therefore report three subsections of results for Experiment 2. Second, we used the same non-island structure for all three island types, namely an embedded declarative CP. This is a logically possible non-island structure for all three islands; testing it therefore adds a dimension of generality to our results. (The logic of the factorial design is such that the measurement of the island effect, which is in the interaction term, will not be affected by the choice of the non-island structure, as long as the non-island structure does not itself induce an interaction with scrambling. The only consequence of this change is in the main effect of structure.) Using the same non-island structure across all three islands also reduced the number of conditions tested (by 4), helping to offset the increase in tokens per condition.
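The claim that the choice of non-island structure affects only the main effect of structure can be checked with a small numerical sketch. The ratings below are hypothetical, and the second non-island structure is constructed by assumption to differ from the first by a constant amount in both scrambling conditions (that is, it induces no interaction with scrambling of its own).

```python
def dd_score(island_scr, nonisland_scr, island_noscr, nonisland_noscr):
    """Interaction term: negative values indicate an island effect."""
    return (island_scr - nonisland_scr) - (island_noscr - nonisland_noscr)


def structure_main_effect(island_scr, nonisland_scr, island_noscr, nonisland_noscr):
    """Mean of the island conditions minus mean of the non-island conditions."""
    return (island_scr + island_noscr) / 2 - (nonisland_scr + nonisland_noscr) / 2


# Hypothetical mean z-scores for the two island conditions.
island = {"scr": -1.0, "noscr": 0.75}

# Two hypothetical non-island baselines; B is uniformly 0.5 lower than A.
nonisland_a = {"scr": 0.25, "noscr": 1.0}
nonisland_b = {"scr": nonisland_a["scr"] - 0.5, "noscr": nonisland_a["noscr"] - 0.5}

dd_a = dd_score(island["scr"], nonisland_a["scr"], island["noscr"], nonisland_a["noscr"])
dd_b = dd_score(island["scr"], nonisland_b["scr"], island["noscr"], nonisland_b["noscr"])
main_a = structure_main_effect(island["scr"], nonisland_a["scr"], island["noscr"], nonisland_a["noscr"])
main_b = structure_main_effect(island["scr"], nonisland_b["scr"], island["noscr"], nonisland_b["noscr"])

print(dd_a, dd_b)      # the interaction term is identical for both baselines
print(main_a, main_b)  # only the main effect of structure differs
```

The constant shift cancels inside the interaction term because it is subtracted once in each of the two differences.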

Materials and survey construction
In Experiment 2, each participant completed a survey that consisted of 60 items: 16 experimental items (2 each of 8 target conditions) and 44 filler items pseudorandomized to avoid related experimental items appearing in succession. The 8 target conditions were non-scrambling and scrambling versions of declarative CPs (as in 4a and 4b), noun complements (4c and 4d), relative clauses (derivable from 5b), and coordinated NPs (derivable from 6b). We created 4 lexically matched sets of items per structure. The items were then distributed among 2 experimental lists using a Latin square procedure so that participants saw a unique lexical item in each trial.
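One way to realize such a distribution is sketched below for a single structure. The rotation scheme, the function name, and the numbering of the lexical sets are assumptions made for illustration; the description above does not fix these details.

```python
def latin_square_lists(n_sets, conditions):
    """Rotate conditions over lexical sets, producing one list per condition.

    Assumed scheme: in list k, lexical set i appears in condition (i + k) mod
    the number of conditions.
    """
    lists = []
    for list_idx in range(len(conditions)):
        assignment = {}
        for set_idx in range(n_sets):
            assignment[set_idx + 1] = conditions[(set_idx + list_idx) % len(conditions)]
        lists.append(assignment)
    return lists


# Four lexically matched sets and two versions (conditions) of one structure.
lists = latin_square_lists(4, ["no-scrambling", "scrambling"])
for number, assignment in enumerate(lists, start=1):
    print(f"List {number}: {assignment}")
```

Under this scheme each list contains two tokens of each condition, no lexical set is repeated within a list, and every lexical set appears in both conditions across the two lists.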

Participants and presentation
Ninety-three participants from two universities in Tokyo, Japan, participated in the experiment. We excluded three participants from analysis because their answers to our language background questionnaire revealed that they had significant exposure to a language other than Japanese before they were 10 years old. Ninety self-reported native speakers of Japanese remain in the analysis. Participants received either course credit or 500 yen for their participation. The presentation was identical to Experiment 1.

Results
In this section we describe the results of the experiment, with a focus on the three questions licensed by the new design: (i) the presence or absence of the superadditive interaction indicative of island effects (and, relatedly, the relative size of the effect), (ii) the variability of island effects across participants, and (iii) the consistency of participants' ratings across the two tokens of each condition.

The presence or absence of island effects
As Figure 5 shows, Experiment 2 again revealed clear evidence of island effects with coordinate structures and relative clauses: the visual pattern suggests a large superadditive interaction, and both statistical analyses corroborate the interaction. However, for our critical case, noun complements, our statistical criteria suggest that the result is inconclusive. The p-value is well above the conventional threshold of .05, and the Bayes factor is below the conventional threshold of 3. This suggests that we have no evidence either for or against noun complement islands in Experiment 2. There is a small visual trend toward a superadditive interaction in the plot, and the Bayes factor is approaching 3, so it is tempting to wonder whether running a larger sample might push the result to cross our statistical thresholds. What this would mean in practice is that there is an underlying island effect for noun complements, but that it differs from the island effects for relative clauses and coordinate structures in important ways. First, it would be substantially smaller in size. It must be so small that it did not appear at all in Experiment 1 with 89 participants and one token per participant and does not reliably appear in Experiment 2 with 90 participants and two tokens per participant. In contrast, relative clauses and coordinate structures show relatively large and reliable effects in both experiments. (According to  only 37 participants and one token per participant are necessary for 80% power to detect medium effect sizes and only 17 to detect large effect sizes.) Furthermore, as the raw scores show, the island-violating condition (island/scrambling) for noun complement islands is rated above the midpoint of the scale (around 4.5 in both experiments) and does not result in unacceptability.
This would make the hypothetical island effect a subliminal island (Almeida 2014; Tanaka & Schwartz 2018; Keshev & Meltzer-Asscher 2019), in contrast with relative clause and coordinate structure islands, which are classic island effects that lead to unacceptability (the island-violating condition is in the lower half of the scale). Subliminal island effects raise difficult challenges for the theory of island effects. As one anonymous reviewer notes, if future studies do determine that there is a subliminal island effect with noun complements in Japanese, it will be critical to explore the source of the effect to determine whether it is caused by the grammar, by sentence processing costs (e.g., Keshev & Meltzer-Asscher 2019), or by an artifact of averaging together two or more distinct groups of participants (e.g., Kush et al. 2018). However, since there is no statistically reliable effect in our data, and this is just a hypothetical possibility, we cannot explore the source in this study. (We also note that the island/scrambling condition of the coordinate structure island is rated below the mean of the lowest rated filler. This means that it could be sitting at the floor. But since it is the largest island effect in the experiment, the theoretical consequence of underestimating its effect size is minimal. Instead, it tells us that the island/scrambling condition for relative clause islands is not sitting at the floor, despite being roughly equal in acceptability to the lowest rated filler.) To summarize, by conventional statistical criteria, Experiment 2 provides strong evidence for large, classic island effects with relative clauses and coordinate structures, but no evidence either for or against island effects with noun complements.
And, if one wishes to interpret the visual and BF trend as evidence that there may be a small noun complement island effect that we failed to detect, one must also conclude that it differs substantially from relative clause and coordinate structure islands both in size and in location in the scale (i.e., it would be a subliminal island).

Variability in island effects between participants
Turning next to the variability in island effects between participants in Experiment 2, Figure 6 shows that noun complement islands once again show a relatively narrow normal distribution. There is no evidence that there is excessive variability in noun complement islands compared to relative clause and coordinate structure islands. We do note, however, that the center of the distribution for noun complement islands is shifted slightly toward the negative, as expected given the small trend that we saw in the mean ratings in Figure 5. The other two island types continue to show the same substantial shift toward the negative that we saw in Experiment 1.
Figure 6: Experiment 2. The distribution of island effect sizes by participant, calculated as differences-in-differences scores for both z-scores and raw scores. The solid line is an estimate of probability density.

Variability in island effects within participants
Though there is no evidence of increased variability for noun complements between participants, it is possible that there is increased variability within participants. Recent work by Kush et al. (2018; 2019) in Norwegian has suggested that some island effects that appear relatively small when viewed through the grand means of the sample may in fact be driven by inconsistent judgments within each participant. Though the source of this inconsistency is still an open area of investigation, here we provide a similar analysis for the ratings in Experiment 2. Figure 7 plots the two judgments that each participant gave for each structure in a scatterplot, with the first judgment along the x-axis and the second judgment along the y-axis. The columns represent each structure, and the rows separate the no-scrambling (top row) and scrambling (bottom row) conditions. We divide each plot into four quadrants. A point in the top right quadrant (Quadrant 1) represents a participant who rated both tokens in the upper half of the scale. For convenience, we will label such a pattern consistent acceptor. A point in the bottom left quadrant (Quadrant 3) represents a participant who rated both tokens in the lower half of the scale. We will label such a pattern consistent rejector. The other two quadrants (Quadrants 2 and 4) represent participants who rated one token in the upper half of the scale and one token in the lower half of the scale. We will label this pattern inconsistent. We have added two features to make the plot a bit easier to read: colors representing the three patterns, and two-dimensional (joint) probability density estimates to draw attention to the density of the points in each location. Similar to a topographic map, in a two-dimensional probability density plot, concentric circles that are closer together represent higher density (because, like topographic maps, these plots are looking down on the peaks in the density space from directly above).
Figure 7: Experiment 2. Scatterplots of the ratings for the two tokens of each condition for each participant, with two-dimensional (joint) probability density estimates overlaid. The points are colored according to the type of judgment pattern defined by the midpoint (0) of the z-score scales.
Figure 8 presents the same plot for the raw scores. We include the raw scores here because we saw in Figure 5 that the midpoint of the z-score scale corresponded with ratings above the midpoint of the raw score scale. This suggests that, on average, the balance of items in the survey was slightly skewed toward higher ratings. This means that the dividing lines in Figure 7 represent consistency relative to the midpoint of the distribution of items in the survey, while the dividing lines for the raw scores in Figure 8 represent consistency relative to the absolute midpoint of the raw rating scale. We have added a small amount of jitter to the points in Figure 8 to make all of the points visible (without jitter, there is quite a bit of overlap among points because the raw scale only allows 7 distinct ratings).
Figure 8: Experiment 2. Scatterplots of the ratings for the two tokens of each condition for each participant, with two-dimensional (joint) probability density estimates overlaid. The points are colored according to the type of judgment pattern defined by the midpoint (4) of the raw score scales. The integer nature of the raw scale would normally mean that many points perfectly overlap. We have added a small amount of jitter to the points to make them all visible.
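The three judgment patterns used in the scatterplots can be sketched as a simple classification of each pair of ratings. The function name and the example ratings are hypothetical. The treatment of a rating that falls exactly on the midpoint (here counted as not being in the upper half) is an assumption made only for this illustration, since the text does not specify it.

```python
def judgment_pattern(first, second, midpoint=0.0):
    """Classify one participant's two ratings of a condition.

    Assumption: a rating exactly at the midpoint is not counted as upper half.
    """
    first_high = first > midpoint
    second_high = second > midpoint
    if first_high and second_high:
        return "consistent acceptor"   # Quadrant 1
    if not first_high and not second_high:
        return "consistent rejector"   # Quadrant 3
    return "inconsistent"              # Quadrants 2 and 4


# Hypothetical z-score pairs (midpoint 0) and a raw-score pair (midpoint 4).
print(judgment_pattern(0.8, 0.3))
print(judgment_pattern(-0.5, -1.2))
print(judgment_pattern(0.4, -0.2))
print(judgment_pattern(6, 2, midpoint=4))
```

Passing a midpoint of 0 corresponds to the z-score plots, and a midpoint of 4 corresponds to the plots based on the 7-point raw scale.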
In both figures we see the same patterns. The no-scrambling conditions (top row) appear to show the upper bounds of consistency in this experiment: the vast majority of participants show the consistent acceptor pattern (Quadrant 1) for each structure, with a small number of inconsistent patterns mixed in. In contrast, the scrambling conditions (bottom row) reveal potentially relevant patterns. In the first column, for the by-hypothesis grammatical scrambling out of declarative CP, we see that the largest mass of participants is in the consistent acceptor quadrant, while there is also a non-negligible number of participants in each of the other three quadrants. This provides more nuance to the mean rating in Figure 5: we see now that the middle-of-the-scale mean rating (near 0) was actually driven by a mix of consistent acceptors, consistent rejectors, and inconsistent participants. This sets a baseline expectation for NP-scrambling consistency: the rating of NP-scrambling itself, in the absence of islands, is relatively variable in Japanese. We can then apply this baseline as we look at the potential island structures.
For noun complements, we see a shift in the probability mass that moves a bit to the left and down from the declarative CP baseline. The center of mass does not quite fully cross into the consistent rejector quadrant, but instead hovers over the horizontal axis line (indicating a rating near 0 for the second token). This shift from the baseline established by scrambling out of declarative CPs is in line with the equivocal results that we saw for noun complements in the means in Figure 5 and the DD scores in Figure 6: there is a small trend toward a slightly negative rating, but enough variability that it is still plausibly an effect around 0. For relative clauses, we see a shift further left and down. Though there is still variability in relative clauses, the vast majority of participants are either consistent rejectors or inconsistent raters, consistent with the grand means in Figure 5 and DD scores in Figure 6. Finally, we see an even further shift toward the consistent rejector quadrant for coordinate structures, which is again consistent with the grand means in Figure 5 and the DD scores in Figure 6.
Taken together, what we see is that scrambling itself introduces a fair amount of within-participant variability to judgments when compared to no-scrambling conditions. Noun complements appear (visually) to show roughly the same amount of variability as scrambling from declarative CP structures. This leads to two conclusions. The first is that there is no additional variability in noun complements that could explain past debates about their island status. The second is that noun complements show the same general pattern of variability as unequivocally grammatical sentences. This stands in contrast to the unequivocal islands, which tend to show a mild increase in consistency compared to the grammatical controls. This again points to noun complements being qualitatively distinct from the other islands (and perhaps more similar to grammatical sentences).

General Discussion and Implications
This paper presented two experiments using the factorial definition of island effects to compare three island types with relatively large samples (89 and 90 participants respectively): relative clauses, noun complements, and coordinated NP structures. Experiment 1 tested one token per condition, whereas Experiment 2 tested two tokens per condition, allowing for an investigation of within-participant variability. Both experiments unequivocally show that relative clauses and coordinated NP structures are islands for NP-scrambling in Japanese, corroborating previous studies (e.g., Harada 1977; Haig 1976; Saito 1985; 1987). Experiment 1 showed a clear lack of evidence of noun complement island effects, while the results from Experiment 2 were statistically inconclusive. We explored what it would mean to interpret the inconclusive result of Experiment 2 to mean that there is an as-yet-undetected island effect present. In that hypothetical, noun complement islands would (i) be substantially smaller than relative clause and coordinate structure islands, and (ii) be subliminal islands (because all four conditions are above the midpoint of the scale). We also closely examined our results for between- and within-participant variability. Although Yano (2019) raised the possibility that noun complements are associated with an increase in between-participant variability (compared to adjunct islands), we found that noun complements show the same or less between-participant variability than the other islands. We also found that noun complements show the same amount of within-participant variability as unequivocally grammatical scrambling conditions. Taken as a whole, we interpret our findings to suggest that noun complements are not islands (joining Haig 1976). But, given our inconclusive result in Experiment 2 (despite a relatively large sample size), out of an abundance of caution we leave open the possibility that there may be a very small, subliminal noun complement island effect.
This could also explain why some researchers have reported mild noun complement island effects in the past (e.g., Saito 1985; 1987), or observed inconsistent results with typical sample sizes (e.g., Yano 2019).
Though it has occasionally been claimed that island effects for noun complements are smaller than island effects with relative clauses (e.g., Chomsky 1986), to the best of our knowledge, our study is the first to provide formal experimental evidence that the two islands pattern qualitatively differently. In fact, the only previous studies to our knowledge to directly test both noun complements and relative clauses in the same formal experiment are Kush et al. (2018; 2019), which tested both islands in Norwegian for wh-questions and topicalization, respectively. The results of these two studies suggest that the effect sizes for noun complements and relative clauses are approximately equal in Norwegian (though it is always possible that the difference in effect size is simply too small to detect reliably with the sample sizes used in these studies). Our findings also challenge the claim that noun complements are simply relative clauses in disguise (e.g., Nichols 2003; Kayne 2008; Arsenijević 2009; Haegeman 2012; cf. de Cuba 2017), and the claim that Japanese lacks English-like relative clauses entirely (e.g., Kuno 1973; Murasugi 2000). Under these analyses, it would be unexpected to find that only relative clauses are islands in Japanese.
We offer two more observations about the two types of complex NPs in Japanese. First, our results appear to challenge Stowell's (1981) suggestion that island status correlates with the possibility of complementizer deletion: CP complements of verbs allow complementizer deletion (12) and are not islands, while CP complements of nouns do not allow complementizer deletion and are islands (13). (The examples in (12) and (13) are our own, with diacritics indicating the pattern discussed by Stowell 1981.)
(12) Jessica claimed that/∅ Lisa invented the app.
(13) Jessica made the claim that/*∅ Lisa invented the app.
Stowell argues that complementizer deletion is possible in (12) because the embedded clause is a true complement of the verb (and therefore the empty category created by the deletion is governed, satisfying the ECP). Stowell further argues that the impossibility of complementizer deletion in (13) suggests that the embedded clause is, in fact, not a complement, but rather an adjunct (leading to an ECP violation because the empty category created by the deletion is not governed; cf. Grimshaw 1990;Kiss 1990;Takahashi 1994;Sabel 2002;de Cuba 2017). If the embedded clause in noun complement constructions is in fact an adjunct, the island effect observed with English noun complements is expected as a type of adjunct island effect. Interestingly, Fukui (1988) argues that the Japanese noun complements with toyuu also involve an adjunct because toyuu cannot be deleted, as in (14).
(14) Taro-ga sore-o teniire-ta toyuu/*∅ uwasa
T-NOM it-ACC obtain-PST toyuu rumor
'the rumor that Taro obtained it' (Fukui 1998: 513; (26))
If Fukui (1988) is correct, and the embedded clause inside the noun complement is an adjunct, our finding that toyuu noun complements show little to no island effect is puzzling. Fukui's claim, however, is not supported by other Japanese-internal facts about complementizer deletion. First, the Tokyo dialect does not allow for deletion of the complementizer to/tte even when the embedded clause appears to be a clear case of a verbal complement. (The example in (15) is our own, with the diacritic indicating the pattern reported in Fukuda 2000.)
(15) Taro-wa Hanako-ga ki-ta to/tte/*∅ it-ta/omot-ta.
T-TOP H-NOM come-PST to/tte say-PST/think-PST
'Taro said/thought (that) Hanako came.'
Thus, we have no reason to expect that complementizers in general can be deleted even when the embedded clause is indeed a complement; therefore, it is not surprising that toyuu in (14) cannot be deleted. Second, in some Western dialects of Japanese, the complementizer deletion in (15) is acceptable (Saito 1987; Fukuda 2000; Kishimoto 2006). Yet, crucially, no one has claimed that the CP complement of a verb with to in the Tokyo dialect is an island while it is a non-island in these Western dialects. What is more, according to speakers of these dialects whom we consulted, the same complementizer to/tte can also be deleted in noun complements, as in (16), though we note that this fact should be quantified in future judgment or corpus studies.
(16) Taro-ga sore-o teniire-ta tte/∅ yuu uwasa
T-NOM that-ACC obtain-PST tte YUU rumor
'the rumor that Taro obtained that'
Thus, when we look at the facts of complementizer deletion across both English and Japanese, the apparent correlation between island status and the possibility of complementizer deletion disappears.
Footnote 3: An anonymous reviewer also notes that Spanish provides further evidence against half of the correlation with complementizer deletion. Pañeda et al. (2020) show that there is either no island effect or a very small island effect for noun complements in Spanish. Noun complements in Spanish do not allow complementizer deletion, so under the Stowell (1981) analysis, they should show island effects, contrary to fact.
Our second observation is that island status does not appear to correlate with the syntactic complexity of the embedded clause (in terms of number of available positions in the clause, or amount of functional structure within the clause). While relative clauses and noun complements in English are analyzed as involving an embedded CP, i.e., embedded clauses with the same complexity, as discussed in Section 2.1, there are several reasons to believe that the structure of relative clauses in Japanese is less complex along certain dimensions than the structure of the embedded clauses inside noun complements. Despite this observation, only relative clauses show clear island effects. This observation potentially has implications for bounding-based approaches to island effects (like the classic Subjacency and barriers approaches). Though it clearly depends on the details of the theory, in principle, more complex syntactic constituents like noun complements in Japanese have the potential to host more bounding nodes or barriers than less complex constituents like relative clauses, despite island effects patterning in the opposite direction. This suggests that our results may be relevant for adjudicating among specific implementations of bounding-based approaches to island effects.
While it is beyond the scope of this paper to evaluate the full set of theories of islands in the literature, we would like to mention the consequences of our results for a few prominent theories to illustrate their potential theoretical value. Huang's (1982) Condition on Extraction Domains, Lasnik & Saito's (1984) gamma-marking, and Chomsky's (1986) barriers approach all share the intuition that there is a fundamental distinction between adjunct CPs and complement CPs (in terms of government, gamma-marking, and L-marking respectively), and that this distinction causes (most) adjuncts to be islands and (most) complements to be non-islands. English noun complements create complications for this view, as they appear to be complements, but are nonetheless islands. Under these approaches, our finding that Japanese noun complements are likely not islands (or are extremely small islands) suggests that they are indeed complements, and that it is English noun complements that are exceptional in some way, not noun complement constructions in general (but see Hankamer & Mikkelsen (to appear) for a proposal that Dutch and English noun complements involve a CP complement to D or a CP adjunct to DP). The intuition that complements and adjuncts are fundamentally distinct, and that this difference is the source of island effects, is also central to modern phase-based approaches to island effects. For example, Rackowski & Richards (2005) propose that phrases that enter into an Agree relation with a phase head (i.e., v) are transparent to extraction, while phrases that do not enter into an Agree relation are islands. They present evidence that, in Tagalog, complement CPs, which are transparent to extraction, show morphological evidence of this Agree relation, while adjunct CPs, which are islands, do not.
Under this approach, our results suggest that, while noun complements in English must not enter into an Agree relation with the next phase head, noun complements in Japanese do, despite the fact that neither language shows a morphological reflex of this relation. Similarly, Müller (2010) proposes that phases are transparent to extraction as long as they are active and can thus be given an edge feature to accommodate the extraction, in turn circumventing the Phase Impenetrability Condition of Chomsky (2001;. Phase heads are active until their final (i.e., last merged) specifier is merged. Adjuncts are islands because they are the final specifier of a phase head, while complements are not islands because (by definition) they are not specifiers. Under this approach, English noun complements must be final specifiers, and thus adjuncts, while Japanese noun complements would be true complements. For each of these potential analyses, future studies could explore the properties of English and Japanese noun complements (beyond complementizer deletion) to determine if there is independent evidence (beyond island effects) for postulating these critical differences between the two types of CPs.

Conclusion
This article presented two studies that contribute to the discussion of complex NP islands in Japanese. While there is little contention among previous studies that relative clauses are islands for NP-scrambling, there is no consensus regarding noun complements: Haig (1976) argued they are not islands, Saito (1985; 1987) claimed they are, and Yano's (2019) two experiments yielded contradictory results, with one showing a small island effect and the other showing no island effect. We presented two acceptability judgment experiments to compare relative clauses and noun complements with coordinated NP structures, which are uncontroversially islands. Both experiments were designed using the factorial definition of island effects. Participants saw one token per condition in Experiment 1 and two tokens per condition in Experiment 2, thus allowing us to explore possible between- and within-participant variability, which has been raised as a possible complicating factor in previous studies of island effects (Kush et al. 2018; 2019; Yano 2019). Our results corroborated previous studies in that relative clauses and coordinated NP structures yield large island effects in Japanese. Our results for noun complements yielded conclusive evidence that they are not islands in Experiment 1, and yielded statistically inconclusive results in Experiment 2. A closer look at the results revealed no evidence of increased between-participant variability in the judgments on noun complements in Experiments 1 or 2, and the same level of within-participant variability (in Experiment 2) for both scrambling out of verb complements and noun complements. We tentatively conclude that noun complements in Japanese are not islands, but leave open the possibility that noun complements in Japanese yield small, subliminal island effects that cannot be reliably detected even with the large sample sizes here.
Even granting this possibility, noun complements are qualitatively distinct from relative clause island effects, which are easy to detect at these sample sizes.
Our results attest to a clear difference between relative clauses and noun complements in Japanese, whether that difference lies in island status or in the size of the island effects. Furthermore, our results attest to a clear difference between noun complement island effects in Japanese, which either do not exist or are subliminal, and noun complement island effects in other languages that have been tested using the factorial definition, such as English, Italian, and Norwegian, which are large and robust (Sprouse et al. 2011; Sprouse et al. 2012; Kush et al. 2018; 2019). We have argued that these findings have direct consequences for most existing theories of island effects, for theories of the relationship between complementizer deletion and island status, and for theories of the relationship between the complexity of syntactic structure and island status. This in turn suggests that future studies should probe the properties of relative clauses and noun complements (cross-linguistically) along these dimensions.
Our study used scrambling dependencies to evaluate island effects, because wh-questions, which are typically used to investigate island effects in wh-movement languages, do not move overtly in Japanese. There are, however, studies that suggest that wh-in-situ also plays an important role in the theories of islands in these languages. Studies such as Tanaka & Schwartz (2017) on Japanese and Lu et al. (2020) on Mandarin Chinese found that argument wh-phrases that stay in situ within a relative clause give rise to island effects, contrary to previous claims that only certain wh-adjuncts invoke island effects in these languages. (cf. Kim & Goodall 2016, whose experimental investigation of island effects involving wh-in-situ in Korean also observed island effects; but cf. Sprouse et al. 2011, which found no island effects for subject, adjunct, whether, and noun complement islands in Japanese.) The complex picture emerging from these studies suggests a need for comprehensive comparisons across island effects and dependency types within wh-in-situ languages.