Neural dynamics of sentiment processing during naturalistic sentence reading

When we read, our eyes move through the text in a series of fixations and high-velocity saccades to extract visual information. This process allows the brain to obtain meaning, e.g., about sentiment, or the emotional valence, expressed in the written text. How exactly the brain extracts the sentiment of single words during naturalistic reading is largely unknown. This is due to the challenges of naturalistic imaging, which have previously led researchers to employ highly controlled, timed word-by-word presentations of custom reading materials that lack ecological validity. Here, we aimed to assess the electrical neural correlates of word sentiment processing during naturalistic reading of English sentences. We used a publicly available dataset of simultaneous electroencephalography (EEG) and eye-tracking recordings, together with word-level sentiment annotations, for 7,129 words in 400 sentences (Zurich Cognitive Language Processing Corpus; Hollenstein et al., 2018). We computed fixation-related potentials (FRPs), which are evoked electrical responses time-locked to the onset of fixations. A general linear mixed model analysis of FRPs cleaned of visual- and motor-evoked activity showed a topographical difference between the positive and negative sentiment conditions in the 224-304 ms interval after fixation onset in left-central and right-posterior electrode clusters. An additional analysis that included word-, phrase-, and sentence-level sentiment predictors showed the same FRP differences for word-level sentiment, but no additional FRP differences for phrase- and sentence-level sentiment. Furthermore, a decoding analysis that classified word sentiment (positive or negative) from sentiment-matched 40-trial average FRPs showed a 0.60 average accuracy (95% confidence interval: [0.58, 0.61]). Control analyses ruled out that these results were based on differences in eye movements or linguistic features other than word sentiment.
Our results extend previous research by showing that the emotional valence of lexico-semantic stimuli evokes a fast electrical neural response upon word fixation during naturalistic reading. These results provide an important step toward identifying the neural processes of lexico-semantic processing in ecologically valid conditions and can serve to improve computer algorithms for natural language processing.


Introduction
The written word has fundamentally shaped human cultural and cognitive evolution and still today remains a primary medium for information storage (e.g., Wikipedia) and human communication (e.g., email and social media). Nonetheless, very little is currently known about how human readers extract and process meaning from written text.
The cognitive and neural processes of language processing in humans have been of major interest in cognitive science, neurolinguistics, and neuropsychology (Currie, 1990; Manning et al., 1999; Mason and Just, 2006). Previous research has used custom reading material to study specific aspects of linguistic material (e.g., phonetics, morphology, and semantics) in highly controlled experimental settings in order to collect repeated self-reports, forced-choice ratings, eye-movement data, or electrical or functional neuroimaging data ("The Oxford Handbook of Neurolinguistics," 2019). The neural correlates of reading have traditionally been studied with serial word-by-word presentation at a fixed presentation time, a condition that eliminates important aspects of the normal reading process and precludes direct comparisons between neural activity and oculomotor behavior (Dimigen et al., 2011; Kliegl et al., 2012). The electrical neural correlates of naturalistic reading of real sentences, however, have been investigated less frequently due to a number of challenges related to identifying the exact timing and type of visual stimuli presented during reading, as well as the contamination of electrical neuroimaging data with eye-movement-related motor- and visual-evoked potentials. Nonetheless, because of their excellent temporal resolution and comparably low cost, electroencephalography (EEG) in combination with eye-tracking has become an important tool for studying the temporal dynamics of naturalistic reading (e.g., Dimigen et al., 2011; Frey et al., 2018; Hollenstein et al., 2018; Loberg et al., 2018; Sato and Mizuhara, 2018).
In this context, fixation-related potentials (FRPs), the evoked electrical responses time-locked to the onset of fixations, have received broad interest from naturalistic-imaging researchers studying free-viewing visual perception (e.g., Rämä and Baccino, 2010), brain-computer interfaces (e.g., Finke et al., 2016), and naturalistic reading (e.g., Dimigen et al., 2011). In naturalistic reading paradigms, FRPs allow the study of the neural dynamics of how novel information from currently fixated text affects the ongoing language comprehension process. Evidence for this proposition has been provided by Dimigen et al. (2011), who showed that naturalistic reading of unexpected vs. expected words induced an N400 response in the FRP signals, previously observed in experimental paradigms using single word presentations (for a review, see Kutas and Federmeier, 2011). In addition, Frey et al. (2018) found modulations of slow-wave components of FRPs that depended on whether participants performed a memorization or decision-making task while reading, and Sato and Mizuhara (2018) found differences in early (100-200 ms) and late (400-500 ms) FRP components between words subsequently forgotten or remembered by the participants. Collectively, these studies indicate that FRPs provide useful information at high temporal resolution about the cognitive-neural processes that underlie naturalistic reading in humans.
Research on how humans process naturalistic language is paralleled by another line of research in artificial intelligence, called natural language processing, which aims at developing computer algorithms for decoding the meaning of natural language material. In this context, an important topic is sentiment analysis (Cambria et al., 2013; Liu and Zhang, 2012), which aims at detecting emotions and opinions expressed in text for applications such as hate-speech or sarcasm detection (Mamidi et al., 2019). Sentiment analysis has mainly focused on text-based processing using linguistic models (Agarwal et al., 2015) or machine-learning-based prediction of sentiment annotations made by humans (Yang et al., 2012). More recently, however, other researchers (including ourselves) have proposed that text-based sentiment analysis can be considerably improved by incorporating neuro-cognitive data produced by human readers during naturalistic reading (Chanel et al., 2006; Hollenstein et al., 2018; Mishra et al., 2016; Raudonis et al., 2013). These signals include eye-movement parameters such as fixation duration and number of fixations, as well as the FRPs elicited by fixating words of different sentiment. Related work has used EEG power spectra from participants watching movie clips (Nie et al., 2011; Wang et al., 2014) or viewing pictures of human faces (Li and Lu, 2009) to decode the polarity of the evoked sentiment (i.e., positive vs. negative). In a recent study, we showed improved decoding performance for relationship classification, entity recognition, and sentiment analysis using gaze position and EEG activity in addition to text-based features (Hollenstein et al., 2019). However, the spatiotemporal neural dynamics of sentiment processing in humans remain largely unknown. With a few exceptions, FRPs have not been used to assess emotional processing in humans. Guérin-Dugué et al.
(2018) recorded FRPs from participants freely viewing images of faces with different emotional expressions and found FRP differences depending on the emotion expressed at 200-300 ms after fixation onset. Simola et al. (2013) showed images of pleasant and unpleasant scenes to participants and found FRP differences at 400-500 ms. Based on these results for free-viewing image exploration, it remains unknown whether similar FRP differences are evoked by naturalistic reading of text with positive vs. negative sentiment.
In this study, we investigated the neural dynamics of sentiment processing in participants silently reading sentences from English movie reviews while simultaneous EEG and eye-tracking signals were recorded for 7,129 words in 400 sentences (data taken from Hollenstein et al., 2018). The sentences were presented to the subjects in a naturalistic reading scenario: the complete sentence was presented on the screen, and the subjects read each sentence at their own speed. This allowed readers themselves to determine how long they fixated on each word and which word to fixate next. By simultaneously acquiring EEG and eye-movement data, we determined the exact timing and gaze position with respect to word boundaries while subjects were reading sentences, which allowed us to extract EEG signals for word-level processing. In order to extend current insights into lexico-semantic processing during naturalistic reading, we aimed to identify whether and how words with different sentiment connotations (i.e., positive vs. negative emotional valence) would affect the FRP responses to word fixations during naturalistic reading. Moreover, in line with recent work on natural language processing (Hollenstein et al., 2018), we aimed to assess whether word sentiment could be decoded from FRP data in a data-driven fashion.

Methods
The data used in this study were taken from the ZuCo dataset (Hollenstein et al., 2018), an openly available dataset of EEG and eye-tracking data from subjects reading English sentences (https://doi.org/10.17605/OSF.IO/Q3ZWS). A detailed description of the entire ZuCo dataset, including individual reading speed, lexical performance, average word length, average number of words per sentence, skipping proportion at the word level, and the effect of word length on skipping proportion, can be found in Hollenstein et al. (2018). In the following sections, we describe the methods relevant to the subset of data used in the present study.

Participants
The ZuCo dataset comprises recordings from 12 healthy adults (5 females, 22-54 years, all right-handed) who were native English speakers (average score of 95% on the Lexical Test for Advanced Learners of English; Lemhöfer and Broersma, 2012). All participants gave their written informed consent prior to participation in the study. Data from all participants of the ZuCo dataset were included in the present study.

Materials and procedure
ZuCo contains data from three tasks: Normal Reading, Task-Specific Reading, and Sentiment Reading (see Hollenstein et al., 2018). In the present study, we focused our analysis on the data from the Sentiment Reading task (see below) because, compared to the other tasks, the reading material from the Sentiment Reading task contained frequent use of positive and negative expressions. Moreover, human annotations of sentiments at word, phrase, and sentence level were available (provided by Socher et al., 2013).
The linguistic material consisted of English sentences extracted from movie reviews from the Stanford Sentiment Treebank (Socher et al., 2013). The Stanford Sentiment Treebank comprises 11,855 single sentences parsed into individual phrases and annotated by three human judges. Sentiment labels are available at the word, phrase, and sentence level and consist of the average rating across annotators on a 5-point scale ranging from −2 (very negative) through 0 (neutral) to +2 (very positive). For the ZuCo dataset, a total of 400 sentences were randomly selected: 140 positive, 137 negative, and 123 neutral (based on human annotation). Each sentence was individually shown to the participant on a computer screen. The text was black, in 20-point Arial font, and was presented on a gray background. Letters were 8 mm high, corresponding to a 0.674° visual angle. Words were double-spaced, lines were triple-spaced, and each line consisted of a maximum of 80 letters or 13 words. Long sentences spanned multiple lines with a maximum of 7 lines, and only one sentence was presented at a time (see Fig. 1A for an exemplar sentence).
Participants were equipped with a control pad and used their right index finger to trigger the onset of the next sentence. Participants were orally instructed before the experiment to carefully monitor the elicitation of emotions and opinions while reading each sentence and to occasionally answer a control question. Throughout the experiment, control questions were presented for 47 out of 400 sentences, immediately after the participant had finished reading and pushed the button. Control sentences were the same for all participants. The following question was presented on the screen: "Based on the previous sentence, how would you rate this movie? (very bad) | 1 - 2 - 3 - 4 - 5 | (very good). Please press the corresponding number on the keyboard." Ratings for the control questions were given with the control pad numbers 1 (very bad) to 5 (very good), and there was no time limit for providing the response. On average, participants correctly rated 79.53% (standard deviation [SD] = 11.22%) of the control questions. An overview of the sentiment ratings for control sentences is reported in Hollenstein et al. (2018).
Before the experiment, 3-5 practice sentences (including a control question) were presented to each participant to familiarize them with the task. The 400 sentences of the Sentiment Reading task were presented in 6 blocks of 60 sentences (10-15 min per block) and a final block of 40 sentences (8-12 min) in order to allow for regular re-calibration of the eye-tracker between blocks. The order of blocks and sentences within blocks was identical for all subjects. The blocks were presented in two sessions (4 blocks in the first session and 3 blocks in the second session, at the same time of day) in order to reduce fatigue in participants, who completed two additional tasks for the ZuCo study (Hollenstein et al., 2019). Between recording sessions, the proportions of sentences with negative (33%), positive (33%), or neutral (33%) sentiment out of the total number of sentences, as well as the proportions of words with negative (6%), positive (10%), or neutral (84%) sentiment out of the total number of words, were matched.

Data acquisition
Data acquisition took place in a sound-attenuated and dark Faraday recording cage. Participants were comfortably seated at a table in front of a 24-inch monitor (ASUS ROG Swift PG248Q, display dimensions 531.4 × 298.9 mm, resolution 800 × 600 pixels [resulting in a 400 × 298.9 mm display], and vertical refresh rate of 100 Hz) placed 68 cm from the participant. A stable head position was ensured via a chin rest. Participants were instructed to stay as still as possible during the tasks. They were offered snacks and water during the breaks and were encouraged to rest. The experiment was programmed in MATLAB 2016b (The MathWorks Inc., Natick, MA, US), using the PsychToolbox extension. The order of the reading paradigms (i.e., Normal Reading, Task-Specific Reading, and Sentiment Reading) and sentence presentation was the same for all participants. Participants completed the tasks sitting alone in the room while two experimenters monitored their progress in the adjoining room.

Eye-tracking acquisition
An infrared video-based eye-tracker (EyeLink 1000 Plus, SR Research, http://www.sr-research.com/) with a sampling rate of 500 Hz and an instrumental spatial resolution of 0.01° was used to record gaze position and pupil size during the experiment. The eye-tracker was calibrated with a 9-point grid before each recording block. Specifically, participants were asked to direct their gaze in turn to a dot presented at each of nine locations in a random order. In a validation step, the calibration was repeated until the error between two measurements at any point was less than 0.5°, or the average error for all points was less than 1°.

EEG acquisition
High-density EEG data were recorded at a sampling rate of 500 Hz with a bandpass filter of 0.1-100 Hz, using a 128-channel EEG Geodesic HydroCel system (Electrical Geodesics, Eugene, Oregon). The recording reference was at Cz. For each participant, head circumference was measured, and an appropriately sized EEG net was selected. The impedance of each electrode was checked prior to recording, to ensure good contact, and was kept below 40 kΩ. Electrode impedance levels were checked and restored after every third block of 60 sentences (approximately every 30 min).

Fig. 1. A. Exemplar sentence and eye-movement sequence during naturalistic reading. Dots represent fixations, circle sizes represent fixation durations, lines connecting dots represent saccades, and colors represent sentiment type (positive, negative, or neutral) of the fixated word. B. Fixation event selection procedure. C. Comparison of fixation-related potentials (FRPs) before and after cleaning by electroencephalography (EEG) deconvolution modeling. FRPs before cleaning consist of both high-amplitude visual-motor signals and lower amplitude signals related to lexico-semantic processing. FRPs after cleaning consist only of lower amplitude signals from lexico-semantic processing and of random noise retained after removal of visual-motor signals. From top to bottom: Butterfly plot of average FRPs across all trials included in the analysis (black lines are FRPs for different channels, red line is the FRP of channel Pz); average FRPs across all channels; single-trial FRPs for channel Pz sorted by fixation duration and averaged over 50 adjacent epochs (black line represents fixation duration).

Eye-tracking preprocessing
The EyeLink 1000 tracker processes eye-position data, identifying saccades, fixations, and blinks. Saccades are detected from the velocity and acceleration of the eye movements. Here, the SR Research default system parameters were used to define saccades: an acceleration threshold of 8000°/s², a velocity threshold of 30°/s, and a deflection threshold of 0.1°. Fixations were defined as time periods without saccades. The dataset therefore consists of (x, y) gaze location entries for individual fixations. Coordinates were given in pixels with respect to the monitor coordinates (the upper left corner of the screen was (0,0), and down/right was positive). Further, a blink can be regarded as a special case of a fixation, where the pupil diameter is either zero or outside a dynamically computed valid pupil range, or the horizontal and vertical gaze positions are zero. For later EEG analysis, we only extracted fixations within the boundaries of each displayed word (Fig. 1B). In naturalistic reading, gaze fixations typically fall on or in-between adjacent words (Dimigen et al., 2011). In order to determine the word currently fixated by the participant, we defined word boundaries (Beymer and Russell, 2005; Hara et al., 2012; Tateosian et al., 2015), i.e., rectangular regions of interest around each word, extended laterally to cover half of the space between the word inside the boundary and the subsequent word in the line. This design resulted in boundaries that were non-overlapping and covered the entire space between subsequent words. Raw data from the eye-tracker showed slightly more variability of gaze positions along the y-axis as compared to the x-axis. Thus, in our data, gaze position along the y-axis was occasionally close to but outside of the vertical word bounds. Given that naturalistic reading occurs within lines of text and not between the lines, we corrected the y-axis data using the following procedure.
Fixations located 50 pixels above the first line or below the last line were excluded from analysis (i.e., out-of-bound fixations). Next, we applied a Gaussian mixture model (GMM) on y-axis gaze data, with the number of Gaussians set equal to the number of lines in the current trial (between 1 and 5; for details, see Hollenstein et al., 2018). As a result, each gaze position was clearly assigned to a specific text line. We used the corrected y-axis gaze positions for subsequent analyses. Fixations that were shorter than 100 ms were excluded from the analyses because they are unlikely to reflect fixations relevant for reading.
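As an illustration, the line assignment described above can be sketched as a one-dimensional Gaussian mixture fitted by expectation-maximization. This is an assumption-laden re-implementation (the original study used a MATLAB GMM, so initialization and convergence details here are ours):

```python
import numpy as np

def assign_lines(y, n_lines, n_iter=100):
    """Assign fixation y-coordinates (pixels) to text lines by fitting a
    1-D Gaussian mixture (one component per line) with EM.
    Illustrative sketch; not the GMM implementation used in the study."""
    y = np.asarray(y, dtype=float)
    # Initialize component means on spread-out quantiles so each component
    # starts near a different text line.
    mu = np.quantile(y, np.linspace(0.1, 0.9, n_lines))
    sigma = np.full(n_lines, y.std() / n_lines + 1e-6)
    pi = np.full(n_lines, 1.0 / n_lines)
    for _ in range(n_iter):
        # E-step: responsibility of each line for each fixation
        dens = pi * np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(y)
        mu = (resp * y[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    # Relabel components so that line 0 is the topmost (smallest y) line
    order = np.argsort(mu)
    relabel = np.empty(n_lines, dtype=int)
    relabel[order] = np.arange(n_lines)
    return relabel[np.argmax(resp, axis=1)], mu[order]
```

Each fixation is then assigned to the line whose component takes the highest responsibility for it, which corrects small vertical drifts of the gaze signal.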

EEG preprocessing
EEG data were preprocessed with the Automagic toolbox for MATLAB (version 1.9, https://github.com/methlabUZH/automagic; Pedroni et al., 2019). One hundred and five EEG channels were used for scalp recordings, and nine electrooculography (EOG) channels were used for artifact removal. The remaining channels (lying mainly on the neck and face) were discarded before data analysis (see Langer et al., 2012). Bad electrodes were identified and replaced. Identification of bad electrodes was based on the EEGLAB plugin clean_rawdata (http://sccn.ucsd.edu/wiki/Plugin_list_process), which removes flatline, low-frequency, and noisy channels. A channel was defined as bad when its recorded data correlated at less than 0.85 with an estimate based on the other channels (channel criterion). Furthermore, a channel was defined as bad if it had more line noise relative to its signal than all other channels (4 standard deviations). Finally, if a channel had a flatline longer than 5 s, it was considered bad. In a next step, we ran the EEG processing pipeline PREP for robust average referencing (Bigdely-Shamlo et al., 2015), including the CleanLine plugin for EEGLAB (Mullen, 2012) for removing power line noise at 50, 100, 150, 200, and 250 Hz. Next, the EEG data were band-pass filtered between 1 and 50 Hz with a Hamming windowed-sinc finite impulse response zero-phase filter (EEGLAB function pop_eegfiltnew.m) to detrend the data and remove high-frequency components of no interest. The filter order was defined to be 25% of the lower passband edge. In this study, we used the Multiple Artifact Rejection Algorithm (MARA), a supervised machine-learning algorithm that evaluates Independent Component Analysis (ICA) components, for automatic artifact rejection.
MARA has been trained on manual component classifications; thus, it captures the wide range of artifacts that manual rejection detects (Winkler et al., 2011, 2014). MARA has proven especially effective at detecting and removing eye and muscle artifact components. Specifically, MARA evaluates each component on six features from the spatial, spectral, and temporal domains (Winkler et al., 2011, 2014). Subsequently, bad electrodes were interpolated using spherical spline interpolation (EEGLAB function eeg_interp.m). We quantified the quality of the EEG data using four quality measures implemented in the Automagic toolbox (see Pedroni et al., 2019). One indicator of good data quality is the ratio of identified bad and, consequently, interpolated channels (RBC). The more channels that are interpolated, the more of the signal of interest is lost and, hence, the worse the data quality. All subjects had less than 15% RBC. A second quality measure is the ratio of data with overall high amplitude (OHA), defined as the ratio of data points (i.e., electrodes × timepoints) with an absolute voltage magnitude higher than 30 μV. The EEG data of all subjects exhibited an OHA of less than 10%. Similarly, the third quality measure is the ratio of timepoints of high variance (THV). THV counts timepoints at which the standard deviation of the voltage measures across all channels exceeds 15 μV. The THV for all subjects was below 10%. Finally, the ratio of channels of high variance (CHV), for which the standard deviation of the voltage measures across all time points exceeds 15 μV, was assessed. The CHV was below 15% for all subjects. During data acquisition, event triggers at the start and the end of each sentence presentation were simultaneously sent from the stimulus presentation computer to both the EEG recording and eye-tracking systems.
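The four quality ratios above (RBC, OHA, THV, CHV) can be computed directly from a channels-by-timepoints data matrix. The following is an illustrative re-implementation with the thresholds reported above, not the Automagic source code:

```python
import numpy as np

def quality_metrics(eeg_uv, bad_channels,
                    oha_thresh=30.0, thv_thresh=15.0, chv_thresh=15.0):
    """Compute the four Automagic-style quality ratios for a
    channels-by-timepoints EEG matrix in microvolts.
    Illustrative sketch; Automagic internals may differ."""
    n_channels = eeg_uv.shape[0]
    rbc = len(bad_channels) / n_channels            # ratio of bad (interpolated) channels
    oha = np.mean(np.abs(eeg_uv) > oha_thresh)      # ratio of high-amplitude data points
    thv = np.mean(eeg_uv.std(axis=0) > thv_thresh)  # ratio of high-variance timepoints
    chv = np.mean(eeg_uv.std(axis=1) > chv_thresh)  # ratio of high-variance channels
    return {"RBC": rbc, "OHA": oha, "THV": thv, "CHV": chv}
```

A recording would pass the criteria reported above when RBC < 0.15, OHA < 0.10, THV < 0.10, and CHV < 0.15.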
After data preprocessing, these event triggers served to temporally synchronize the EEG and eye-tracking data using the EYE-EEG extension (Dimigen et al., 2011). The synchronization is performed at the event triggers for sentence onset and offset by fitting linear functions to the latencies recorded in the EEG and eye-tracking data and subsequently merging the EEG and eye-tracking data. Synchronization quality was ensured by comparing the trigger latencies recorded in the EEG and eye-tracker data. Synchronization errors did not exceed one sample (2 ms), as expected given that the same sampling rate (500 Hz) was used for both EEG and eye-tracking data acquisition. Finally, the synchronized EEG and eye-tracking data were downsampled to 125 Hz using the EEGLAB function pop_resample.
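The trigger-based synchronization amounts to a linear least-squares mapping between the two recording clocks. A minimal sketch (function and variable names are ours, not the EYE-EEG extension's API):

```python
import numpy as np

def map_eyetracker_to_eeg(et_trigger_times, eeg_trigger_samples, et_event_times):
    """Map eye-tracker event times onto EEG sample indices by fitting a
    linear function to shared trigger pairs, as in trigger-based
    EEG/eye-tracking synchronization. Illustrative sketch only."""
    et_trigger_times = np.asarray(et_trigger_times, dtype=float)
    eeg_trigger_samples = np.asarray(eeg_trigger_samples, dtype=float)
    # Least-squares fit: eeg_sample ~ slope * et_time + intercept
    slope, intercept = np.polyfit(et_trigger_times, eeg_trigger_samples, deg=1)
    mapped = np.rint(slope * np.asarray(et_event_times, dtype=float) + intercept)
    # Residuals at the triggers quantify synchronization error; with equal
    # sampling rates these should stay within one sample.
    residuals = np.abs(slope * et_trigger_times + intercept - eeg_trigger_samples)
    return mapped.astype(int), residuals.max()
```

The returned residual corresponds to the per-trigger synchronization error that was checked against the one-sample (2 ms) criterion.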

EEG deconvolution modeling
Free viewing is an important characteristic of naturalistic behavior and imposes challenges for the analysis of electrical and functional neuroimaging data. We note that free viewing in the context of our study refers to the participant's ability to perform self-paced reading given the experimental requirement to keep the head still during data recording. In the case of EEG recordings during naturalistic reading, the self-paced timing of eye fixations, with average durations of 200-250 ms and variable onset asynchronies (e.g., Dimigen et al., 2011), leads to a temporal overlap between successive fixation-related events, including short-latency high-amplitude visuo-motor potentials and mid- to long-latency lower-amplitude potentials related to lexico-semantic processing. There is also a contamination of the signal of interest (i.e., lexico-semantic processing) with stereotypical high-amplitude evoked electrical responses to saccadic eye movements and to visual processing upon fixation onset. In order to isolate the signals of interest and correct for temporal overlap in the continuous EEG, several authors have proposed methods using linear-regression-based deconvolution modeling for estimating the overlap-corrected underlying neural responses to events of different types (see Ehinger and Dimigen, 2019, and Smith and Kutas, 2015a, 2015b, for detailed discussions). Events of interest were electrical responses to saccadic eye movements, visual-evoked responses, blinks, and button-press-related motor responses. Here, we used the unfold toolbox for MATLAB (https://github.com/unfoldtoolbox/unfold/; Ehinger and Dimigen, 2019). Deconvolution modeling is based on the assumption that, in each channel, the recorded signal consists of a combination of time-varying and partially overlapping event-related responses and random noise. Thus, the model estimates the latent event-related responses to each type of event based on repeated occurrences of the event over time.
First, a design matrix is created that includes the onset latencies and temporal offsets from event onset for the different events within a chosen time window. Based on Ehinger and Dimigen (2019), we modeled the EEG during naturalistic reading using eye-tracking based information about fixation onsets, saccade onsets and their amplitudes, blinks, and button press responses. We included the following event types and formulas in our model:

fixation: y ~ 1 + duration
saccade: y ~ 1 + spl(amplitude,10)
blink: y ~ 1
button press: y ~ 1

Note that the dependent variable "y" corresponds to the EEG data from a given channel, "~" refers to being equivalent to or being modeled by, "1" refers to the intercept term of the model, "duration" is the fixation duration, and "spl(amplitude,10)" refers to the use of a spline predictor with 10 splines for modeling non-linear relationships between EEG responses and saccade amplitude (see Ehinger and Dimigen, 2019 for a similar approach). Next, the design matrix was time-expanded to a −600 to 1000 ms time window around event onset, and finally the model coefficients (hereinafter, betas) were estimated for each channel and subject separately by fitting the combined design matrix to the EEG recorded in each channel. We note that the above-listed equations for the different types of events serve to construct a single time-expanded design matrix, which is fit to the continuous EEG data separately for each EEG channel and independently of the other EEG channels. Thus, the specific order in which the equations are entered in the toolbox does not affect the deconvolution outcome (i.e., the beta estimates). The estimated betas reflect the average responses over all events of each type (e.g., evoked electrical responses to saccade-related eye movements, visual-evoked responses, blinks, and button-press-related motor responses), from which overlapping activity was removed.
Note that the outcome of deconvolution modeling is one set of beta estimates for each model predictor/type of event (e.g., fixation), but the betas are not estimated for individual events (for similar approaches, see Brodbeck et al., 2018; Dimigen et al., 2011; Smith and Kutas, 2015b). We did not aim to statistically analyze the extracted betas because for group-level analysis our rather small sample size (N = 12) would have lacked statistical power and our primary aim was not to test for differences in visual-motor responses evoked by naturalistic reading. Instead, we aimed to remove high-amplitude and temporally overlapping visual and motor responses from the continuous EEG in order to identify the neural dynamics of word sentiment processing. Accordingly, we used the beta estimates from deconvolution modeling for data cleaning purposes (based on Ehinger and Dimigen, 2019). Specifically, based on the assumption that the continuous EEG consists of a linear combination of temporally overlapping visual, motor, and lexico-semantic processing signals and random noise, we computed a continuous time series of model-predicted activation reflecting only the temporally overlapping visual and motor response (model-predicted EEG). This was achieved by convolving the design matrix of events with the beta estimates from deconvolution modeling, resulting in a model-predicted EEG that consists of the same number of electrodes and time points as the continuous EEG. It is important to note that the model-predicted EEG reflects only visual-motor responses. The design matrix and beta estimates used for computing the model-predicted EEG were based on different types of visual-motor events but did not include information about the type of linguistic material (e.g., sentiment information).
Subsequently, the model-predicted EEG was subtracted from the continuous EEG for data cleaning: The resulting cleaned EEG thus corresponds to EEG data from which the high-amplitude temporally overlapping visual and motor-evoked responses were removed, and lower amplitude lexico-semantic processing related signals and residual noise were retained (see Fig. 1C for an illustration of the effect of data cleaning).
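The principle of this overlap correction and cleaning can be illustrated for a single channel and a single intercept-only event type: build a time-expanded design matrix, estimate the latent response by least squares, and subtract the model-predicted signal. This is a deliberately reduced toy sketch; the unfold toolbox additionally handles multiple event types and continuous and spline covariates:

```python
import numpy as np

def deconvolve_and_clean(eeg, event_onsets, win=(-5, 10)):
    """Toy linear deconvolution for one channel and one intercept-only
    event type: estimate the overlap-corrected event-related response by
    least squares and subtract the model-predicted EEG. Sketch of the
    principle only, not the unfold toolbox implementation."""
    eeg = np.asarray(eeg, dtype=float)
    n_t = len(eeg)
    lags = np.arange(win[0], win[1])               # samples relative to event onset
    X = np.zeros((n_t, len(lags)))                 # time-expanded design matrix
    for onset in event_onsets:
        for j, lag in enumerate(lags):
            t = onset + lag
            if 0 <= t < n_t:
                X[t, j] = 1.0                      # timepoint t is `lag` samples after an event
    beta, *_ = np.linalg.lstsq(X, eeg, rcond=None) # overlap-corrected response (betas)
    predicted = X @ beta                           # model-predicted (e.g., visuo-motor) EEG
    cleaned = eeg - predicted                      # residual EEG of interest
    return beta, cleaned
```

Because overlapping events hit different columns of the design matrix at different lags, the least-squares solution disentangles responses that would be smeared together by simple averaging.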

Fixation-event selection
In this study, fixations corresponded to reading- and non-reading-related events, the majority of which were word fixations (93,116 of 99,462 fixations, 94%), defined as fixations within a word boundary (see Fig. 1A for an overview). Given that the main aim of this study was to identify differences in FRPs between words with positive vs. negative sentiment, we performed the following event selection procedure. First, fixations out of word bounds and fixations shorter than 100 ms were removed, as they were unlikely to be related to lexico-semantic processing (Sereno and Rayner, 2003). Fixations on neutral words were removed (these in large part consist of filler or stop words, such as "and" and "the"). Next, we excluded words with positive or negative sentiment that were preceded by negation particles (i.e., "no", "not", "without", "nor", and "neither") within three words prior to the current word (<1% of words). Of the remaining words, only a small fraction had a strong sentiment, whereas the majority had a moderate sentiment (based on the annotations from the Stanford Sentiment Treebank; Socher et al., 2013). Accordingly, we pooled strongly negative (label "−2") and moderately negative (label "−1") words into the Negative condition and strongly positive (label "+2") and moderately positive (label "+1") words into the Positive condition (Fig. 1B).
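The word-selection rules above (dropping neutral words and sentiment words preceded by a negation particle within the three preceding words, then pooling labels into two conditions) can be sketched as follows. This is an illustrative re-implementation; the tokenization used in the original pipeline may differ:

```python
NEGATION_PARTICLES = {"no", "not", "without", "nor", "neither"}

def select_sentiment_words(words, labels, window=3):
    """Pool 5-point sentiment labels (-2..+2) into Positive/Negative
    conditions, dropping neutral words (label 0) and sentiment words
    preceded by a negation particle within `window` preceding words.
    Illustrative sketch of the selection rules described above."""
    selected = []
    for i, (word, label) in enumerate(zip(words, labels)):
        if label == 0:
            continue                              # neutral words are removed
        preceding = words[max(0, i - window):i]
        if any(w.lower() in NEGATION_PARTICLES for w in preceding):
            continue                              # negated sentiment word excluded
        condition = "Positive" if label > 0 else "Negative"
        selected.append((i, word, condition))
    return selected
```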
Finally, we performed a trial sampling procedure to remove word-fixation-related differences between experimental conditions that were unrelated to the emotional valence of the words, including fixation duration, word length, fixation onset probability of preceding and subsequent words, and number of trials per condition (Hauk and Pulvermüller, 2004; Thibadeau et al., 1980). Specifically, we used a stratified random sampling procedure in which trials were grouped by sentiment (positive or negative), word length (1-20 characters), and fixation onset probability of the previous or subsequent fixation. We then randomly selected trials from each group such that the number of trials was matched between the positive and negative sentiment conditions; the number of trials selected per group was given by the number of trials available in the condition with fewer trials. This approach resulted, within and across subjects, in a matched number of trials, matched probability distributions for word length, and matched fixation onset distributions (Fig. 4). We note that this procedure slightly reduced the number of accepted events relative to the number of available events for both the positive and negative sentiment conditions, a consequence of the word-length matching procedure: For each subject and each word length, we first determined the number of events for each condition and then matched the number of events between conditions by randomly selecting, from the condition with more events, the same number of events as present in the condition with fewer events. Thus, across subjects and word lengths, trials were removed from both the positive and negative conditions in order to match the word-length distributions between conditions for each subject (see Fig. 4D). Finally, we excluded events in which the FRP exceeded a ±90 μV amplitude threshold to remove transient noise from the EEG (see below).
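The per-group matching logic can be sketched as follows. The trial records and counts are hypothetical, and for brevity only word length is used as a stratum:

```python
import random
from collections import defaultdict

# Illustrative sketch of stratified trial matching: within each stratum
# (here, word length), trials from the condition with more events are
# randomly subsampled to the count of the condition with fewer events.
random.seed(0)
trials = [{"sentiment": s, "word_len": l}
          for s, l, n in [("pos", 4, 30), ("neg", 4, 20),
                          ("pos", 7, 10), ("neg", 7, 25)]
          for _ in range(n)]

groups = defaultdict(lambda: defaultdict(list))
for t in trials:
    groups[t["word_len"]][t["sentiment"]].append(t)

matched = []
for word_len, by_cond in groups.items():
    n_keep = min(len(v) for v in by_cond.values())   # size of the smaller condition
    for cond_trials in by_cond.values():
        matched.extend(random.sample(cond_trials, n_keep))

n_pos = sum(t["sentiment"] == "pos" for t in matched)
n_neg = sum(t["sentiment"] == "neg" for t in matched)
```

After matching, both conditions contain the same number of trials within every word-length stratum, so the word-length distributions are identical between conditions by construction.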
We illustrate the workflow of event selection in Fig. 1B. Of the final selection of 12,789 trials, 6,983 trials (55%) had congruent word- and sentence-level sentiment, 2,167 trials (17%) had incongruent word- and sentence-level sentiment, and the remaining 3,639 trials (28%) contained words with positive or negative sentiment in neutral sentences.

Data segmentation
FRPs were extracted by segmenting the continuous EEG into epochs from −600 ms to 1000 ms relative to fixation onset events (similar to Ehinger and Dimigen, 2019). Epochs were extracted both from EEG without deconvolution (FRPs) and from EEG cleaned by deconvolution modeling (cleaned FRPs). FRP epochs exceeding a ±90 μV amplitude threshold were removed to exclude transient noise (238 of 13,027 epochs, 2%), and for consistency between analyses, the same epochs were removed from the cleaned FRPs. The total number of accepted epochs across subjects was 12,789 (6,379 for the positive condition and 6,410 for the negative condition, Fig. 1B) for both the FRP and cleaned FRP datasets.
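The segmentation and amplitude-rejection steps can be sketched as follows, with synthetic data and an illustrative 125 Hz sampling rate (8 ms per sample, consistent with the >5-sampling-points/>40-ms criteria used in the statistical analyses):

```python
import numpy as np

# Sketch of epoch extraction and amplitude-based rejection (values
# illustrative): cut -600..1000 ms windows around each fixation onset and
# drop epochs whose absolute amplitude exceeds 90 uV on any channel.
rng = np.random.default_rng(1)
srate = 125                                     # Hz (illustrative)
eeg = rng.normal(0, 10, size=(4, 5000))         # channels x time, in microvolts
fixation_onsets = np.array([1000, 2000, 3000])  # sample indices of fixation onsets

pre, post = int(0.6 * srate), int(1.0 * srate)  # 600 ms before, 1000 ms after
epochs = np.stack([eeg[:, on - pre:on + post] for on in fixation_onsets])

# Reject epochs exceeding the +/-90 uV threshold
keep = np.abs(epochs).max(axis=(1, 2)) <= 90
epochs_clean = epochs[keep]
```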

Data analysis
We analyzed reading-related FRP data in two ways. First, the FRP analysis served to identify spatiotemporal differences in FRPs between the positive and negative sentiment conditions. Second, we used decoding analysis to predict the sentiment label (positive or negative) based on a data-driven selection of FRP features. Finally, we conducted control analyses of eye-tracker data and linguistic features to rule out the possibility that systematic differences in eye-movement behavior or in the linguistic material selected for the positive and negative sentiment conditions confounded our EEG analyses.

FRP analysis
We performed the FRP analysis on two independent aspects of the global electrical field: response strength and response topography (Brunet et al., 2011; Murray et al., 2008; Tzovara et al., 2012). Response strength was assessed by global field power (GFP; Lehmann and Skrandies, 1980), the standard deviation of the voltages across all channels at a given time point, which reflects global response strength independent of topographical configuration. We analyzed GFP using a timewise general linear mixed model (GLMM) to control for random subject effects. We complemented this analysis with an electrode-by-time GLMM and a nonparametric cluster-based permutation test to identify physiologically plausible electrode clusters in which response strength differences are observed (Maris and Oostenveld, 2007). Finally, we tested for differences in response topography between the positive and negative sentiment conditions using the topographic consistency test (TCT; König and Melie-García, 2010), which assesses differences in the spatial configuration of the underlying neural generators independent of global electrical field strength (Lehmann and Skrandies, 1980). We conducted all analyses both on reading-related FRPs, which contain oculomotor, visual, and lexico-semantic processing signals, and on cleaned FRPs, from which high-amplitude oculomotor and visual processing signals were removed by deconvolution modeling. All analyses used Word Sentiment as predictor (see below). In addition, we conducted an analysis using Word, Phrase, and Sentence Sentiment as predictors. In the main manuscript, we report only the analyses of cleaned FRPs using the Word Sentiment predictor, because the results were highly similar across analyses, and model fits were slightly better for analyses including only the Word Sentiment predictor rather than the Word, Phrase, and Sentence predictors.
The results of the remaining analyses can be found in the Supplementary Material.

GFP GLMM analysis
We computed GFP (Lehmann and Skrandies, 1980) for each time point and trial and subjected these data to a timewise statistical analysis using a GLMM. The GLMM takes advantage of the large number of fixations available for each subject and provides more accurate and generalizable effect estimates, improved statistical power, and non-inflated type I error rates (Singmann and Kellen, 2017). Group-level analyses, in which only the GFP of one average FRP per subject for the positive and one for the negative sentiment condition would be available, were not carried out because they would have lacked statistical power due to the small sample size (N = 12). The GFP-GLMM analysis of cleaned FRPs used the fixed-effect predictor Word Sentiment (levels: negative or positive) and the random-effect predictor Subject (12 levels) for a subject-wise random intercept and subject-wise random slope. The GLMM was computed using the MATLAB function fitglme with a normal distribution, identity link function, and Laplace fit method. After computing the GLMM across time points, the fixed-effect statistics for the Word Sentiment predictor were extracted. An alpha threshold of p < 0.05 was used, and correction for temporal autocorrelation was based on a >40-ms duration criterion (i.e., >5 sampling points; Lehmann and Skrandies, 1980).
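GFP itself reduces to a one-line computation. The sketch below uses synthetic channel-by-time data (sizes match those reported for the decoding features, but are otherwise arbitrary) to show the definition and its equivalent explicit form:

```python
import numpy as np

# Sketch of global field power (GFP; Lehmann & Skrandies, 1980): the
# standard deviation of the voltages across all channels at each time point.
rng = np.random.default_rng(2)
frp = rng.standard_normal((105, 63))   # channels x time points (illustrative)

gfp = frp.std(axis=0)                  # one GFP value per time point

# Equivalent explicit form: root-mean-square deviation from the channel mean
gfp_explicit = np.sqrt(((frp - frp.mean(axis=0)) ** 2).mean(axis=0))
```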

Electrode-by-time GLMM analysis
To compare the spatiotemporal differences between positive and negative sentiment during word processing, we carried out analyses across all trials from all subjects using the same GLMM as for the GFP analysis. After computing the GLMM across time points and electrodes, the fixed-effect statistics for the Word Sentiment predictor were extracted. An a priori alpha threshold of 0.05 was applied, and correction for multiple comparisons across electrodes and time points was based on a >40-ms duration (i.e., >5 sampling points; Lehmann and Skrandies, 1980) and >5 electrodes (i.e., >5% of all electrodes, similar to Matusz et al., 2015) criterion. This measure is in line with the spatiotemporal clustering commonly observed for EEG event-related potentials (Mensen and Khatami, 2013; Murray et al., 2008).

Cluster-based permutation t-test
We complemented the electrode-by-time analysis with a cluster-based permutation t-test (Maris and Oostenveld, 2007), implemented in the FieldTrip toolbox for MATLAB (Oostenveld et al., 2011). The cluster-based permutation test was applied to the FRPs and tested for differences between the trials of the positive and negative sentiment conditions. GLMM analysis is currently not available in the toolbox; hence, we used a paired-samples test between trials from the positive and negative sentiment conditions, which showed highly similar results to the electrode-by-time GLMM (see Figs. 2 and 3). This analysis identified data samples with significant t-values (p < 0.05, two-tailed), which were clustered based on temporal and spatial proximity (i.e., >4 neighboring electrodes). Each cluster was assigned a cluster-level statistic corresponding to the sum of the t-values of the samples belonging to that cluster. The type I error rate was controlled by randomly shuffling condition labels 1000 times to estimate the distribution of maximal cluster-level statistics obtained by chance and applying a two-tailed Monte Carlo p value. This procedure was applied at the sensor level in the time window from −600 to 1000 ms relative to fixation onset.
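The logic of the cluster-based permutation test can be illustrated on a single channel. This toy sketch (synthetic data, temporal clustering only, a fixed t-threshold) omits the spatial clustering over neighboring electrodes that FieldTrip additionally performs:

```python
import numpy as np

# Minimal single-channel sketch of a cluster-based permutation test
# (Maris & Oostenveld, 2007). Data, effect size, and thresholds are toy values.
rng = np.random.default_rng(3)
n_trials, n_times = 40, 50
cond_a = rng.standard_normal((n_trials, n_times))
cond_b = rng.standard_normal((n_trials, n_times))
cond_a[:, 20:30] += 1.0                               # injected condition effect

T_CRIT = 2.02                                         # ~two-tailed p<.05, df=39

def max_cluster_mass(a, b):
    d = a - b                                         # paired differences per trial
    t = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(len(d)))
    best = mass = 0.0
    for ti in t:                                      # sum |t| over contiguous
        mass = mass + abs(ti) if abs(ti) > T_CRIT else 0.0  # supra-threshold runs
        best = max(best, mass)
    return best

observed = max_cluster_mass(cond_a, cond_b)
null = []
for _ in range(200):                                  # shuffle condition labels
    flip = rng.random(n_trials) < 0.5
    a = np.where(flip[:, None], cond_b, cond_a)
    b = np.where(flip[:, None], cond_a, cond_b)
    null.append(max_cluster_mass(a, b))
p_cluster = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
```

Because only the maximal cluster statistic per permutation enters the null distribution, the procedure controls the family-wise error rate over all time points (and, in the full spatiotemporal version, over electrodes as well).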

TCT
The TCT (König and Melie-García, 2010) is a permutation-based test that assesses the consistency of the topographical distribution of voltages across electrodes over repeated observations. In the present study, we were interested in the consistency of the difference between the positive and negative sentiment conditions. We therefore computed the difference between cleaned FRPs of the positive minus the negative sentiment condition. The difference signals were computed by ranking the trials of each condition by subject, word length, and fixation duration, and then computing the differences between matched trials. In the case of an unequal number of trials between conditions, leftover trials were discarded (<1% of trials). We note that a GLMM analysis was not available for this test, but random effects between subjects were partially accounted for by computing the positive-negative differences within subjects. The TCT uses an electrode-level randomization method to estimate the data distribution under the null hypothesis: At each time point, the GFP of the grand average across trials is computed and compared to GFP values computed on 5000 random shuffles of electrode positions. The p value is the tail probability of the grand-average GFP being larger than the permutation-based GFP values. The alpha threshold was set to p < 0.05, and correction for temporal autocorrelation was based on a >40-ms duration criterion (i.e., >5 sampling points; Lehmann and Skrandies, 1980).
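The core of the TCT can be sketched as follows. The test is shown for a single time point, and the data, trial counts, and number of shuffles are illustrative:

```python
import numpy as np

# Sketch of the topographic consistency test (Koenig & Melie-Garcia, 2010):
# the GFP of the grand average across trials is compared with GFPs obtained
# after randomly shuffling electrode positions within each trial, which
# destroys any consistent topography while preserving per-trial amplitudes.
rng = np.random.default_rng(4)
n_trials, n_channels = 200, 32
template = rng.standard_normal(n_channels)            # consistent topography
trials = template + 0.5 * rng.standard_normal((n_trials, n_channels))

def grand_avg_gfp(x):
    return x.mean(axis=0).std()                       # GFP of the grand average

observed = grand_avg_gfp(trials)
null = []
for _ in range(500):                                  # shuffle electrode positions
    shuffled = np.array([rng.permutation(t) for t in trials])
    null.append(grand_avg_gfp(shuffled))
p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
```

If the trials share a consistent topography, averaging preserves it and the observed GFP exceeds the shuffled values; if topographies are inconsistent, averaging cancels them and the observed GFP falls within the null distribution.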

Decoding analysis
We complemented the descriptive FRP analysis, which does not allow interpreting results beyond the data used for analysis, with a predictive analysis aimed at decoding word-level sentiment from unseen (hold-out) FRP trials (see Breiman, 2001; Yarkoni and Westfall, 2017 for a discussion). The decoding analysis was independent of the FRP analysis and tested whether there are any detectable differences in FRPs between reading words with positive vs. negative sentiment. This analysis did not aim at developing algorithms for brain-computer interfaces. Instead, we aimed to maximize the sensitivity of our analysis for detecting sentiment-related differences in FRPs by using event-matched trials (i.e., 12,789 trials from 12 subjects, Fig. 1B) from cleaned FRPs, because visuo-motor artifacts of no interest were removed from these data. The analysis focused on the 0-500-ms post-fixation interval (63 sampling points), because pre-fixation EEG was unlikely to contribute to reading-related sentiment processing. Thus, a total of 6615 features was used (i.e., 105 EEG channels × 63 sampling points). Given that single-trial EEG has a low signal-to-noise ratio (SNR), due to high-amplitude environmental and physiological electrical signals not time-locked to fixation onset, we aimed to increase SNR by trial averaging. Similar to Tuckute et al. (2019), we compared decoding performance for single-trial vs. trial-averaged data. Trial averages were computed within subjects for random selections of 10, 20, or 40 trials of the same word-level sentiment (i.e., positive or negative) and similar word length. For example, a 40-trial average was computed on FRPs of 40 words of the same sentiment and similar word length that were presented in different sentences. A maximum of 40 trials per average was used (similar to Tuckute et al., 2019) as a trade-off between improving SNR and reducing the number of samples available for decoding analysis.
If the number of trials per subject was not evenly divisible by the desired number of trials per average (i.e., 10, 20, or 40 trials), the remaining trials were averaged and included in the analysis. Thus, a maximum of 12 trials (one per subject) contained fewer trials than the desired number of trials per average (i.e., 30-80% of the desired number of trials). The resulting number of samples was thus 12,789 samples for single-trial, 1300 samples for 10-trial, 650 samples for 20-trial, and 328 samples for 40-trial averages.
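The averaging scheme, including the handling of the leftover group, can be sketched as follows (trial data and counts are synthetic):

```python
import numpy as np

# Sketch of within-condition trial averaging to raise SNR: trials are
# averaged in groups of a target size, and a final leftover group smaller
# than the target is averaged and kept, as described above.
rng = np.random.default_rng(5)
trials = rng.standard_normal((105, 8))     # 105 single trials x 8 features
per_avg = 40                               # target trials per average

averages = [trials[i:i + per_avg].mean(axis=0)
            for i in range(0, len(trials), per_avg)]
# 105 trials with a 40-trial target -> two full 40-trial averages
# plus one leftover 25-trial average
n_samples = len(averages)
```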
Subsequently, the data were randomly split 100 times into a training (95%) and test set (5%) using stratification; that is, the proportion of samples per subject and sentiment condition relative to the total number of samples was matched between training and test sets (accordingly, a majority-class baseline classifier exhibited a mean test set accuracy of 0.5 [95% confidence interval [CI]: 0.5, 0.5]; see Supplementary Material). For each data split, we normalized the features using scikit-learn (https://scikit-learn.org/stable/, StandardScaler class), a machine learning framework for Python, by computing feature-wise mean and standard deviation parameters from the training set data and then applying these parameters for feature normalization of both the training and test sets. Parameter estimation was based only on the training set to prevent data leakage from training to test set (Kononenko and Kukar, 2007). This procedure was followed by dimensionality reduction, which served to avoid overfitting of the classifiers given the large number of available features (6615) relative to the small number of samples (328-12,789) in our dataset. Dimensionality reduction for the analyses reported in the main manuscript was performed using neighborhood component analysis (NCA; Goldberger et al., 2005), a supervised method based on k-nearest-neighbors classification that maximizes differences between classes for a desired number of k features (i.e., the number of retained components after dimensionality reduction). We chose NCA instead of principal component analysis (PCA) because previous research has shown that for EEG data, PCA mainly retains high-amplitude noise in the first principal components (e.g., Artoni et al., 2018), which is not the case for NCA (see Goldberger et al., 2005 for a comparison). We report in the Supplementary Material a comparison of decoding results for NCA and PCA, which were similar.
As for feature normalization, we used scikit-learn (NeighborhoodComponentAnalysis class) and trained the NCA transformation only on the training set, in order to prevent data leakage between training and test samples (Kononenko and Kukar, 2007). The number of k features to retain after dimensionality reduction with NCA was determined during model optimization (see below). Subsequently, decoding analyses using support vector machine (SVM) and logistic regression classifiers were performed. We also report analyses using long short-term memory (LSTM) networks, dense neural networks, and a majority-class classifier in the Supplementary Material. An SVM model represents the samples as points in space, mapped so that the samples of the separate categories are divided by a gap that is as wide as possible; new samples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall (Raschka, 2018; Tuckute et al., 2019). SVM classifiers have previously been used successfully on single-trial and trial-average EEG data (e.g., Tuckute et al., 2019). We used SVM classifiers implemented in scikit-learn (SVC class) with a radial basis function kernel, which is well suited for finding non-linear decision functions, using either fixed parameter values (regularization parameter C = 1 and kernel width parameter γ = 0.03, similar to Tuckute et al., 2019) or parameter optimization by grid search in scikit-learn (GridSearchCV class). Parameter optimization was based on 10-fold cross-validation of the training set using grid search across the following hyperparameters: number of NCA components [10, 20, 40, 80, 160, 320, 640, 1280], regularization parameter C (ten values in the range 1 × 10⁻⁷ to 1 × 10²), and kernel-width parameter γ (ten values in the range 1 × 10⁻⁷ to 1 × 10²).
The model achieving the highest average accuracy on the validation sets across the 10 cross-validation folds was subsequently used for model testing; we observed a median of k = 320 features across data splits for the final model used for testing. The optimal parameters were then used to train an SVM classifier on the entire training set, and this classifier was used for predicting the class labels (i.e., word sentiment) of the test set samples. For the logistic regression classifier, we also used the scikit-learn implementation (LogisticRegression class) and followed the same training and parameter optimization procedure as for the SVM classifiers, except that no γ parameter is required for logistic regression and that an L2-norm penalty was used.
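A toy-scale sketch of this pipeline using scikit-learn follows. Note that the current scikit-learn class is spelled NeighborhoodComponentsAnalysis; the data sizes and the hyperparameter grid here are illustrative and far smaller than the grids reported above:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.svm import SVC

# Sketch of the decoding pipeline: stratified train/test split, then
# normalization, NCA dimensionality reduction, and an RBF-kernel SVM, all
# fit on the training set only (the Pipeline prevents test-set leakage).
rng = np.random.default_rng(6)
X = rng.standard_normal((120, 20))         # samples x features (toy sizes)
y = rng.integers(0, 2, size=120)           # binary sentiment labels
X[y == 1] += 0.5                           # injected class difference

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.05, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),           # parameters fit on training data only
    ("nca", NeighborhoodComponentsAnalysis(n_components=5, random_state=0)),
    ("svm", SVC(kernel="rbf")),
])
grid = GridSearchCV(pipe, {"svm__C": [1, 10],
                           "svm__gamma": [0.01, 0.1]}, cv=3)
grid.fit(X_tr, y_tr)                       # cross-validated grid search
accuracy = grid.score(X_te, y_te)          # accuracy on held-out test samples
```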
Decoding performance for the SVM and logistic regression classifiers was evaluated by comparing the predicted labels (i.e., positive or negative sentiment) with the true labels of the test set, yielding the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the classified trials, from which the decoding performance metrics were computed. We report mean classification performance and 95% confidence intervals of the mean across the 100 random training/test splits (Table 1).
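From the four confusion counts, standard performance metrics follow directly. The sketch below shows accuracy, precision, recall, and F1 as representative examples; the counts are hypothetical, and the paper's exact metric set appears in its Table 1:

```python
# Sketch of decoding metrics derived from confusion counts (TP, TN, FP, FN).
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct labels
    precision = tp / (tp + fp)                   # correct among predicted positives
    recall = tp / (tp + fn)                      # correct among true positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=45, tn=40, fp=10, fn=5)   # hypothetical counts
```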

Control analysis
We conducted additional control analyses of eye-tracker data and linguistic features to rule out the possibility that systematic differences in eye-movement behavior or in the linguistic material selected for the positive and negative sentiment conditions confounded our EEG analyses. Eye-tracker data were analyzed by extracting horizontal and vertical eye velocity, computed as the speed of gaze-position changes along the horizontal and vertical axes of the screen over time. We computed the first derivative of gaze position (in screen coordinates) and segmented the resulting eye velocity data into −600 to 1000 ms peri-fixation epochs (no baseline correction). Statistical analysis was performed separately for horizontal and vertical eye velocity epochs using timewise GLMMs (model formula and parameters identical to those used for the FRP analysis, see above) across the 12,789 fixation trials from all subjects. Correction for multiple comparisons was performed using a minimum 40-ms duration criterion (>5 sampling points; Lehmann and Skrandies, 1980). Next, we extracted fixation onset probabilities, which are the probability distributions across time of the n ± 3 fixations preceding or following the current fixation (n). Fixation onset probabilities provide information about the proportion of trials showing a fixation onset, and thus additional saccadic and visual-evoked electrical responses time-locked to these fixations, within a given time interval relative to the current fixation (see Dimigen et al., 2011 for a similar approach). Any difference in fixation onset probabilities between the positive and negative sentiment conditions would indicate differences in the amount of visual and motor artifacts between conditions and could therefore confound the FRP analysis (especially for data not cleaned by deconvolution modeling) in identifying FRP differences related to lexico-semantic processing.
We extracted fixation onset probabilities for both no-word fixations and word fixations of any sentiment occurring in the −600 to 1000 ms peri-fixation interval for all trials of interest (i.e., positive and negative word-fixation trials from event selection, see above). First, we computed fixation onset times (in ms) of all preceding and subsequent fixations relative to the current fixation onset. We then computed for each condition the probability of fixation onset in consecutive, non-overlapping 60-ms bins as the number of fixations per bin divided by the number of trials per condition, across the entire −600 to 1000 ms peri-fixation interval. The data were statistically analyzed using bin-wise paired-samples t-tests (p < 0.05; >1-consecutive-bin criterion for multiple comparison correction).
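The binning procedure can be sketched as follows. Onset times and trial counts are synthetic, and the final bin extends slightly past 1000 ms because the 1600-ms interval is not an exact multiple of 60 ms:

```python
import numpy as np

# Sketch of the fixation-onset-probability computation: onset times of
# surrounding fixations (relative to the current fixation) are binned into
# consecutive 60-ms bins over the -600..1000 ms interval, and per-bin counts
# are divided by the number of trials.
rng = np.random.default_rng(7)
n_trials = 500
onsets_ms = rng.uniform(-600, 1000, size=2 * n_trials)  # surrounding onsets

bins = np.arange(-600, 1000 + 60, 60)                   # 60-ms bin edges
counts, _ = np.histogram(onsets_ms, bins=bins)
probabilities = counts / n_trials                       # proportion of trials per bin
```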
In addition, we compared word length and fixation duration between positive and negative sentiment trials using the same GLMM as used for the FRP analysis. Finally, we compared word frequencies between words with positive and negative sentiment from the Sentiment Reading task. Among the 7,129 words from the 400 sentences of the Sentiment Reading task, there were 2,475 unique words, of which 275 (11%) had a negative, 400 (16%) a positive, and 1,800 (73%) a neutral sentiment. We note that this imbalance in the number of unique words between the positive and negative sentiment conditions is comparable to the imbalance in fixation trials before event matching (see Fig. 1B); after event matching, however, the number of trials used for FRP analysis was matched between conditions. We then extracted word frequencies for the unique words by counting the number of times each word occurred in the text material of the Sentiment Reading task (7,129 words). We statistically compared the frequencies of words with positive and negative sentiment using a two-samples t-test (p < 0.05).
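The frequency-extraction step amounts to counting occurrences within the corpus text; a minimal sketch with illustrative words:

```python
from collections import Counter

# Sketch of word-frequency extraction: frequencies are occurrence counts of
# each unique word within the task's text material. Words are illustrative.
words = "the movie was great but the ending was not great".split()
freq = Counter(words)            # word -> occurrence count
great_count = freq["great"]
```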

Results
In this study, we assessed the electrical neural correlates of word sentiment processing during naturalistic reading by testing for differences between negative and positive sentiment in reading-related FRPs (FRP analysis) and by predictive modeling aimed at predicting the sentiment of a word from FRP data (decoding analysis). Finally, we compared eye-movement and linguistic features between the sentiment conditions to exclude their confounding contribution to the EEG analyses (control analysis).

FRP analysis
We performed the FRP analysis on two independent aspects of the global electrical field: response strength and response topography (Brunet et al., 2011; Murray et al., 2008; Tzovara et al., 2012). Response strength was assessed by timewise GFP analysis using a GLMM to control for random subject effects. We complemented this analysis with an electrode-by-time GLMM and a nonparametric cluster-based permutation test to identify physiologically plausible electrode clusters of response strength differences between sentiment conditions (Maris and Oostenveld, 2007). Finally, we tested for differences in response topography between the positive and negative sentiment conditions using the TCT (König and Melie-García, 2010). We first present the results for FRPs, which contain oculomotor, visual, and lexico-semantic processing signals, followed by the results for cleaned FRPs, from which high-amplitude oculomotor and visual processing signals were removed by deconvolution modeling.

FRP results
The timewise GLMM analysis of the GFP of FRPs showed no statistical differences between the positive and negative conditions (Fig. 2D). The electrode-by-time GLMM analysis of FRPs showed a statistical difference between the positive and negative conditions in the 224-280 ms interval after fixation onset (condition fixed effect: p < 0.05, >5 electrodes, >40 ms; Fig. 2E). The electrodes showing significantly different activations in this interval were located in a left frontocentral cluster, which presented higher activation for the negative compared to the positive sentiment condition, and in a right-posterior electrode cluster, which exhibited higher activation for the positive compared to the negative sentiment condition (Fig. 2F). Similarly, the cluster permutation test identified two significant electrode clusters (Fig. 2E-F). The first cluster was located in left frontocentral electrodes at 232-280 ms after fixation onset and showed a negative activation difference between the positive and negative sentiment conditions (cluster-level p = 0.04). The second cluster was located in right-posterior electrodes at 184-304 ms after fixation onset and showed a positive activation difference between the positive and negative sentiment conditions (cluster-level p < 0.001, Fig. 2F).

Fig. 4. B. Horizontal and vertical eye velocity statistical comparison between the positive and negative sentiment condition. C. Fixation onset probabilities for ±3 fixations relative to the current fixation (n) and statistical comparison of fixation onset probabilities within the −600 to 1000 ms peri-fixation interval (60-ms bins) between the positive and negative sentiment condition. D. Word length distributions before and after event matching for the negative (black) and positive (red) sentiment conditions. E. Average fixation duration for the positive and negative sentiment conditions. F. Average word frequency for words with positive and negative sentiment.
Both clusters overlapped with the significant electrode clusters identified by the electrode-by-time GLMM analysis (Fig. 2E). The TCT identified a consistent topographical difference between FRPs for the positive and negative conditions in the 224-304 ms interval after fixation onset (p < 0.05, >40 ms, Fig. 2F).

Cleaned FRP results
The timewise analysis of the GFP of cleaned FRPs showed no statistical differences between the positive and negative sentiment conditions (Fig. 3D). However, in line with the results observed for FRPs, the electrode-by-time GLMM analysis of cleaned FRPs showed statistical differences between the positive and negative conditions in the 224-304 ms interval after fixation onset in two electrode clusters. The first cluster was located at left-central scalp locations and showed higher activation for the positive compared to the negative sentiment condition (cluster-level p = 0.009). The second cluster was located at right-posterior scalp locations and showed lower activation for the positive compared to the negative sentiment condition (cluster-level p < 0.0001; Fig. 3E). The TCT identified a consistent topographical difference between cleaned FRPs for the positive and negative conditions in the 224-304 ms interval after fixation onset (p < 0.05, >40 ms; Fig. 3D). Taken together, these results indicate topographical but not amplitude (GFP) differences in cleaned FRPs between the positive and negative sentiment conditions.
In summary, the results from FRPs and cleaned FRPs were highly similar. These findings indicate that the presence of the high-amplitude visual-motor activation in FRPs did not affect the statistical results of our analysis. At the same time, these results rule out that our cleaning procedure artificially induced statistical differences between the positive and negative condition.

Decoding analysis
The results of the decoding analysis of word sentiment from cleaned FRPs using the SVM and logistic regression classifiers are shown in Table 1. This analysis showed chance-level performance for single trials (mean accuracy = 0.50, 95% CI: [0.49, 0.51]), possibly related to the low SNR of single-trial EEG (e.g., Tuckute et al., 2019). However, decoding of trial-averaged data of 20 or more trials improved decoding performance to an above-chance level (0.60 mean accuracy; 95% CI: [0.56, 0.60]; Table 1). This finding indicates that increasing SNR improved sentiment decoding from cleaned FRPs. We report extensive analyses comparing different parameters for feature dimensionality reduction, classifiers, and tuning parameters in the Supplementary Material.

Control analysis
All control analyses were carried out on the final selection of events after fixation-event selection was performed (see Fig. 4B). The timewise analysis of horizontal and vertical eye velocity showed no statistical differences between the positive and negative conditions across the entire −600 to 1000 ms peri-fixation interval (p values are shown in Fig. 4A-B). Likewise, the bin-wise analysis (i.e., 60-ms bins) of fixation onset probabilities showed no significant differences between the positive and negative conditions (p values shown in Fig. 4C). Finally, word length, fixation duration, and word frequency did not differ significantly between the positive and negative conditions (Fig. 4D-F). In summary, the control analyses showed no differences between the positive and negative sentiment conditions in eye movements or linguistic features.

Discussion
This study used synchronized EEG and eye-tracking data to investigate the neural dynamics of word-level sentiment processing in humans reading naturalistic English sentences. Our results showed differences in the electrical neural responses to words with positive vs. negative sentiment, reflected in an FRP topographical difference at 224-304 ms after fixation onset. Decoding analysis showed consistent above-chance decoding of word sentiment based on cleaned FRP data (mean accuracy of 0.60). Our control analyses ruled out that these results were based on differences in eye movements or linguistic features between the positive and negative sentiment conditions. In the following sections, we discuss these results with respect to previous research and add a methodological examination of the advantages and limitations of our methods for naturalistic neuroimaging research.

Neural dynamics of word sentiment processing during naturalistic reading
Naturalistic reading is a complex multicomponent process that involves a temporal sequence of oculomotor, visual-perceptual, and cognitive processes for converting visual information into semantic information embedded into contextual memory. Thus, the brain processes of reading involve orthography, phonology, and semantic processing of single words, as well as processes relating this word-level information to grammar and lexico-semantic information of phrases and entire sentences (Citron, 2012;Hasson and Honey, 2012).
Neurophysiological studies of reading using word-by-word presentation have demonstrated that approximately 100 ms after word presentation the visual input reaches the visual cortex. Around 50-100 ms later, the word is processed as a string of letters in a specialized region of the left visual cortex, and between 200 and 600 ms after a word is presented, its semantic properties are processed (Citron, 2012; Grainger and Holcomb, 2010; Salmelin, 2007). These findings have been substantiated by research showing sustained activity when reading words vs. non-words (Salmelin, 2007). Other studies have found a negative-going response, the N400, during this time period; this response is stronger for words that are semantically incongruent with previously presented words (Hillyard and Kutas, 2002). More recently, it has been shown that the N400 is a continuously graded response that depends on how surprising the word is (Frank et al., 2013). In contrast to single-word reading, naturalistic reading is characterized by the reader's spatiotemporal control: Readers move their eyes actively through text in a series of fixations and saccades (Dodge, 1901; Rayner and Clifton, 2009). Previous studies have suggested that the majority of word encoding and semantic language processing steps occur during word fixation (Clifton et al., 2016; Rayner and Clifton, 2009). In our study, the participants had an average reading speed of 5.5 words per second (Hollenstein et al., 2018) during self-paced naturalistic reading; this speed allowed them to extract semantic meaning (e.g., sentiment) from the text.

Table 1. Sentiment decoding performance for cleaned FRP single trials and multi-trial averages for different classifiers. The first value represents the mean score across 100 test set classifications, and values in brackets refer to the 95% confidence interval. Differences from chance are highlighted in bold.
In the present study, we were specifically interested in spatiotemporal dynamics of sentiment word processing during naturalistic reading. A large body of literature has studied the neural dynamics of written sentiment word processing. The vast majority of existing findings are derived from controlled experiments (serial single word presentation and fixed presentation time), which may not generalize beyond the experimental setting (Hasson and Honey, 2012; see also our discussion of methodological considerations below). Here we primarily focus on the electrophysiological findings. For an overview of the hemodynamic neuroimaging (fMRI) studies, please refer to the review of Citron (2012).
Within the EEG literature, two event-related potential components, the early posterior negativity (EPN) and the late posterior positivity (LPP), have been repeatedly reported in the context of sentiment processing during reading (for reviews, see Citron, 2012; Kissler et al., 2006). The EPN has an occipital-temporal scalp distribution and peaks between 200 and 300 ms after word presentation. The EPN has been linked to attentional mechanisms during access to sentiment information (Schupp et al., 2004), suggesting that this component is involved in implicit processing of emotional content. The EPN amplitude is reportedly increased for emotionally connotated words compared to neutral words during reading (Kissler et al., 2007, 2009), word recognition (Hinojosa et al., 2010), and lexical decision making (Citron, 2011; Schacht and Sommer, 2009; Scott et al., 2009). Although the EPN has mostly been examined with written verbal material, some studies have also identified the EPN in response to emotional pictures and faces (Martín-Loeches, 2007). Source localization has revealed that the EPN originates in the fusiform gyrus (Schacht and Sommer, 2009) or the visual word-form area (Hinojosa et al., 2010). These data support the hypothesis that a word's emotional connotation can be processed in parallel with the representation of its visual form.
Our FRP analyses showed a topographical difference between the positive and negative sentiment conditions in an overlapping time window (224-304 ms after fixation onset), with a spatial scalp distribution similar to the EPN. Positive sentiment words exhibited higher activation than negative words in a left temporal electrode cluster, as well as lower activation in a right occipital electrode cluster (Figs. 2F and 3F). As described earlier, this time period after fixation onset (200-300 ms) is generally associated with the processing of semantic properties (e.g., sentiment), such as integrating the visual stimulus with its corresponding lexical representation (e.g., Abdullaev and Posner, 1998). We hypothesize that the spatiotemporal differences between positive and negative sentiment word processing in the present study are closely related to the reported EPN component. Palazova et al. (2011) observed a topographical difference similar to ours. Using single-word presentations, these authors found a topographical difference between adjectives with positive vs. negative sentiment at 300 ms after word onset, with a centro-posterior pattern similar to our study. However, their voltage pattern differed from ours (Palazova et al., 2011, Fig. 2). These results are interesting in that Palazova et al. (2011) and our study both found word-level sentiment differences in a similar time period. However, the results of the two studies should be compared with caution, because many methodological differences between them may have affected group-level results. These include the subject sample (native German vs. English speakers), the text material (German adjectives vs. English adjectives, verbs, and nouns), the experimental task (timed single-word presentation vs. self-paced naturalistic sentence reading), and the EEG recording settings (left mastoid reference and frontal ground electrode vs. vertex reference and posterior-central ground). To directly compare the paradigms of Palazova et al. (2011) and our study, both should be carried out in the same subjects, using similar linguistic material and EEG recording settings. Future work should investigate the relationship between emotional valence and linguistic features (such as word class) during naturalistic sentence reading. The timing of our results, in the range of 200-300 ms after fixation onset, is also compatible with an alternative explanation, namely cognitive-processing-related P300-N400 components. However, the timing of the sentiment effect in the present study and its topographical pattern (more posterior and more asymmetric) differed from N400 effects of word frequency or predictability (e.g., Dimigen et al., 2011), an outcome that suggests different underlying neural dynamics.
Another frequently reported event-related potential component connected to emotional word processing is the LPP (sometimes also called the late positive complex). The LPP has a centro-parietal scalp distribution and peaks between 500 and 800 ms. The LPP has been associated with sustained processing of the emotional content of verbal stimuli, as it shows larger amplitudes for emotional than for neutral words (Carretié et al., 2008; Hinojosa et al., 2010; Kanske and Kotz, 2007; Schacht and Sommer, 2009). Some studies have reported LPP amplitude differences between stimuli with positive vs. negative valence (Herbert et al., 2008; Kissler et al., 2009; Palazova et al., 2011) and have suggested that the LPP is involved when more controlled, explicit cognitive processes occur. In our study, the electrode-by-time analysis showed an electrode cluster at 400-500 ms, somewhat close to the time period of the LPP, that did not show statistical differences between sentiment conditions (see Figs. 2E and 3E).
Regarding the relationship between the EPN and the LPP, Citron (2012) speculated that the early EPN component reflects the processing of arousal, while the LPP is involved in the processing of valence. However, a clear distinction between arousal and valence has been a source of debate. Lang et al. (1997) considered valence and arousal intrinsically associated. For example, emotionally valenced and neutral stimuli do not only differ along the arousal dimension, but also in terms of valence. Therefore, it has been suggested that the EPN effect can be seen as a more general "emotionality" effect, in which valence and arousal are integrated (Citron, 2012). Furthermore, our focus was to compare the processing of words with positive vs. negative sentiment. Although we cannot entirely exclude the possibility of small disparities in arousal when reading words with positive vs. negative sentiment, these differences were likely small compared to the strong differences in valence associated with words of positive vs. negative sentiment. Our study investigated differences in word sentiment for dichotomous categories (positive or negative). It would be interesting to consider gradual differences between negative and positive sentiment; however, in our stimulus material there was an imbalance between the number of strongly positive/negative (19%) and moderately positive/negative (81%) trials. Hence, we could not reliably model such gradual differences.
Our results indicated topographical differences in FRPs related to word-level sentiment. Other linguistic features may affect FRPs, for instance word length and word frequency, which we addressed through our stimulus selection procedure and control analyses. Other properties of the linguistic material or reading behavior may have contributed to our results, such as the linguistic context at the phrase and sentence level or the word order (Hasson et al., 2015). Such behavioral-linguistic features may play an important role in naturalistic sentence reading, and future work should investigate the embedding of word- into phrase- and sentence-level processing using study designs tailored to this research question. Recent studies have begun to unravel the cognitive and neural processes underlying lexico-semantic processing that relate word- to phrase- and sentence-level processing (e.g., Yeshurun et al., 2017). For instance, in fMRI, Lerner et al. (2011) presented auditory stories scrambled at the sentence, phrase, or word level and found evidence for a hierarchical progression from early sensory regions to higher-order areas, such as temporal and frontoparietal regions, for phrase- and sentence-level processing. These results suggest that language processing involves multiple levels of processing at different temporal scales (Hasson and Honey, 2012). We could not perform such an analysis with our data because our participants always read meaningful sentences at their own pace. Nonetheless, we addressed this issue in supplementary analyses by using a GLMM that included word-, phrase-, and sentence-level sentiment predictors. This analysis showed no statistical differences for the phrase and sentence predictors, while the same statistical difference between the positive and negative sentiment conditions was observed for the word sentiment predictor. These results should not be taken to imply that no such processing occurred.
Rather, by focusing our analysis on FRPs for single words carefully selected to match for low-level oculomotor and linguistic properties, we focused our analysis mainly on the difference between positive and negative sentiment at the word level. Future studies should investigate the phrase and sentence level processing of linguistic sentiment.
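As an illustration of this kind of supplementary model, a mixed-model comparison of word-, phrase-, and sentence-level sentiment predictors with a subject-level random intercept can be sketched with statsmodels on synthetic data. All variable names, the injected effect size, and the data-generating assumptions below are illustrative, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600  # synthetic fixation events

# Illustrative data: one FRP amplitude per fixation, sentiment predictors
# at word, phrase, and sentence level (coded -1/+1), and a subject
# grouping factor for the random intercept.
df = pd.DataFrame({
    "amplitude": rng.normal(0.0, 1.0, n),
    "word_sentiment": rng.choice([-1, 1], n),
    "phrase_sentiment": rng.choice([-1, 1], n),
    "sentence_sentiment": rng.choice([-1, 1], n),
    "subject": rng.integers(0, 12, n),
})
# Inject a small word-level effect so the model has something to recover.
df["amplitude"] += 0.4 * df["word_sentiment"]

# Linear mixed model: fixed effects for each sentiment level,
# random intercept per subject.
model = smf.mixedlm(
    "amplitude ~ word_sentiment + phrase_sentiment + sentence_sentiment",
    df, groups=df["subject"],
)
result = model.fit()
print(result.params["word_sentiment"])  # should recover roughly 0.4
```

In this setup, a reliable word-level coefficient alongside near-zero phrase and sentence coefficients mirrors the pattern reported in the supplementary analysis.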
In summary, our results provide evidence for the existence of a specific temporal window (224-304 ms after fixation onset) and a topographical difference (i.e., different underlying neural generators) for processing positive vs. negative sentiment of words during naturalistic reading. These results from FRP analyses were supported by independent decoding analyses of word sentiment from FRPs (discussed in the next section).

Sentiment decoding from FRPs
The second aim of our study was to assess whether differences in FRPs for reading words with positive vs. negative sentiment can be decoded from unseen (hold-out) FRP trials. It is important to note that the decoding analysis was independent of the FRP analysis and did not aim at developing algorithms for brain-computer interfaces. That endeavor would require single-trial decoding in noisy environments, an application that is beyond the scope of our study. Instead, we conducted the decoding analysis on carefully selected fixation events, on cleaned FRPs, and on trial-averaged data, which served to increase the sensitivity of our analysis for detecting sentiment-related differences in FRPs. Our results showed chance-level decoding performance for single-trial data. These results indicate that classifiers were unable to decode word sentiment from unseen single-trial FRPs. These findings may be related to the low SNR generally observed in single-trial EEG and to the fact that participants performed sentence reading, which involves phrase- and sentence-level semantic processing that possibly interferes with the ability of classifiers to decode word-level sentiment from FRPs (Hasson et al., 2015). However, we observed above-chance decoding performance when the decoding analysis was based on 20- or 40-trial averages across the same word-level sentiment. These results indicate that by increasing SNR via trial averaging, the ability to decode word sentiment from unseen FRP data improved. These results are similar to the study by Tuckute et al. (2019), who observed an improvement in decoding performance from single trials to 30-trial averages. These results highlight the importance of SNR for brain-based prediction of semantic processing.
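The SNR argument behind trial averaging can be illustrated with a short simulation: averaging k trials of a fixed evoked response in independent noise shrinks the noise variance by roughly 1/k. The signal shape, noise level, and trial counts below are arbitrary stand-ins, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_samples = 1000, 63                     # e.g., 63 time points per FRP epoch
signal = np.sin(np.linspace(0, np.pi, n_samples))  # stand-in evoked response
noise_sd = 5.0                                     # single-trial noise dominates

# Simulate single trials: the same evoked response buried in independent noise.
trials = signal + rng.normal(0.0, noise_sd, (n_trials, n_samples))

def snr(x, true_signal):
    """Crude SNR estimate: signal power over residual power."""
    resid = x - true_signal
    return float(np.var(true_signal) / np.var(resid))

single_snr = snr(trials[0], signal)
avg40 = trials[:40].mean(axis=0)   # 40-trial average, as in the study
avg_snr = snr(avg40, signal)

# Averaging k trials shrinks noise variance by ~1/k, so SNR grows ~k-fold.
print(single_snr, avg_snr)
```

The roughly k-fold SNR gain is why a classifier that is at chance on single trials can rise above chance on 20- or 40-trial averages.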
We note that in our study the level of decoding accuracy was low (0.60 mean accuracy for 40-trial averages) and based on large amounts of data (trial averages across 12,789 trials from 12 subjects). Moreover, due to the limited number of 12 subjects in our study, future confirmatory studies are required to replicate and extend our results in larger subject samples. Given these limitations, naturalistic reading-related FRPs as used in our study may not be suited for single-trial brain-computer interface applications (Hebart and Baker, 2018). Our analyses focused on maximizing the sensitivity for detecting sentiment-related differences in FRPs by using carefully selected stimuli, cleaned FRPs, and trial averages. This approach differs fundamentally from brain-computer interface research, which aims to develop paradigms and classifiers that operate on single-trial data in noisy environments. Therefore, our classifiers are unlikely to succeed in brain-computer interface applications, and further work would be needed for such use cases.
Previous research on sentiment analysis in natural language processing has traditionally been based on word-level and sentence-level linguistic features (e.g., Liu and Zhang, 2012; Pang and Lee, 2008). More recent work has used eye-movement data during reading as features to enhance sentiment decoding performance (Mishra et al., 2016; Tomanek et al., 2010; Xu et al., 2015). Only recently have EEG signals been considered for decoding sentiment polarity from words. For instance, Gu et al. (2014) recorded EEG from three participants during an experimentally controlled word-reading and mental-imagery task. The classification analysis used EEG features (electrodes by time points) in a 1.5-s temporal window after stimulus onset to predict the sentiment extracted from sentiment dictionaries. The classification performance ranged from 0.50 to 0.60 accuracy. Our results for sentiment decoding from single words are highly comparable to Gu et al. (2014) in that we observed above-chance predictive accuracy of 0.51-0.58, even though in our experiment sentiment processing occurred implicitly during natural reading and was not tied to an explicit task instruction as in Gu et al. (2014). Sentiment decoding has also been performed on video material. For instance, Wang et al. (2014) and Nie et al. (2011) used EEG frequency components and different machine learning algorithms to decode the sentiment expressed in movie scenes. These studies focused on short movie clips and used as features the EEG segmented into 500-ms to 2-s intervals. This approach yielded classification accuracies between 0.50 and 0.81 across frequency bands, with the best decoding performance in the alpha (8-13 Hz) and beta (13-30 Hz) frequency ranges. It is interesting that sentiment classification on word stimuli (Gu et al., 2014, and our study) showed lower classification accuracy than sentiment decoding from videos (Nie et al., 2011; Wang et al., 2014).
This phenomenon may be related to the more engaging nature of movies and the fixed timing of their visual stimuli, as compared to linguistic stimuli processed in a self-paced fashion during naturalistic reading. This proposal is supported by previous studies showing that watching movie scenes evokes a high inter-subject correlation of electrical and functional neural activity (Gravens et al., 2011; Kauppi and Kauppi, 2010).
Data-driven classification analysis is sensitive to the amount of noise in the data (Delorme et al., 2007). Thus, computing trial averages increases SNR, which can substantially improve classification performance. In line with this idea, we found an improvement in decoding accuracy from single trials to 40-trial averages for both the SVM and logistic regression classifiers (Table 1). Tuckute et al. (2019) observed similar results for EEG-based decoding of image animacy from visual-evoked potentials using an SVM classifier. These authors found an improvement in classification accuracy from single trials (mean accuracy of 0.54-0.61) to trial averages (mean accuracy of 0.50-0.90), which was higher than observed in our study, probably because they used picture stimuli instead of word-reading data (as discussed above). Moreover, we found that the linear classifiers, SVM and logistic regression, performed better on our data than the more complex architectures LSTM and DNN (see Supplementary Material). This result may be related to the comparably low number of trials in our data (12,789 trials) relative to the number of features (105 electrodes × 63 sampling points = 6,615 features). For such data, SVMs have previously been shown to perform better than LSTMs (Arora et al., 2018; Güler and Koçer, 2005; Subasi, 2013). More complex architectures have been successfully applied to EEG data in other contexts (e.g., Gupta et al., 2019; Khurana et al., 2018).
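To make the feature layout concrete, the following sketch flattens a synthetic electrode × time array into 6,615-dimensional feature vectors and compares a linear SVM with logistic regression, as in the reported analysis. The planted effect size, its spatiotemporal location, and the trial counts are invented for illustration and do not reflect the study's data.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_trials, n_elec, n_time = 400, 105, 63   # feature dimensions mirror the study
X = rng.normal(0.0, 1.0, (n_trials, n_elec, n_time))
y = rng.integers(0, 2, n_trials)          # positive vs. negative sentiment label

# Plant a weak class difference in an arbitrary electrode/time window,
# loosely mimicking a sentiment-related FRP effect.
X[y == 1, 60:80, 30:40] += 0.2

# Flatten electrode x time into one feature vector per trial
# (105 x 63 = 6,615 features), as in the reported analysis.
X_flat = X.reshape(n_trials, -1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_flat, y, test_size=0.25, random_state=0)

accs = {}
for name, clf in [("linear SVM", LinearSVC(dual=False)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_tr, y_tr)
    accs[name] = pipe.score(X_te, y_te)
print(accs)
```

With many more features than trials, heavily regularized linear models like these often remain competitive with deeper architectures, consistent with the SVM/LSTM comparison cited above.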

Methodological considerations of naturalistic imaging for reading and sentiment processing
Traditional electrophysiological and functional neuroimaging studies typically consist of highly controlled experiments that vary only a few conditions. Controlled experiments are necessary in order to make accurate inferences; they enable the researcher to isolate a specific process while controlling for all other confounding variables. However, the stimuli for these conditions are artificially designed and might therefore lead to conclusions that do not generalize to how the brain works in real life (Wehbe, 2015). While controlled experiments allow the experimenter to draw precise, testable conclusions about the involved brain regions, they are not sufficient for understanding how complex cognitive tasks (e.g., naturalistic reading) are processed (Hasson and Honey, 2012). When studying language processing, for example, very few experiments have presented subjects with text encountered in everyday life. Instead, they have presented carefully designed stimuli. Based on these studies, it remains challenging to conclude how the multiple processes involved in reading work together and integrate, specifically when isolating one process at a time and keeping everything else constant. This issue might contribute to the current situation, where there is no convergence on a single model of how the brain extracts meaning from language (Friederici, 2012; Hagoort, 2013; Hickok and Poeppel, 2007). This deficit can be at least partly attributed to the difficulty of knowing how such a complex multicomponent process operates by isolating one of its subprocesses at a time (Wehbe, 2015).
In 1973, Newell had already highlighted the difficulty of combining the findings of a series of cognitive science experiments and advocated selecting "a single complex task and do all of it" (Newell, 1973). This idea has also been highlighted in vivid detail by the fable of the six blind men and an elephant, in which the blind men fail to come to an agreement on a perception of the elephant after each of them perceives only one body part (Goldstein, 2009, p. 492). Thus, there is increasing interest in research on naturalistic human behavior, because these conditions are more ecologically valid than traditional experimental research paradigms that use highly constrained stimulus material and frequent repetitions. Recent studies have subjects watching videos (Nishimoto et al., 2011; Petroni et al., 2018), solving math problems, and listening to stories (Brennan et al., 2012; Broderick et al., 2018). Naturalistic experiments promise to deliver insights into human perceptual-cognitive decision making that not only provide a better approximation for identifying the cognitive and neural processes related to real-world human behavior, but can also be used to improve the decision making of computer algorithms (e.g., for natural language processing).
In order to derive a more complete picture of the underlying neural processes of reading, increased research effort has been made to study naturalistic reading. In a series of simultaneous eye-tracking and fMRI naturalistic reading studies, neural correlates of the effects of word length, frequency, and predictability on brain responses during naturalistic reading have been identified (Desai et al., 2018; Henderson et al., 2016; Schuster et al., 2016). There are several challenges associated with naturalistic reading. First, common fMRI sequences typically acquire an image only every 2 s and measure a delayed, smooth hemodynamic response. This temporal resolution might be too slow for the dynamic process of reading at a natural pace and unable to identify the contributions of individual words or concepts to brain activity. However, new methodological approaches (e.g., co-registered eye movements or fast fMRI) can overcome these difficulties (e.g., Desai et al., 2018; Jones et al., 2001; Schuster et al., 2016; Yarkoni et al., 2008). Another issue is that both fMRI and EEG are noisy imaging tools. Multiple repetitions are often necessary to produce a reliable representation of a cognitive process, but repetitions reduce stimulus diversity and decrease ecological validity. In the present study, we avoided repetitions by presenting each sentence only once; we simultaneously achieved an increased SNR by computing averages across the word sentiment conditions. Finally, another difficulty of naturalistic reading experiments is that various processes occur concomitantly during reading, including oculomotor behavior, visual processing, and sentiment processing. This fact makes it more difficult to separate activity patterns mediating linguistic information processing from those mediating other co-occurring processes.
In the present study, we chose to control and correct for various confounding parameters, such as linguistic properties, uncontrolled eye movements, visual stimulation, and the embedding of the currently read text into the lexico-semantic context of previous memory (e.g., Hasson et al., 2015). A particular concern for our study was that the temporal overlap of high-amplitude visual-motor components could preclude the detection of sentiment-related differences in FRPs (Dimigen et al., 2011; Ehinger and Dimigen, 2019; Smith and Kutas, 2015b). We therefore conducted analyses for FRPs cleaned and uncleaned by deconvolution modeling and found highly comparable statistical results. The absence of differences between the analyses may be related to the fact that we employed an event matching procedure in which word fixation duration, and therefore the temporal onset of subsequent fixations, was matched across conditions. If one were to model unmatched conditions, or had fewer trials from which to select the data, EEG deconvolution modeling might be of greater benefit. Alternatively, using deconvolution modeling to directly compare betas between different conditions may be applicable for between-subjects statistical analyses (Ehinger and Dimigen, 2019). This procedure has previously been used for modeling source-level activation from EEG (Brodbeck et al., 2018). Our results indicated that the FRP differences between positive and negative sentiment processing are not merely a function of high-amplitude visuo-motor activity. Rather, they reflect lexico-semantic processing of the word content. We note that the small sample size (N = 12) in our study limits the generalizability of our results and did not allow us to investigate between-subjects random effects. A replication study based on our methods in a larger subject sample is desirable.
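The idea behind deconvolution modeling of overlapping responses (e.g., Ehinger and Dimigen, 2019) can be sketched with a time-expanded design matrix solved by ordinary least squares: responses to temporally adjacent events are disentangled by estimating one coefficient per condition and latency. The sampling rate, event counts, and response shapes below are illustrative assumptions, not parameters of the study.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 100                  # Hz, illustrative sampling rate
n_samples = 60 * fs       # one minute of simulated continuous EEG
kern_len = 50             # 500-ms response window per event

# Two overlapping event types, e.g., fixation onsets of positive vs.
# negative sentiment words, with short, variable inter-event intervals.
onsets = {
    "pos": np.sort(rng.choice(n_samples - kern_len, 80, replace=False)),
    "neg": np.sort(rng.choice(n_samples - kern_len, 80, replace=False)),
}
true_resp = {
    "pos": np.hanning(kern_len),        # stand-in evoked waveforms
    "neg": 0.7 * np.hanning(kern_len),
}

# Simulate continuous EEG as the sum of overlapping evoked responses + noise.
eeg = rng.normal(0, 0.2, n_samples)
for cond, ts in onsets.items():
    for t in ts:
        eeg[t:t + kern_len] += true_resp[cond]

# Deconvolution: time-expanded design matrix with one column per
# (condition, latency) pair; least squares disentangles the overlap.
X = np.zeros((n_samples, 2 * kern_len))
for c, (cond, ts) in enumerate(onsets.items()):
    for lag in range(kern_len):
        X[ts + lag, c * kern_len + lag] = 1.0
beta, *_ = np.linalg.lstsq(X, eeg, rcond=None)
est_pos, est_neg = beta[:kern_len], beta[kern_len:]
```

Unlike simple epoch averaging, which smears the contribution of neighboring events into each epoch, the least-squares estimates recover each condition's response shape even when events overlap in time.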

Conclusion
This study identified the spatiotemporal neural dynamics of sentiment processing during naturalistic reading of English sentences. Combining high-density EEG and eye-tracking data, we showed that individual words with positive vs. negative sentiment evoke a consistent topographical difference in FRPs at 224-304 ms after fixation onset. The FRP signal in this time period allowed decoding of word sentiment with above-chance performance when considering FRP averages of 20 or more trials. Our results provide a proof of concept that combining state-of-the-art electrical neuroimaging with decoding-based analysis can serve to identify the neural dynamics of naturalistic stimulus processing in humans, which in turn can help to improve computer algorithms for natural language processing. This endeavor will advance our understanding of how the human brain extracts meaning from written text under ecologically valid conditions.

Funding
This work was supported by the Swiss National Science Foundation [100014_175875].

Declaration of competing interest
The authors declare no competing interests.