A New Type of Eye Movement Model Based on Recurrent Neural Networks for Simulating the Gaze Behavior of Human Reading

Traditional eye movement models are based on psychological assumptions and empirical data that are not able to simulate eye movement on previously unseen text data. To address this problem, a new type of eye movement model is presented and tested in this paper. In contrast to conventional psychology-based eyemovementmodels, ours is based on a recurrent neural network (RNN) to generate a gaze point prediction sequence, by using the combination of convolutional neural networks (CNN), bidirectional long short-term memory networks (LSTM), and conditional random fields (CRF). The model uses the eye movement data of a reader reading some texts as training data to predict the eye movements of the same reader reading a previously unseen text. A theoretical analysis of the model is presented to show its excellent convergence performance. Experimental results are then presented to demonstrate that the proposed model can achieve similar prediction accuracy while requiring fewer features than current machine learning models.


Introduction
Using computers to simulate humans or to reproduce certain intelligent behaviors related to human vision is a typical computer vision task [1], such as simulating eye movements in reading.However, reading is complex cognitive behavior and the underlying cognitive process occurs only in the brain [2].Modeling such behavior requires obtaining some explicit indicators via such methods as eye tracking.
When reading a text, the eyes of a skilled reader do not move continuously over the lines of text.Instead, reading proceeds by alternating between fixations and rapid eye movements called saccades [3].This behavior is determined by the physiological structure of the human retina.Most of the optic nerve cells are concentrated in the fovea and only when the visual image falls in this area can it be "seen" clearly.Unfortunately, the fovea only provides about a 5degree field of view [4].Therefore, the reader needs to change the fixation point through successive saccades so that the next content falls on the fovea region of the retina.By analyzing eye movements during reading, we can quantify the reader's actions and model for reading.Eye tracking helps researchers to determine where and how many times subjects focus on a certain word, along with their eye movement sequences from one word to another [5].Figure 1 shows an example eye movement trajectory from an adult reader.
Models of eye movement control have been studied in cognitive psychology [6][7][8][9].Researchers integrated a large amount of experimental data and proposed a variety of eye movement models such as easy-rider (E-Z) reader [10] and saccade generation with inhibition by foveal targets (SWIFT) [11].Although these eye movement models typically have parameters that are fit to empirical data, their predictions are rarely tested on unseen data [12].Moreover, their predictions are usually averaged over a group of readers, while eye movement patterns vary significantly between individuals [13].Predicting the actual eye movements that an individual will make while reading a new text is arguably a challenging problem.
Some recent work has studied eye movement patterns from a machine learning perspective [14][15][16][17].These studies were inspired by recent work in natural language processing 2 Complexity

K a t e q u i v e r e d a n d w e n t t o t h e w i n d o w
:fixation (NLP) and are less tied to psychophysiological assumptions about the mechanisms that drive eye movements.The work presented in [14] was the first to apply machine learning methods to simulate human eye movements.The authors used a transformation-based model to predict word-based fixation of unseen text.Reference [17] applied a conditional random field (CRF) model to predict which words in a text are fixated by a reader.However, traditional supervised learning requires more features and preprocessing of data, which may lead to high latency in human-computer interaction applications.
Aided by their parallel distributed processing paradigm, neural networks have been widely used in pattern recognition and language processing because of their parallel distribution [18].In 2010, Jiang [19] studied how to apply neural networks to multiple fields and provided a review of the key literature on the development of neural networks in computeraided diagnosis.In 2011, Ren [20] proposed an improved neural network classifier that introduced balanced learning and optimization decisions, enabling efficient learning from unbalanced samples.In 2012, Ren [21] proposed a new balanced learning strategy with optimal decision making that enables effective learning from unbalanced samples and is further used to evaluate the performance of neural networks and support vector machines (SVMs).
In recent years, deep neural networks (DNN) have become a popular topic in the field of machine learning.DNN has successfully improved the recognition rate and some excellent optimization algorithms and frameworks have been proposed and applied.Guo (2018) proposed a novel robust and general vector quantization (VQ) framework to enhance both robustness and generalization of VQ approaches [22].Liu (2018) presented an efficient bidirectional Gated Recurrent Unit (GRU) network to explore the feasibility and the potential of mid-air dynamic gesture based user identification [23].Liu (2019) presented an end-to-end multiscale deep encoder (convolution) network, which took both the reconstruction of image pixels' intensities and the recovery of multiscale edge details into consideration under the same framework [24].Wu (2018) proposed an unsupervised deep hashing framework and adopted an efficient alternating approach to optimize the objective function [25].Wu (2019) proposed a Self-Supervised Deep Multimodal Hashing (SSDMH) method and demonstrated the superiority of SSDMH over state-of-the-art cross-media hashing approaches [26].Luan (2018) developed a new type of deep convolutional neural networks (DCNNs) to reinforce the robustness of learned features against the orientation and scale changes [27].Besides, some methods based on recurrent networks have been proposed, developed, and studied for natural language processing [28][29][30].
In this paper, we formalize the problem of simulating the gaze behavior of human reading as a word-based sequence labeling task (which is a classic NLP application).In the proposed method, the eye movement data of a reader reading some texts is used as training data and a bidirectional Long Short-Term Memory-Conditional Random Field (bi-LSTM-CRF) neural network architecture is used to predict the eye movement of the same reader reading a previously unseen text.The model is focused on achieving similar prediction accuracy while requiring fewer features than existing methods.However, it is worth emphasizing that in this study we focus only on models of where the eyes move during reading, and we will not be concerned with the temporal aspect of how long the eyes remain stationary at fixated words.
The remainder of this paper is organized into the following sections.Section 2 introduces the problem formulation.Section 3 proves the convergence of the model.Section 4 describes the layers of our neural network architecture.Section 5 discusses the training procedure and parameter estimation.Section 6 demonstrates the superiority of the proposed method with verification experiments.Section 7 concludes the paper with final remarks.Before ending the current section, it is worth pointing out the main contributions of the paper as follows.(iii) An experiment of foveated rendering further demonstrates the novelty and effectiveness of recurrent neural networks for solving the problem of simulating the gaze behavior of human reading.

Problem Formulation
Experimental findings in eye movement and reading research suggest that eye movements in reading are both goal-directed and discrete [6].This means that the saccadic system selects visual targets on a nonrandom basis and that saccades are directed towards particular words rather than being sent a particular distance.Under this view, there are a number of candidate words during any fixation, with each having a certain probability of being selected as the target for the subsequent saccade.For our purposes we will assume that a probabilistic saccade model assigns a probability to fixation sequences resulting from saccade generation over the words in a text.Let us use the following simple representations of a text and fixation sequence.Let  denote a set of readers, and text  represent a sequence of word tokens (text) (w 1 , . . ., w n ).Let  denote a sequence of fixation token positions in , generated by a reader  ( ∈ ).The fixation token positions set () is the set of token positions that removes repetitive elements from , where (S | T, r) is a reader-specific distribution of eye movement patterns given a text .For example, the text "Kate quivered and went to the window" is represented by  = (Kate, quivered, and, went, to, the, window).A sequence of fixations in a reader's reading of the text which is Kate-quivered-and-wentto-the-window is represented by  = (1, 2, 3, 1, 2, 4, 6, 7); and the corresponding fixation token positions set is () = {1, 2, 3, 4, 6, 7}.
The next object of study is the prediction of the fixation point sequence  based on a specific reading event  involving the reader  reading the text .The training data consist of words and a Boolean value indicating whether they are fixation words, in the form of ((w 1 , boolFixation 1 ),. . ., (w n , boolFixation n )).
M is a recurrent neural networks model.Given some text as input, the purpose of this model is to generate a sequence of fixation points that approximate human reading behavior.We evaluate the performance of model  by comparing the predicted fixation sequence set   with the actual fixation sequence set   observed in a reading experiment involving  and .To do this, we train  on a novel set of texts  = {X 1 , . . ., X m } generated by a reader  ∈ .The goal is to infer

Convergence Analysis
For neural networks with  hidden nodes and  training data samples, if the ReLU activation function is used, a previous study in [31] has shown that, as long as  is sufficiently enough, the randomly initialized gradient descent algorithm converges to the global optimal solution.In this section, by following the gradient descent provably optimized method, the convergence performance of our RNN-based model is presented below.First, we consider a recurrent neural network of the following form: where  is the input vector, ℎ is a hidden layer,  is an inactive hidden layer,  is the matrix of  (t) to ℎ (t) ,  is the matrix of ℎ (t−1) to ℎ (t) , and  is the bias of the hidden layer.
Use the tanh activation function to activate all nodes of the layer  to the layer ℎ: Map the layer ℎ (t) to the layer  (t) : where  is the matrix of ℎ (t) to  (t) .Use  (t) to derive the predicted value ŷ and loss function of the model.
where  is an activation function,  is the transform matrix of  t−1 to  t ,  is the state matrix of  t , Z is a scaling factor, and  and  are weights, which is defined as There are 3 parameters that need to be optimized, namely, U, V, and W.Among them, the optimization process of the two parameters of W and U needs to trace the historical data, the parameter V is relatively simple and only needs the current data, and then we will solve the partial derivative of the parameter V first.
The solution of the partial and partial derivatives of W and U is relatively complicated because of the need to involve historical data.Let us assume that there are only three moments, and then the partial derivative of L to W at the third moment is (3)  ℎ (3)   ℎ (3)  ℎ (2)   ℎ (2)   +  (3)   (3)    (3)  ℎ (3)   ℎ (3)  ℎ (2)   ℎ (2)  ℎ (1)   ℎ (1)   (10) It can be observed that, at some point, the partial derivative of L to W needs to trace back all the information before this moment.We can write the general formula of L to the partial derivative of W at time t according to formula (10): To prove that the RNN-based model will converge, it is only necessary to prove that there is an upper limit on the number The proof is thus completed.Proof.At the k-th iteration, we have ‖ ŷ Next, we calculate the difference in prediction between two consecutive iterations.
As in the classical analysis of gradient descent, we also need to bound the quadratic term.
This indicates that there is an upper limit on the number of misclassifications, and the algorithm will converge when the final number of misclassifications reaches the upper limit.So if the data is linearly separable, then the model does converge.

Network Architecture
In this section, we describe a CNN-LSTM-CRF-based network architecture for reading eye movement prediction, consisting of a word embedding layer, a CNN layer, a bidirectional LSTM layer, and a CRF layer from bottom to top.

Word Embedding and CNN Layer.
It was shown in [28] that word embedding plays a crucial role in improving the performance of sequence labeling.We vectorize the text corpus by converting each text into a sequence of integers (each integer is an index of the mark in the dictionary), where the coefficients of each mark can be based on the number of words.
Previous studies, e.g., [32,33], showed that convolutional neural networks (CNNs) are effective methods for extracting word form information from characters in a word and encoding it into a neural representation.Our CNN is similar to that used in [33], except that we use not only word embedding as CNN input, but also the linguistic features of the word.

RNN Layer.
One of the key features of recurrent neural networks (RNNs) is that they can be used to connect past information to current tasks, such as inferring the meaning of a current word from previous words.However, a RNN cannot connect past information when the gap between the relevant information and the current position increases.
Long Short-Term Memory (LSTM) network is a special RNN that can learn long-term dependencies through specialized design.A LSTM is also a multilayered type of the neural network, in which the single neural network layer contains four interactive sublayers, as shown in Figure 2.
In many sequence labeling tasks, it is best to have access to past (left) and future (right) contexts, while LSTM's hidden state does not get information from the future.An excellent solution is the bidirectional LSTM (bi-LSTM) [34], whose validity has been proven in previous studies.Therefore, we can effectively use past features (through the forward state) and future features (through the backward state).We use The repeating unit in LSTM networks.
backpropagation through time (BPTT) [35] to train bi-LSTM networks, which help to process multiple sentences at the same time.

CRF Layer.
A conditional Random Fields (CRF) is a conditional probability distribution model whose nodes can be divided exactly into two disjoint sets X and Y, which are the observed and output variables, respectively.Under CRF, the inference problem for CRF is essentially the same as for a Markova Random Field (MRF) where an input random variable X is given, and the output Y is observed.The output variable Y represents a saccade target (fixation) label sequence, and the input variable X represents the word sequence that needs to be marked or labeled.The reading saccade target prediction problem to be solved is transformed into a sequence labeling problem in this paper.Under the condition that the random variable X is x, the conditional probability for the random variable Y to be valued as y is In this expression, Z (x) is a scaling factor; t k is a transfer feature function that depends on the current landing position and the previous position.s l is a state feature function that also depends on the current landing position.The value of the feature function t k and s l is 1 or 0, which means when it meets the feature condition, the value is 1; otherwise it is 0.  k and  l are weights, which are obtained by training the model.Parameter training is achieved based on a maximum likelihood criterion and maximum a posteriori criterion, whose goal is to maximize the probability of correctly marking the target sequence in the training set.
Whether the current word (x) is a fixation word (y i ) depends not only on the feature value of the current word (x), but also on whether the previous word is a fixation word (y i−1 ).This coincides with the characteristics of the linear chain CRF.

CNN-LSTM-CRF Architecture.
In the final stage, we built our neural network model by feeding the output vector of the bi-LSTM into the CRF layer.Figure 3 details our network architecture.

Model Training
In this section, we provide detailed information on training neural networks.We use the keras-contrib library [36] to implement a neural network that contains useful extensions to the official Keras package [37].Model training is run on the GeForce GTX 1070 graphical processing unit.It takes approximately 20 min to complete the model training using the setup discussed in this section.

Datasets.
There are several eye-tracking corpora in existence.Among these, the Dundee corpus [38] is notable.However, the Dundee corpus is not publicly available due to licensing restrictions.Our experiments used data from the Provo corpus [39], which is publicly accessible and may be downloaded from the Open Science Framework at https://osf.io/sjefs.The eye-tracking corpus has eye movement data from 84 native English-speaking participants, all of whom read the complete 55 texts for comprehension, including online news articles, popular science magazine articles, and public domain works of fiction.Eye movements were documented using a SR Research EyeLink 1000 Plus eyetracker with a spatial resolution of 0.01 ∘ and a sampling rate of 1000 Hz (see [39] for details).
For the experiments reported here, the corpus was randomly divided into three data sets in the following proportions: 60% texts for training, 20% texts for development and validation, and the last 20% texts for testing.

Features.
Evidence from the psychological literature indicates that the selection mechanism of fixation and saccade is determined by the combination of low-level visual cues (such as word length) and language cognitive factors of the text [7,40].The linguistic factors known to influence whether or not to skip words are word length, frequency, and predictability; that is, shorter words in the text are easier to skip than longer words.Predictability involves high-level cognitive factors that are difficult to quantify.Thus, we use parts of speech (POS) as a proxy to represent the highlevel cognitive factors.Words in the corpus are tagged for parts of speech using the Constituent Likelihood Automatic Word-Tagging System (CLAWS) [41].Using the CLAWS tags, words are divided into nine separate classes.The passages contain a total of 682 nouns, 502 verbs, 364 determiners, 287 prepositions, 227 adjectives, 196 conjunctions, 169 adverbs, 109 pronouns, and 153 other words and symbols.
Ultimately, the features used by the neural network consist of a tokenized word, word length (for low-level visual features), and parts of speech (for high-level cognitive features).

Training Procedure.
The model used in this paper was trained using general stochastic gradient descent (SGD), (1) for epoch in epochs: (2) for batch in batchs: (3) (1) bi-LSTM model forward pass: (4) forward pass for forward state LSTM (5) forward pass for backward state LSTM (6) (2) CRF layer forward and backward pass (7) (3) bi-LSTM model backward pass: (8) backward pass for forward state LSTM (9) backward pass for backward state LSTM (10) (4) update parameters For each training sample, the algorithm first initialized random weights and threshold parameters, then provided relevant input examples to the input layer neurons, and forwarded the signals layer by layer (input layer -> hidden layer -> output layer) until the output layer produced an output value.Then, the error of the output were calculated according to the output value, and then the error was reversely propagated to the neurons of the hidden layer, and finally the weight of the connection and the threshold of the neuron were adjusted according to the error calculated by the hidden layer neurons.The BP algorithm continually iteratively looped through the above steps until the conditions of stopping training were reached.
Since there were 1200 different words in the development set, and each sentence contained less than 60 words, we set the word embeddings dimension to 1200 and the word embeddings length to 60.In the one-dimensional convolution operation, it was equivalent to the feature extraction of n gram using neural network, so the optional parameters were 2, 3, and 4.
To find the optima values for hyperparameters, we used the following strategies: first, we determined the type of activation function, and then determined the type of cost function and the method of weight initialization.Secondly, according to the network topology, the number of neurons in each hidden layer in the neural network was determined.Then, for the remaining hyperparameters, a possible value was randomly given first.In the cost function, the existence of the regular term was not considered first, and the learning rate was adjusted to obtain a suitable threshold.Half of the threshold was taken as the initial value in the process of adjusting the learning rate.The size of the batch was determined through experiments.Then we used the determined learning rate to carefully adjust the learning rate and the validation data to select good regularization parameters.
After the regularization parameters were determined, we went back and reoptimize the learning rate.The number of epochs was determined by a whole observation through the above experiments.Among all the parameters, the type of neuron activation function should be selected first.The Sigmoid, Tanh, and ReLU functions are the most commonly used activation functions [42], but the Sigmoid and Tanh functions will encounter the problem of gradient disappearance, which is not conducive to the extension of the neural network to a deeper structure.The ReLU overcomes this problem because it solves the problem of gradient disappearance and therefore allows the neural network to extend to deeper layers.So we chose the ReLU as the activation function here.
Since our task was a classification task, the cost function that should be used in the experiment was the cross-entropy.For an input layer with n in neurons, the initialization weight is a Gaussian random distribution with a mean of 0 and a standard deviation of 1/√  .We determined the optimizer by random search.The super-parameters to be tuned included SGD [42], AdaDelta [43], Adam [44], and RMSProp [45].
The adjustment steps of the learning rate were as follows: first, we choose the estimation that the cost on the training data immediately began to decrease rather than oscillate or increase as the learning rate threshold, and it was not necessary to be too precise to determine the magnitude.If the cost began to decline in the first few rounds of training, we gradually increased the magnitude of the learning rate.If the cost function curve began to oscillate or increase, we tried to reduce the magnitude until the cost fell at the beginning of the round.Taking half of the threshold determined the learning rate.
Table 1 summarizes the relevant hyperparameters used in the experiments determined by the development set, the common strategies, and the random search method.The last column in the table is the parameter range of the random search.In order to avoid the overfitting problem, we used a simple dropout strategy [46] to drop 10% of the redundant nodes.

Experimental Verification
6.1.Evaluation Metrics and Baselines.We followed the evaluation metrics and baselines in NN09 [12] and HMKA12 [17].First, we quantified the number of words in the test set and used the fixation word rate for each subject as a baseline (see Table 2).Then, the accuracy of the fixation word prediction was used as an evaluation metrics, and the fixation/skip distribution of each word in the verification set was predicted.The accuracy of each predicted distribution was calculated from the distribution in the fixation data.Finally, the accuracy of each word in the validation set was averaged.

Result Analysis.
Based on the analysis in Section 5.2, we set up features to predict the fixation points.The examined features can be classified into two types: low-level visual features and high-level cognitive features.In the experiments, we explored the contribution of low-level visual and highlevel cognitive features individually and in combinations to prediction accuracy.
Based on the experimental settings discussed in Section 5, we run at least 100 times with different random initialization of parameters.All experiments were trained using Stochastic Gradient Descent (SGD).We reported the mean accuracy and the standard deviation (Δ std ) and thus more fully characterize the distribution of accuracies in the predictions.
The experimental results in Table 3 show that the word length feature (for low-level visual features, denoted by "WL") has an accuracy of 76.53%, while the part of speech feature (for high-level cognitive features, denoted by "POS") provides less accuracy.
The achieved test performances can be plotted in a violin plot as shown in Figure 4.The violin plot is similar to a boxplot; however, it estimates from the samples the probability density function and depicts it along the Y-axis.If a violin plot is wide at a certain location, then achieving this test performance is especially likely.Besides the probability density it also shows the median as well as the quartiles.In Figure 4 we can observe that the WL feature usually results in a higher performance than the POS feature for the fixation prediction task.Hence, we can conclude that the low-level visual feature is a better option for this task.
We also considered combinations of the two feature types.From the Figure 5, we can see that adding other features to the WL feature barely contributes to an improvement in accuracy.Additionally, the influence of the POS feature on improving the accuracy is not obvious.The prediction accuracy obtained by the POS feature is similar to the baseline accuracy.These observations seem to imply that high-level cognitive features do not capture very much extra information when using the features separately.This would suggest that combined features work well only in conjunction with low-level visual features.
In summary, we can draw the conclusion that the lowlevel visual cues have a key impact on the selection of saccade targets compared with the high-level cognitive factors.
The authors in [12,17] used the Dundee corpus to train and test their models.Owing to licensing restrictions, our experiments are based on the data from the Provo corpus.Because of the different experimental settings, we cannot simply compare our experimental results with those from [12,17].However, considering that we were able to obtain similar accuracy as those using NN09 [12] or HMKA12 [17], and, moreover, we used far fewer features and required much less preprocessing than did NN09 or HMKA12 (see Table 4), these results indicate that the proposed RNN-based model performs well in simulating reading eye movements.

Application Example.
Interactive graphics environments require high refresh rates, high resolutions, and low latencies, each of which adds computational burden on the hardware.To address the problems, the state-of-the-art technology that integrates with eye tracking is known as foveated rendering [47].Foveated rendering can make the fixation point that the user is focusing on clearer and replaces the adjacent area with a blurred image, which is in line with the mode of human vision.In this way, the machine does not have to render the entire picture in detail, which can greatly reduce the computational burden on the GPU.However, accurate fixation tracking systems are still expensive and can only be accessed by a limited number of researchers or companies [48].Another way to compute the fixation point is to use an eye movement model that simulates the gaze behavior of human.while requiring fewer features.We believe that there are two reasons for this.On the one hand, it uses a convolution operation and objectively increases the number of training samples.On the other hand, patterns of fixations and saccades are driven in part by low-level visual cues and high-level linguistic and cognitive processing of the text.CRF can account for the transfer features and state features of label sequences, in line with how humans handle low-level visual features.A RNN is a time-recursive neural network that can process and foresee events following particularly large intervals and delays in a time series, in correspondence with human handling of high-level visual features.Placing the RNN before the CRF is equal to utilizing the language relationships extracted by the RNN to train the CRF.The proposed model takes advantage of the context of the text sequence and the label sequence, which is more in line with the reality of the reading process.

Conclusions
In this paper, a new type of eye movement model was developed and evaluated in terms of its ability to simulate eye movements of human reading.Theoretical analysis demonstrated that the RNN-based model always converges to a global solution.In comparison with conventional eye movement models, the new approach was shown to achieve similar accuracy in predicting a user's fixation points during reading.
In addition, the proposed model has less reliance on data features and requires less preprocessing than current machine learning models, which makes the proposed model attractive in a range of human-computer application areas.The verification results further demonstrate the novelty and efficacy of RNNs for simulating eye movements of human reading.

Figure 1 :
Figure 1: Eye track of an adult reader.Black dots indicate the position of the gaze point, the number next to each dot indicates the duration of each gaze point (in ms), and the arrow indicates the direction of the saccade.
(i) This paper proposes and tests a new type of eye movement model based on recurrent neural networks, which is quite different from previous research on eye movement model.(ii) The convergence of the RNN-based model for predicting eye movement of human reading is proved in this paper.

Figure 3 :
Figure3: The CNN-LSTM-CRF architecture for predicting fixation in reading.The first layer is the embedding layer, which vectorizes the text corpus by converting each text into a sequence of integers.The second layer is the CNN layer, which extracts word form information from characters in a word and encoding it into a neural representation.The third layer is the bidirectional LSTM neural network, which can effectively use past (left) and future (right) contexts.The fourth layer is the CRF layer, which is used for sentence-level sequence labeling.

Algorithm 1 :
The model training procedure.forward propagation, and backpropagation (BP) algorithms.The training procedure is shown in Algorithm 1.
of incorrect training examples on the training data set.The following lemmas are presented before the proof.