Hidden Semi-Markov Models to Segment Reading Phases from Eye Movements

Our objective is to analyze scanpaths acquired through participants achieving a reading task aiming at answering a binary question: Is the text related or not to some given target topic? We propose a data-driven method based on hidden semi-Markov chains to segment scanpaths into phases deduced from the model states, which are shown to represent different cognitive strategies: normal reading, fast reading, information search, and slow confirmation. These phases were confirmed using different external covariates, among which semantic information extracted from texts. Analyses highlighted some strong preference of specific participants for specific strategies and more globally, large individual variability in eye-movement characteristics, as accounted for by random effects. As a perspective, the possibility of improving reading models by accounting for possible heterogeneity sources during reading is discussed.

The considered parametric distributions for the quantitative variables were uniform, Poisson, binomial and negative binomial with shift parameters in the four cases. For example if D is a random variable distributed as the usual binomial distribution B(n, p) with probability p, the shifted binomial B(s, n, p) is the distribution of D+s where s is some (deterministic, unknown) additional shift parameter.
Since states are unknown (even after estimating λ), modellers cannot choose their values or interpretations. In practice one crucial tool to interpret parameters is matrix B. State k is mostly defined by the composition of state k in the different possible Read mode values, i.e., P(X t =Bwd-| S t =k), ..., P(X t =Fwd+ | S t =k). For example, if state k has a large probability P(X t =Fwd+ | S t =k) compared to P(X t =Fwd+ | S t =j) for the other states j, state k can be interpreted as a fast reading state, although even if that state P(X t =Bwd-| S t =k) > 0 so that regressions could occur occasionally with, say, a lower probability than if the state was j. The other main tool to interpret states is state restoration, which aims at finding a sequence of states values S 1 ,...,S T for a given scanpath summarized by the observed Read mode sequence of X 1 ,...,X T , such that S 1 ,...,S T is the most probable state sequence given X 1 ,...,X T and λ. This so-called restored sequence, computed by the Viterbi algorithm (see Yu, 2010), actually provides the segmentation in homogeneous zones with respect to the distribution of X t , such as the segmentations depicted in Figures S11 and S12 within this document. As a complement to interpret states, we can use the distributions of other covariates Y t that have not been used to estimate λ (e.g., saccade amplitude, fixation duration and reading speed).
The number K of hidden states is unknown and has to be determined. Here we used BIC (Boucheron & Gassiat, 2007) to select K. BIC is defined by minus the loglikelihood value (after maximisation with respect to parameters) plus a positive term, referred to as penalty, that depends on the number of model parameter and the logarithm of the sample size. Since the maximum loglikelihood necessarily growths as K increases, the penalty is introduced to avoid overfitting. The penalty ensures that if the sample size is large enough and that one of the compared models is the right one, BIC will find it with a probability that tends to one (this principle extends to generalised linear mixed models hereafter). In practice, we set a maximal value K max for K and for every value between K=1 and K= K max , we estimated a model with K states and computed its BIC. We then kept the value of K with lowest BIC value, indicating an optimal trade-off between data fit and model complexity. Models were discarded if they yielded mean segment lengths shorter than 4 fixations or longer than 25 fixations, since these were either specific of one single subject or could not be interpreted as reading strategies. In some cases, we identified that states actually were a fine-scale decomposition of some macroscopic state defined as a pattern involving short cycles between the fine-scale states. For example, a fast alternation between states 0 and 1, which always occurred together was observed. In this case, these cycles were merged into macroscopic states referred to as "phases" for the sake of interpretability. This was only observed for states 0 and 1 which was merged to define phase 1. For the other states, the state and the associated phase were identical.
To highlight subject variability in scanpaths, correspondence analysis (CA) (Greenacre, 1984) and a chi-squared independence test were performed on the contingency table defined by the number of fixations in each phase for each subject, merging fixations from all scanpaths.
The effect of text type was assessed by tests in regression models. Three families of models were considered: linear mixed models (LMMs), binomial generalized linear mixed models (BGLMMs) and multinomial generalized linear mixed models (MGLMMs) with Gaussian random individual effects. LMMs were used to assess effects of covariates on quantitative variables (reading speed and fixation durations computed for each scanpath) using the lmer package of the R software (Venables & Ripley, 2002). Significance of fixed effects within a given model was determined by ANOVAs. BGLMMs were used to assess effects of covariates on binary variables (occurrence of phase transitions or not) using the glmer package in R. Model selection regarding fixed effects was achieved by computing BIC on the whole collection of models built from all subsets of covariates and their interactions. In practice, for each possible subset of covariates and interactions, we estimated that model and scored it with BIC. We kept the set of covariates and interaction minimising BIC, meaning that the effects of covariates and interactions absent from that model can be ignored, from a statistical point of view. The BIC for mixed models requires a specific definition to account for random effects and their variance possibly being equal to zero (at the boundary of the set of parameters). We used the adaptation proposed in (Delattre et al., 2014).
Significance of individual effects was assessed by comparing BIC values of the models with the best set of covariates, considering in turn variants with and without individual random effects. This was complemented by the use of confidence intervals on the standard deviation of random effects, using profile likelihood as described in (Bates et al., 2014). MGLMMs were used to assess the effect on nominal categorical variables (Read mode and Phase) using Bayesian estimation with the MCMCGlmm package in R (Hadfield, 2010), since no referenced package could estimate MGLMMs by maximum likelihood while providing confidence intervals and BIC values. Significance of individual effects was assessed using credibility intervals at level 0.995 on variance parameters. In the case of MGLMMs, we used DIC (Spiegelhalter et al., 2002) instead of BIC to assess significance of fixed effects. DIC is an information criterion such that penalises data fit by model complexity. BIC has a closed-form penalty, while the penalty in DIC is implicit. The model minimizing the considered criterion (either BIC or DIC, depending on models) was selected.
Consistently with the workflow describe above, the effect of text type (categorical variables HR, MR and UR) on the number of fixations per scanpath, fixation duration, saccade amplitude, reading speed, and Read mode frequencies were assessed using MGLMMs and the MCMCGlmm package in R. The effect on quantitative variables was assessed using LMMs. In both cases, random subject effects were included and their significance was tested. Effects of categorical predictors were tested using analyses of variance (ANOVAs). Normality of residuals in LMMs was assessed using Shapiro-Wilk normality tests complemented with histograms of empirical residuals.
"Trigger words" were detected using a FastText representation of words (Joulin et al., 2017). This consisted in embedding words into Euclidean spaces, allowing for computing semantic proximities between words using Euclidean metrics (here, the cosine distance). Only two "trigger words" per text were defined. Trigger words were the two closest words to target topic in HR / HR+ texts. In HR+ texts by definition, at least one word had cosine similarity 1. It was required in HR+ texts that the second closest word had minimal cosine similarity 0.3, otherwise only one "trigger word" was defined. It was required in HR texts that both closest words had minimal cosine similarity 0.3. Indeed, HR texts could have a very progressive semantic progression towards target topic, without clear trigger word. A threshold of 0.3 allowed to exclude these situations: HR scanpaths where all fixed words had cosine similarity less than 0.3 were ignored. In UR texts, trigger words corresponded to the two furthest words to target topic. Finally in MR texts, the two trigger words corresponded to the closest word and to the furthest word to target topic (no required bounds on cosine similarity).
Since HSMC states are random and hidden, the times of transitions are uncertain. Thus, instead of considering transition or not at trigger words, the effect of distance of transitions to trigger words was measured in number of fixations, focusing on trigger words with lowest distance to transitions. Its effect of transition probabilities was assessed using regression models. Firstly, frequencies for the distances associated to each incoming phase (among every possible distance for that phase) were modelled with linear mixed regressions, using distance, text type and phase as predictors, with subjects as random effects. Secondly, the binary random variable corresponding to occurrence or not of a transition at each possible distance of a fixation to trigger word was modelled with generalized linear mixed regressions. Binomial distributions were considered, using the canonical link function and the same three predictors as above. In both approaches, models with interactions of order 2 and 3 between predictors were estimated, in addition to models without interaction. Models were compared using BIC. The model with minimal BIC (referred to as M1) was then used to assess the effect of random subject effects, by comparing BIC with that of a model without random effects. M1 was also compared with the model obtained by removing distance as a predictor (referred to as M0). The justifications for using both approaches (linear models on frequencies or GLMMs on binary variables) were twofold: firstly, GLMMs easily suffer from lack of convergence for high-order interactions and thus some of these models cannot be compared. Secondly, the linear assumptions on frequencies seemed reasonable given the shape of the cloud of points (See Figure 6).
Since texts were rather short and total numbers of fixations were rather low (see Subsections "Materials" and "Summary statistics on observed data"), the effect of small distances increasing transition probabilities could be credited to distances being necessarily small, even if transitions were drawn at random and independently from the positions of trigger words. To assess this possible bias, randomized procedures were developed. In their spirits, these consisted in reassigning random positions to trigger words, though in practice refined procedures were developed to preserve some phase structure in scanpaths. The first one consisted in sampling transition locations (uniformly without replacement) and permuting the order of phases, thus constraining the number of transitions to remain the same within each scanpath. The second one consisted in sampling the number of transitions with replacement within their empirical distribution and drawing phase values uniformly, with the constraint that successive phases must be different. In both cases, the whole data set was resampled 1 000 times. Each time M1 and M0 were estimated again, as well as their difference in BIC. The percentage of differences obtained by resampling was compared to the true difference, thus assessing the significance of the distance effect.

Detailed presentation of regression models
We present here the regression models selected at each step of the statistical analysis (meaning, after selection by some information criterion).

(Absence of ) effect of text type on Read mode
Model: ,, where X t,s,p represents Read mode at position p within scanpath s with text type t. Table S1. Posterior distribution of GLMM parameters x  : posterior mean, left limit (l-95% CI) and right limit (r-95% CI) of a 95% credible interval, one minus level required for the credible interval to contain 0 (pMCMC)

Read mode
Posterior mean l-95% CI u-95% CI pMCMC    Table S4. HSMC sojourn duration and emission probabilities per state. B(i, n, p) is the shifted Binomial distribution with shift parameter i, number of trials n and probability p. NB(i, q, p) is the shifted Negative Binomial distribution with shift parameter i, shape parameter n and probability p. ∞ indicates an absorbing state. Emission probabilities are the probabilities of each possible ReadMode value in each state.

Supplementary figures
Boxplots indicate the three quartiles of distributions and 1.5 interval quartile ranges of the lower and upper quartiles.  Figure S1. Scanpath of some text read by subject 8. Fixations are numbered by order. Translation: "As if echoing the self-confidence regained since the end of the war, the Foreign Affairs committee of Senate set up a date to begin auditions and to proceed to nomination, which had been blocked for ten months".        Figure S11. Sample distribution of Read mode for each text type. Read mode 0 corresponds to Bwd-, 1 to Bwd, 2 to Rfx, 3 to Fwd and 4 to Fwd+.
13 Figure S12. Estimated HMSC parameters. Lines 1 and 2: estimated emission and sojourn state distributions for each state. States are in columns. Line 1: emissions distributions are defined as P(Xt=x | St=k) for the states k and Read mode values x. The latter are coded as follows: Bwd-/ 0 / Wt < -1; Bwd / 1 / Wt = -1; Rfx / 2 / Wt = 0; Fwd / 3 / Wt = 1; Fwd+ / 4 / Wt > 1. State 4 is absorbing and has not estimated sojourn time distribution. Red vertical bars correspond to the number of observations (Line 1: Read mode, Line 2: sojourn durations) for the different states, which are restored by the Maximum A Posteriori principle. Green curves correspond to the distributions estimated by the EM algorithm, with their areas under curve rescaled by the number of observations. Line 3: estimated transition diagram between states. Vertices correspond to states and, arcs to transition probabilities above 0.01. The initial state distribution is represented by pink arrows pointing to possible initial states but issued from no other state.
15 Figure S13. Scanpath of some UR text with phase restoration. Target topic is "Contemporary art". Phase 3 (fast reading) is in red. Translation: "Judicial investigation for accidental injury due to negligence was opened after a six-year-old boy fell from a train joining Paris from Toulouse and lacking of secure locking mechanism". The words framed in black ("accidental" and "negligence") are the farthermost to target topic. Figure S14. Scanpath of some MR text with phase restoration. Target topic is "Conflict in Irak". Phase 3 (fast reading) is in red and Phase 4 (slow confirmation) in purple. Translation: "As if echoing the self-confidence regained since the end of the war, the Foreign Affairs committee of Senate set up a date to begin auditions and to proceed to nomination, which had been blocked for ten months".