Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus

The surprisal of a word on a probabilistic grammar constitutes a promising complexity metric for human sentence comprehension difficulty. Using two different grammar types, surprisal is shown to have an effect on fixation durations and regression probabilities in a sample of German readers' eye movements, the Potsdam Sentence Corpus. A linear mixed-effects model was used to quantify the effect of surprisal while taking into account unigram frequency, bigram frequency (transitional probability), word length, and empirically-derived word predictability; the so-called "early" and "late" measures of processing difficulty both showed an effect of surprisal. Surprisal is also shown to have a small but statistically non-significant effect on empirically-derived predictability itself. This work thus demonstrates the importance of including parsing costs as a predictor of comprehension difficulty in models of reading, and suggests that a simple identification of early measures with syntactic parsing costs, and of late measures with the durations of post-syntactic events, may be difficult to uphold.

Reading a sentence involves a succession of fixations and saccades, with information uptake occurring mainly during fixations. The duration of a fixation at a word is known to be affected by a range of word-level factors such as token frequency and empirical predictability as measured in a Cloze task with human subjects (Taylor, 1953; Ehrlich & Rayner, 1981; Kliegl, Grabner, Rolfs, & Engbert, 2004).
When words appear in sentences, as opposed to in isolation, their occurrence is evidently affected by syntactic, semantic and other factors. Research within psycholinguistics over the past half-century has exposed the role of some of these sentence-level factors in accounting for eye movements. Clifton et al. (2007) provide a review of this work, and call for the development of explicit theories that combine word-level and sentence-level factors. Of course, such combined models would be unnecessary if it turned out that sentence-level factors actually have very little effect on eye movements. These sorts of factors do not figure in current models of eye-movement control such as E-Z Reader (Pollatsek, Reichle, & Rayner, 2006) and SWIFT (Engbert, Nuthmann, Richter, & Kliegl, 2005), whose difficulty predictions derive primarily from statistical properties of individual words and their immediate neighbors.
In this paper, we cast doubt on this simpler view by exhibiting a quantitative model that takes into account both word-level and sentence-level factors in explaining eye fixation durations and regression probabilities. We show that the surprise value of a word, on a grammar-based parsing model, is an important predictor of processing difficulty independent of factors such as word length, frequency, and empirical predictability. This result harmonizes with the rise of probabilistic theories in psycholinguistics defined over grammatical representations such as constituents and dependency relations (Jurafsky, 1996; Crocker & Brants, 2000; Keller, 2003). In addition to demonstrating the effect of surprisal on eye-movement measures, we also show that surprisal has a small but statistically nonsignificant effect on empirical predictability.
The paper is organized into three sections. The first section explains the concept of surprisal, summarizing the Hale (2001) formulation. The second section marshals several predictors (surprisal, word length, unigram frequency, bigram frequency, i.e., transitional probability in the sense of McDonald & Shillcock, 2003, and empirical predictability values) in a quantitative model of fixation durations and regression probabilities. We fit this model to the measurements recorded in the Potsdam Sentence Corpus (Kliegl, Nuthmann, & Engbert, 2006), making it possible to determine which predictors account for readers' fixation durations and regressive eye movements. The last section discusses implications of this fitted model for various linking hypotheses between eye movement measures and parsing theories. This final section also discusses the implications of the results for E-Z Reader (Pollatsek et al., 2006) and SWIFT (Engbert et al., 2005).

Surprisal
Surprisal is a human sentence processing complexity metric; it offers a theoretical reason why a particular word should be easier or more difficult to comprehend at a given point in a sentence. Although various complexity metrics have been proposed over the years (Miller & Chomsky, 1963; Kaplan, 1972; Gibson, 1991; Stabler, 1994; Morrill, 2000; Rohde, 2002; Hale, 2006), surprisal has lately come to prominence within the field of human sentence processing (Park & Brew, 2006; Levy, in press; Demberg & Keller, 2008). This renewal of interest coincides with a growing consensus in that field that both absolute as well as graded grammatical factors should figure in an adequate theory. Surprisal combines both sorts of considerations.
This combination is made possible by the assumption of a probabilistic grammar. Surprisal presupposes that sentence-comprehenders know a grammar describing the structure of the word-sequences they hear. This grammar not only says which words can combine with which other words but also assigns a probability to all well-formed combinations. Such a probabilistic grammar assigns exactly one structure to unambiguous sentences. But even before the final word, one can use the grammar to answer the question: what structures are compatible with the words that have been heard so far? This set of structures may contract more or less radically as a comprehender makes their way through a sentence.
The idea of surprisal is to model processing difficulty as a logarithmic function of the probability mass eliminated by the most recently added word. This number is a measure of the information value of the word just seen as rated by the grammar's probability model; it is nonnegative and unbounded. More formally, define the prefix probability of an initial substring w = w_1 ⋯ w_n to be the total probability of all grammatical¹ analyses that derive w as a left-prefix (definition 1):

    α_n = Σ_v Σ_{d ∈ D(G, wv)} p(d)    (definition 1)

Where the grammar G and prefix string w (but not w's length, n) are understood, this quantity is abbreviated² by the forward probability symbol α_n. Then the surprisal of the n-th word is the log-ratio of the prefix probability before seeing the word, compared to the prefix probability after seeing it (definition 2):

    surprisal(w_n) = log₂(α_{n−1} / α_n)    (definition 2)
As the logarithm of a probability, this quantity is measured in bits.
Consider some consequences of this definition. Using a law of logarithms, one could rewrite definition 2 as

    surprisal(w_n) = log₂ α_{n−1} − log₂ α_n

On a well-defined probabilistic grammar, the prefix probabilities α are always less than one and strictly nonincreasing from left to right, so this difference is always nonnegative. For instance, if a given word brings the prefix probability down from 0.6 to 0.01, the surprise value is log₂(0.6/0.01) ≈ 5.91 bits.
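The arithmetic in definition 2 is small enough to sketch directly. The following snippet is illustrative only (it is not part of the original study); it reproduces the worked examples from the text.

```python
import math

def surprisal_bits(alpha_prev: float, alpha_curr: float) -> float:
    """Surprisal of the n-th word (definition 2): the log-ratio of the
    prefix probability before the word to the prefix probability after it."""
    if not (0.0 < alpha_curr <= alpha_prev <= 1.0):
        raise ValueError("prefix probabilities must be nonincreasing and in (0, 1]")
    return math.log2(alpha_prev / alpha_curr)

# Prefix probability drops from 0.6 to 0.01: about 5.91 bits of surprise.
print(round(surprisal_bits(0.6, 0.01), 2))   # 5.91
# Later in the paper: 1.0 -> 0.6 at 'alte' yields 0.737 bits.
print(round(surprisal_bits(1.0, 0.6), 3))    # 0.737
```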
Intuitively, surprisal increases when a parser is required to build some low-probability structure. The key insight is that the relevant structure's size need not be fixed in advance as with Markov models. Rather, appropriate probabilistic grammars can provide a larger domain of locality. This paper considers two probabilistic grammars, one based on hierarchical phrase-structure³ and another based on word-to-word dependencies. These two grammar-types were chosen to illustrate surprisal's compatibility with different grammar formalisms. Since the phrase-structure approach has already been presented in Hale (2001), the next two sub-sections elaborate the dependency grammar approach.

¹ In this definition, G is a probabilistic grammar; the only restriction on G is that it provide a set of derivations D that assign a probability to particular strings. When D(G, u) = ∅ we say that G does not derive the string u. The expression D(G, wv) denotes the set of derivations on G that derive w as the initial part of a larger string, the rest of which is v. See Jurafsky and Martin (2000), Manning and Schütze (2000) or Charniak (1993) for more details on probabilistic grammars.

² Computational linguists typically define a state-dependent forward probability α_n(q) that depends on the particular destination state q at position n. These values are indicated in red inside the circles in figure 3(a). It is natural to extend this definition to state sets by summing the state-dependent α values for all members. To define the surprisal of a left-contextualized word on a grammar, the summation ranges over all grammatically-licensed parser states at that word's position. The notation α_n (without any parenthesized q argument) denotes this aggregate quantity.

³ The probabilistic context-free phrase-structure grammars were unlexicalized. For this purpose, we adapted Levy's implementation of the Stolcke (1995) parser; see Stolcke (1995) for more information on the methods used in this work.
Estimating the parser's probability model

A probabilistic dependency parser can proceed through a sentence from left to right, connecting words that stand in probable head-dependent relationships (Nivre, 2006). In this paper, parser-action probabilities are estimated from the union of two German newspaper corpora, NEGRA (Skut, Krenn, Brants, & Uszkoreit, 1997) and TIGER (König & Lezius, 2003), as in Figure 1. Figure 1 defines the method of estimating the parser probabilities from the corpus data. A simulation of the parser is run on the training data, yielding a series of parser states and transitions for all sentences in the corpora. This information informs several features (Hall, 2007), which are then used to condition the probabilities of each transition. A maximum-entropy model (Charniak & Johnson, 2005) was used to weight each feature instance for better accuracy.
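As a rough sketch of the estimation step, one can condition transition probabilities on feature contexts extracted from simulated parser runs. The feature and action names below are hypothetical, and raw relative frequencies stand in for the paper's maximum-entropy model, which additionally weights features:

```python
from collections import Counter, defaultdict

def estimate_transition_probs(simulated_runs):
    """Relative-frequency estimate of P(transition | feature context).
    `simulated_runs` is a list of (features, transition) pairs collected
    by running the parser over the training treebank."""
    counts = defaultdict(Counter)
    for features, transition in simulated_runs:
        counts[features][transition] += 1
    return {
        ctx: {t: n / sum(c.values()) for t, n in c.items()}
        for ctx, c in counts.items()
    }

# Toy training record: (top-of-stack POS, next-word POS) -> parser action
runs = [(("ART", "ADJA"), "shift"),
        (("ART", "ADJA"), "left-arc"),
        (("ART", "ADJA"), "shift"),
        (("ADJA", "NN"), "left-arc")]
probs = estimate_transition_probs(runs)
print(probs[("ART", "ADJA")]["shift"])   # 2/3
```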

Estimating surprisal
The prefix probability (definition 1) may be approximated to any degree of accuracy k by summing up the total probability of the top k most probable analyses defined by the dependency parser. Then surprisals can be computed by applying definition 2, following Boston and Hale (2007). Figure 2 shows the surprisals associated with just two of the words in Example 3.
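A minimal sketch of this k-best approximation follows; the analysis probabilities are hypothetical stand-ins for parser output, chosen so that the top three sum to 0.6 as in the figure 3 example:

```python
import math

def prefix_probability_topk(analysis_probs, k):
    """Approximate the prefix probability (definition 1) by summing the
    probabilities of the k most probable analyses of the prefix."""
    return sum(sorted(analysis_probs, reverse=True)[:k])

def surprisal_topk(prev_analyses, curr_analyses, k):
    """Surprisal (definition 2) under the k-best approximation."""
    a_prev = prefix_probability_topk(prev_analyses, k)
    a_curr = prefix_probability_topk(curr_analyses, k)
    return math.log2(a_prev / a_curr)

# After 'der' the prefix probability is 1.0; on 'alte' the top three
# reachable states (hypothetical values) sum to 0.6.
print(round(surprisal_topk([1.0], [0.3, 0.2, 0.1, 0.05], k=3), 3))  # 0.737
```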
Figure 2 also depicts the dependency relations for this sentence, as annotated in the Potsdam Sentence Corpus.⁴ Following Tesnière (1959) and Hayes (1964), the word at the arrow head is identified as the 'dependent'; the other is the 'head' or 'governor'. The associated part-of-speech tag is written below each actual word; this figures into the surprisal calculation via the parser's probability model. The thermometers indicate surprisal magnitudes; at alte, 0.74 bits amounts to very little surprise. In TIGER and NEGRA newspaper text, it is quite typical to see an adjective (ADJA) following an article (ART) unconnected by any dependency relation. By contrast, the preposition in is most unexpected. Its surprisal value is 23.83 bits.

The surprisal values are the result of a calculation that makes crucial reference to instantaneous descriptions of the incremental parser. Figure 3(a) schematically depicts this calculation. At the beginning of Example 3, the parser has seen der, but the prefix probability is still 1.0, reflecting the overwhelming likelihood that a sentence begins with an article. Hearing the second word alte, the top k = 3 destination states are, for example, q8, q17 and q26 (the state labels are arbitrary). Figure 3(b) reads off the grammatical significance of these alternative destinations: either alte becomes a dependent of der, or der becomes a dependent of alte, or no dependency is predicated. Each transition from state q1 to states q8, q17 and q26 has a corpus-estimated probability denoted by the values above the arc (e.g., the transition probability to q8 is 0.3). Approximating definition 1, we find that the total probability of all state trajectories⁵ arriving in one of those top 3 states is 0.6, and thus the surprisal at alte is 0.740 bits.
When the parser arrives at in, the prefix probability has made its way down to 6.9 × 10⁻⁶³. Such minuscule probabilities are not uncommon in broad-coverage modeling. What matters for the surprisal calculation is not the absolute value of the prefix probability, but rather the ratio between the old prefix probability and the new prefix probability. A high α_{n−1}/α_n ratio means that structural alternatives have been reduced in probability or even completely ruled out since the last word.
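In practice, implementations track prefix probabilities in log space, since values like 6.9 × 10⁻⁶³ quickly underflow ordinary floating-point arithmetic when many analyses are multiplied together; surprisal then falls out as a difference of logs. A sketch, with hypothetical numbers:

```python
import math

def logsumexp2(log_probs):
    """Sum probabilities given in log2 space without underflow."""
    m = max(log_probs)
    return m + math.log2(sum(2 ** (lp - m) for lp in log_probs))

# Surprisal as a difference of log2 prefix probabilities: even when the
# absolute probabilities are astronomically small, the ratio is well-behaved.
log_alpha_prev = -180.0                                 # alpha_{n-1} = 2^-180
log_alpha_curr = logsumexp2([-203.0, -204.0, -205.5])   # top-k states at word n
print(round(log_alpha_prev - log_alpha_curr, 2))        # surprisal in bits
```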
For instance, the action that attaches the preposition in to its governing verb goss is assigned a probability of just over one-third. That action in this left-context leads to the successor state q88 with the highest forward probability (indicated inside the circles in red). Metaphorically, the preposition tempers the parser's belief that goss has only a single dependent. Of course, k-best parsing considers other alternatives, such as state q96, in which no attachment is made, in anticipation that some future word will attach in as a left-dependent. However, these alternative actions are all dominated by the one that sets up the correct goss-in dependency. This relationship would be ignored in a 3-gram model because it spans four words. By contrast, this attachment is available to the Nivre (2006) transition system because of its stack-structured memory. In fact, attachments to stets, 'always', ein, 'a', and wenig, 'little', are all excluded from consideration because the parser is projective, i.e., does not allow crossing dependencies (Kahane, Nasr, & Rambow, 1998; Buch-Kromann, 2007). The essence of the explanation is that difficult words force transitions through state-sets whose forward probability is much smaller than at the last word. This explanation is interpretable in light of the linguistic claims made by the parser. However, the explanation is also a numerical one that can be viewed as just another kind of predictor. The next section applies this perspective to modeling observed fixation durations and regression frequencies.
Predicting eye movements: The role of surprisal

Having sketched a particular formalization of sentence-level syntactic factors in the previous section, this section takes up several other factors (table 1) that figure in models of eye-movement control. Two subsections report answers to two distinct but related questions. The first question is: can surprisal stand in, perhaps only partly, for empirical predictability? If empirical predictability could be approximated by surprisal, this would save eye-movement researchers a great deal of effort; there would no longer be a need to engage in the time-consuming process of gathering predictability scores. Unfortunately, the answer to this first question is negative: including surprisal in a model that already contains word-level factors such as length and bigram frequency does not allow it to do significantly better at predicting empirical predictability scores in the Cloze-type data we considered.
The second question pertains to eye-movement data. The second subsection proceeds by defining a variety of dependent measures commonly used in eye movement research. Then it takes up the question: does adding surprisal as an explanatory factor result in a better statistical model of eye-movement data? The answer here is affirmative for a variety of fixation duration measures as well as regression likelihoods.

Does surprisal approximate empirical predictability?
The Potsdam Sentence Corpus (PSC) consists of 144 German sentences overlaid with a variety of related information (Kliegl, Nuthmann, & Engbert, 2006). One kind of information comes from a predictability study in which native speakers were asked to guess a word given its left-context in the PSC (Kliegl et al., 2004). The probability of correctly guessing the word was estimated from the responses of 272 participants. This diverse pool included high school students, university students, and adults as old as 80 years. As a result of this study, every PSC word (except the first word of each sentence, which has no left context) has associated with it an empirical word-predictability value that ranges from 0 to 1 with a mean (standard deviation) of 0.20 (0.28). These predictability values were submitted to a logit transformation in order to correct for the dependency between mean probabilities and the associated standard deviations; see Kliegl et al. (2004) for details.
Table 1 enumerates a set of candidate factors hypothesized to influence logit predictability as sampled in the Kliegl et al. (2004) study. The candidate factors were taken into account simultaneously in a linear mixed-effects model (Pinheiro & Bates, 2000; Bates & Sarkar, 2007; Gelman & Hill, 2007) with sentences as random factors.
The Deviance Information Criterion or DIC (Spiegelhalter, Best, Carlin, & Linde, 2002; Spiegelhalter, 2006; Gelman & Hill, 2007, 524-527) was used to compare the relative quality of fit between models. The DIC depends on the summary measure of fit deviance, d = −2 × log-likelihood. Adding a new predictor that represents pure noise is expected to reduce deviance by 1; more generally, adding k noise predictors will reduce deviance by an amount corresponding to the χ² distribution with k degrees of freedom. DIC is the sum of the mean deviance and the effective number of parameters; mean deviance is the average of the deviance over all simulated parameter vectors, and the effective number of parameters depends on the amount of pooling in the mixed-effects model. Thus, in mixed-effects models DIC plays the role of the Akaike Information Criterion (Akaike, 1973; Wagenmakers & Farrell, 2004), in which the number of estimated parameters can be determined exactly.
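For concreteness, DIC can be computed from posterior simulations as follows. This is a sketch under stated assumptions: the deviance draws are hypothetical numbers, loosely scaled to the simpler model's reported DIC of 2229, not the paper's actual simulations.

```python
import random
from statistics import mean

def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC (Spiegelhalter et al., 2002): mean deviance plus the effective
    number of parameters pD, where pD is the mean deviance minus the
    deviance evaluated at the posterior mean of the parameters."""
    d_bar = mean(deviance_samples)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d   # equivalently: deviance_at_posterior_mean + 2 * p_d

# Hypothetical posterior deviance draws for a fitted mixed-effects model.
random.seed(1)
draws = [random.gauss(2222.0, 3.0) for _ in range(4000)]
print(round(dic(draws, deviance_at_posterior_mean=2215.0)))
```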
In the linear mixed-effects models, neither version of surprisal showed a statistically significant effect.⁶ However, the sign of the coefficient was negative for both variants of surprisal, and DIC values were lower when surprisal was added as a predictor. This is as expected: more surprising words are harder to predict. The DIC was 2229 for the simpler model, versus 2220 for each of the two more complex models. Table 2 summarizes the models including surprisal as a predictor.
In sum, the analyses show that surprisal scores exhibit rather weak relations with empirical predictability scores; indeed, they are much weaker than unigram frequency and word length, as well as corpus-based bigram frequency. Given the reduction in DIC values, however, including surprisal as part of an explanation for empirical word predictability appears to be motivated. This finding is consistent with the intuition that predictability subsumes syntactic parsing cost, among other factors, although clearly surprisal is not the dominant predictor.
The relation between surprisal and empirical word predictability, though weak, nevertheless raises the possibility that surprisal scores may account for variance in fixation durations independent of the variance accounted for by empirical predictability. We investigate this question next using eye movement data from the Potsdam Sentence Corpus.

Does surprisal predict eye movements?
Surprisal formalizes a notion of parsing cost that appears to be distinct from any similar cost that may be subsumed in empirical predictability protocols.It may thus provide a way to account for eye movement data by bringing in a delimited class of linguistic factors that are not captured by conscious reflection about upcoming words.
To investigate this question empirically, we chose several of the dependent eye movement measures in common use (tables 3 and 4). A distinct class of "first pass" measures reflects the first left-to-right sweep of the eye over the sentence. A second distinction relates to "early" and "late" measures. A widely accepted belief is that the former but not the latter reflect processes that begin when a word is accessed from memory (Clifton et al., 2007, 349). Although these definitions are fairly standard in the literature, controversy remains about the precise cognitive process responsible for a particular dependent measure.
In general, human comprehenders tend to read more slowly under conditions of cognitive duress. For instance, readers make regressive eye movements more often and go more slowly during the disambiguating region of syntactically-ambiguous sentences (Frazier & Rayner, 1982). They also slow down when a phrase must be 'integrated' as the argument of a verb that does not ordinarily take that kind of complement; e.g., "eat justice" provokes a slowdown compared to "eat pizza." The surprisal complexity metric, if successful in accounting for eye movement data, would fit into the gap between these sorts of heuristic claims and measurable empirical data, alongside computational accounts such as Green and Mitchell (2006), Vasishth and Lewis (2006), Lewis et al. (2006) and Vasishth et al. (in press).
We used the dependent measures in tables 3 and 4 to fit separate linear mixed-effects models that take into account the candidate predictors introduced in the last section: the n-gram factors, word length, and empirical predictability. For the analysis of regression probabilities (coded as a binary response for each word: 1 signified that a regression occurred at a word, and 0 that it did not), we used a generalized linear mixed-effects model with a binomial link function (Bates & Sarkar, 2007; Gelman & Hill, 2007). Sentences and participants were treated as partially crossed random factors; that is, we estimated the variances associated with differences between participants and differences between sentences, in addition to residual variance of the dependent measures. Then we compared the Deviance Information Criterion values of these simpler models with those of more complex models that had an additional predictor: either dependency-grammar based surprisal or phrase-structure based surprisal. The calculation of the dependent measures was carried out using the em package developed by Logačev and Vasishth (2006). Regarding first-fixation durations, only those values were analyzed that were nonidentical to single-fixation durations. In each reading-time analysis reported below, reading times below 50 ms were removed and the dependent measures were log transformed. All predictors were centered in order to render the intercept of the statistical models easier to interpret.
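The preprocessing steps just described (trimming short fixations, log-transforming durations, centering predictors) can be sketched as follows; the record layout and field names are hypothetical, not those of the em package:

```python
import math

def preprocess(fixations, predictor_names):
    """Sketch of the reading-time preprocessing: drop durations below
    50 ms, log-transform the dependent measure, and center each predictor
    (subtract its mean, so the model intercept is interpretable)."""
    kept = [f for f in fixations if f["duration_ms"] >= 50]
    for f in kept:
        f["log_duration"] = math.log(f["duration_ms"])
    for name in predictor_names:
        m = sum(f[name] for f in kept) / len(kept)
        for f in kept:
            f[name + "_c"] = f[name] - m
    return kept

data = [{"duration_ms": 40, "surprisal": 2.0},   # removed (< 50 ms)
        {"duration_ms": 200, "surprisal": 1.0},
        {"duration_ms": 300, "surprisal": 3.0}]
out = preprocess(data, ["surprisal"])
print(len(out), round(out[0]["surprisal_c"], 2))  # 2 -1.0
```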

Results
The main results of this paper are summarized in tables 5, 6, 7, and 8. In the multiple regression tables 6-8, a predictor is statistically significant if the absolute t-value is greater than two (p-values are not shown for the reading-time dependent measures because in linear mixed-effects models the degrees of freedom are difficult to estimate; Gelman & Hill, 2007).
In order to facilitate comprehension, the multiple regression tables 6-8 are summarized in a more compact form in figures 4 and 5. The graphical summary has the advantage that it is possible, at a glance, to see the consistency in the signs of the coefficients across different measures; the tables will not yield this information without a struggle. The figures are interpreted as follows. The error bars signify 95% confidence intervals for the coefficient estimates; consequently, if an error bar does not cross the zero line, the corresponding coefficient is statistically significant. This visual test is equivalent to computing a t-value.
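The equivalence between the two criteria (an error bar that misses zero versus |t| > 2, using the ±2 standard-error approximation to the 95% interval) can be checked directly:

```python
def significant_by_t(coef, se):
    """|t| > 2 criterion used in the regression tables."""
    return abs(coef / se) > 2

def significant_by_ci(coef, se):
    """Does the approximate 95% interval (coef +/- 2*SE) miss zero?"""
    lo, hi = coef - 2 * se, coef + 2 * se
    return lo > 0 or hi < 0

# The two criteria agree for any (hypothetical) coefficient estimate.
for coef, se in [(0.30, 0.10), (0.15, 0.10), (-0.25, 0.12)]:
    assert significant_by_t(coef, se) == significant_by_ci(coef, se)
print("ok")
```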
In general, both early and late fixation-duration based dependent measures exhibited clear effects of unigram frequency, bigram frequency, and logit predictability after statistically controlling for the other predictors (figures 4, 5). One exception was first-fixation duration (which excludes durations that were also single-fixation durations); here, the effect of predictability was not significant. These simpler models were augmented with one of two surprisal factors, one based on dependency grammar, the other based on phrase-structure grammar. As summarized in table 5, for virtually every dependent measure the predictive error (DIC value) was lower in the more complex model that included surprisal. One exception was regression probability, for which the phrase-structure based surprisal predictions did not reduce DIC.
For fixation durations (tables 6, 7 and figures 4, 5), both versions of surprisal in general had a significant effect in the predicted direction (that is, longer durations for higher surprisal values). One exception was the effect of phrase-structure based surprisal on rereading time; here, reading time was longer for lower surprisal values. However, since the rereading time data are sparse (about 1/10th of the other measures; this sparseness is also reflected in the relatively wide confidence intervals for the coefficient estimates of rereading time), it may be difficult to interpret this result, especially given the consistently positive coefficients for surprisal in all other dependent measures. For regression probability (table 8), dependency-grammar based surprisal had a significant effect over and above the other predictors: an increase in surprisal predicts a greater likelihood of a regression. Phrase-structure based surprisal is not a significant predictor of regression probability, but its coefficient has the same sign as in the dependency-based model.

Discussion
The work presented in this paper showed that surprisal values calculated with a dependency grammar as well as with a phrase-structure grammar are significant predictors of reading times and regressions. The role of these surprisals as predictors remained significant even when empirical word predictability, n-gram frequency and word length were also taken into account. On the other hand, surprisal did not appear to have a significant effect on empirical predictability as computed in eye-movement research.
The high-level factor, surprisal, appears in both the so-called early and late measures, with comparable magnitudes of the coefficients. This finding is thus hard to reconcile with a simple identification of early measures with syntactic parsing costs and late measures with durations of post-syntactic events. It may be that late measures include the time-costs of syntactic processes initiated much earlier.
The early effects of parsing costs are of high relevance for the further development of eye-movement control models such as E-Z Reader (Pollatsek et al., 2006) and SWIFT (Engbert et al., 2005). In these models, fixation durations at a word are a function of word-identification difficulty, which in turn is assumed to depend on word-level variables such as frequency, length and predictability. Although these variables can account for a large proportion of the variance in fixation durations and other measures, we have shown that surprisal plays an important role as well. Of these three predictors, empirical predictability is an "expensive" input variable because it needs to be determined in an independent norming study and applies only to the sentences used in that study. This fact greatly limits the simulation of eye movements collected on new sentences. It had been our hope that surprisal measures (which can also be computed from available treebanks) could be used as a generally available substitute for empirical predictability. Our results did not match these expectations for the two types of surprisal scores examined here. Nevertheless, given the computational availability of surprisal values, surprisal is clearly a candidate for inclusion as a fourth input variable in future versions of computational models. As Clifton et al. (2007) note, no model of eye-movement control currently takes factors such as syntactic parsing cost and semantic processing difficulty into account. While some of this variance is probably captured indirectly by empirical predictability, the contribution of this paper is to demonstrate how syntactic parsing costs can be estimated using probabilistic knowledge of grammar.
Figure 3. Sketch of the surprisal calculation for Example 3 ("The old captain always poured a little rum in his tea"): (a) state-based surprisal calculation; (b) dependency grammar claims in parser states q.

Figure 4. Regression coefficients and 95% confidence intervals for the multiple regression using as predictors unigram and bigram frequency, 1/length, logit predictability and dependency grammar based surprisal.

Figure 5. Regression coefficients and 95% confidence intervals for the multiple regression, using as predictors unigram and bigram frequency, 1/length, logit predictability and phrase-structure based surprisal.

Table 1
Candidate explanatory factors for empirical predictability.

Table 3
Commonly used first-pass dependent measures of eye movement and the stages in parsing processes they are assumed to represent.

Table 4
Commonly used non-first-pass dependent measures of eye movement and the stages in parsing processes they are assumed to represent.

Table 5
Deviance Information Criterion values for the simpler model, which includes only the word-based statistical measures, and the more complex model, with surprisal added.

Table 6
Log unigram and bigram frequencies, 1/length, and the two surprisal variants as predictors of the so-called early fixation-duration based dependent measures (single-fixation duration and first-fixation duration). All predictors were centered. An absolute t-value of 2 or greater indicates statistical significance at α = 0.05.

Table 7
Log unigram and bigram frequencies, 1/length, and the two surprisal variants as predictors of the so-called late measures.