Quirky Quotes and Needles in the Haystack: Tracing Grammatical Change in Untagged Corpora

This paper discusses pivotal theoretical and methodological problems of historical corpus linguistics. In two case studies from Swedish language history, on the development of the epistemic adverb kanske and of the group genitive respectively, it illustrates how the use of qualitative methods in addition to corpus investigation can contribute to understanding grammatical change.


Introduction
Tracing grammatical change in historical corpora is a rewarding, albeit challenging task. In this paper I will discuss some of the theoretical traps and empirical pitfalls one is typically confronted with in historical corpus linguistics. First of all, we have to define what 'change' entails. This might seem rather straightforward, but 'change' is often confused with 'correspondence' (in Andersen's (2001) sense), and on closer inspection, a change may in fact comprise several smaller, primitive changes. One also has to bear in mind that change is gradual, which makes it difficult, if not impossible, to identify discrete stages in a change (or rather, in a chain of changes). And finally, change, particularly grammatical change, is not unidirectional, which makes it difficult to reconstruct.
For the reasons given in the preceding paragraph, the only viable method for uncovering change in times long past is historical corpus linguistics, but this brings along its own share of issues. The historical material may be fragmented, chronologically discontinuous, or stylistically imbalanced, all of which seriously reduces the possibility of collecting random samples. Automated quantitative studies are difficult, if not impossible, to carry out, particularly when the corpora lack part-of-speech (POS) tagging. In the absence of such tags, all one can do is search for particular strings ((parts of) words or larger units), using regular concordance software such as WordSmith (Scott 2004). Especially in older, non-standardized texts, one furthermore has to consider variation in spelling, which may be substantial. In this paper, I will discuss these theoretical and data-related challenges, illustrated by two case studies from Swedish language history. Specifically, I will discuss two types of data, which I will term 'quirky quotes' and 'needles in the haystack', and the crucial role they may play in the qualitative approach to change.
The nominal origin of the suffix, the Latin noun mens (FEM) 'mind', is traditionally illustrated by the following passage from Ovid's Metamorphoses XIII:
(1) consolor       socios      ut       longi     taedia        belli
    encourage-1SG  allies-ACC  so.that  long-GEN  boredoms-ACC  war-GEN
    mente     ferant           placida
    mind-ABL  bear-SUBJ.3.PL   quiet-ABL
    'I encourage our allies so that they may bear the boredom of the long war with a quiet mind'
In the course of time, mens went through a series of metonymic meaning changes illustrated in (2) (for details of this development see …):
(2) 'mental state of the participant in the event' > 'way in which the event is perceived' > 'manner in which the event takes place'
Once the meaning 'manner' had been established, mente collocations were no longer restricted to adjectives expressing states of mind, sparking an increase in type frequency to the extent that adverbials involving mente replaced a number of the Classical Latin adverbs ending in -e and -iter. Crucially however, the change from FEM.SG.ABL noun to derivational suffix was not a straightforward one. Thus Bauer (2003: 447), in a careful study of the Vulgate Bible, has shown that mente adverbials were far less frequent than animo adverbials, yet the latter did not develop into a suffix. A second problem is that it is not exactly known whether mente adverbials originated in the spoken or in the written language (see Hummel 2000 for discussion). Other loose ends in the history of MENTE include its morphological status as well as distributional differences at different stages of development in different Romance varieties (Norde 2009: 44-46). Finally, the development from Latin mens to MENTE has been shown to be discontinuous in some varieties, e.g. Spanish, in which the form -mente was borrowed, presumably from Aragonese, Catalan or even French, to replace the native suffix -mientr(e) (Torner 2005: 139).
What this far from exhaustive discussion of Spanish -mente shows is that the history of this suffix is far more complex than a simple correspondence might suggest. Nevertheless, correspondences play an important role in historical research, because the changes themselves often go unnoticed by members of a speech community. More often than not, a striking correspondence forms the point of departure for a historical investigation.
Another illustration of the peril of basing linguistic reconstruction on correspondences alone is the alleged diachronic link between the s-genitive in Norwegian (example (3)a), and the so-called possessor doubling construction (example (3)b), in which the possessor is marked by a following possessive pronoun:
(3) 'the house of the old man with the beard'
These two constructions are not only similar in that they express a possessive relationship; they both occur in 'group genitive' constructions (as in the examples above), and both are largely confined to animate possessors. On the basis of these similarities, it has been suggested by Fiva (1987) and Lødrup (1989) that the s-genitive is a reduced form of the (reflexive) possessive pronoun. At a superficial glance, this may seem a phonologically plausible change, and one corroborated by the functional similarities between the two constructions. However, there is abundant historical evidence to the contrary. First, the (enclitic) s-genitive has been shown to derive from the former (inflectional) genitive case (Norde 1997; Trosterud 2001), and secondly, the s-genitive is at least as old as the possessor doubling construction (Norde 2012).

The nature of grammatical change
Grammatical change is very complex, encompassing changes at several levels. For example, when a noun develops into a preposition, there can be said to have been 'a shift from noun to preposition', but that is just the categorical reanalysis involved. What Lass (1997) means by 'micro-stories' in the quote in the preceding section are the changes at different linguistic levels. For example, when English to be going to grammaticalized into the future auxiliary gonna, it did not only change category, but went through a series of changes, which I will term 'primitive changes'. Primitive changes in the case of gonna include phonological reduction, loss of inflectional properties, and semantic bleaching. These changes are not entirely independent of one another: for instance, semantic bleaching (from 'moving on foot towards a certain goal' to FUTURE) opens the door to an expansion of contexts and an increase in frequency, which in turn may result in phonological reduction. Yet they have to be examined separately, if only because they do not always occur simultaneously.
Another important observation in the functional-typological approach to grammatical change (the prevalent approach in grammaticalization studies) is that change is gradual, which may be reflected by synchronic gradience (Traugott/Trousdale 2010). When a new structure arises, the old one does not disappear at that very same instant; it continues to co-exist with the new structure, sometimes for a considerable period of time. This gradualness can be represented as follows (Hopper/Traugott 2003: 49):
(4) A > A/B (> B)
When construction A changes into B, it coexists with B, and in fact need not disappear at all (hence, the last stage is put in parentheses).
A final point to be made about the nature of grammatical change concerns directionality. In the 1990s, which saw a revived interest in grammaticalization, the view that grammatical change was unidirectional was quite widespread. This unidirectionality implied that lexical items could change into grammatical items and go on to adopt more grammatical functions, but not vice versa; in other words, there could be no degrammaticalization. Some early studies (Campbell 1991; Ramat 1992), however, provided evidence that counterdirectional change, though rare, is by no means impossible, and while the body of counterdirectional evidence grew (Norde 2009), it became increasingly recognized that unidirectionality of change is a statistical universal, not an absolute one (e.g. Haspelmath 2004: 23). This has serious implications for grammatical reconstruction, for in the absence of historical evidence, a 'less grammatical' form cannot be reconstructed as historically prior to a 'more grammatical' one, at least not with absolute certainty (for discussion of this issue see Norde 2009: 36-41).

Implications for historical linguistics
Summing up this section, diachronic linguistic research proceeds in two steps:
- Step 1: identifying correspondences;
- Step 2: identifying changes in order to
  - establish whether synchronic correspondence reflects diachronic correspondence;
  - identify the micro-changes that resulted in the correspondence.
The second step is essential: as unidirectionality is not an exceptionless principle of change, changes cannot be reconstructed with certainty. Needless to say, reconstruction may be the only method available, for instance in languages that lack written historical records. But whenever such records are available, I think they simply cannot and should not be ignored, tiresome as historical corpus linguistics may be (cf. section 4). In addition, we need to bear in mind that change is gradual, which implies that intermediate stages will also show gradience. As a result, lots of texts will have to be scrutinized in order to detect all micro-changes involved. This raises a number of methodological issues, to which I now turn.

Building a corpus
Historical linguists have the disadvantage of not having access to the competence of speakers of past stages of a language, and hence they have to rely on evidence from historical records and/or linguistic reconstruction, both of which bring along their own problems. One of those problems, raised by Janda/Joseph (2003), concerns the over-representation of high-prestige sources: "there is little we can do to change the circumstance that the texts which most often tend to be written and preserved are those which least reflect everyday speech" (Janda/Joseph 2003: 17). Citing Labov's famous study of Philadelphia English, they argue that speakers tend to be much more consistent in spontaneous speech (in Labov's study: in the realization of /æ/ in sad versus /æh/ in bad) than when reading word-lists aloud. This is probably because writing favors both conservatism and hypercorrection. In other words, the variation attested in older texts need not reflect variation in the spoken language. "Broken threads" in language history pose another challenge (Janda/Joseph 2003: 19). This is notoriously true for English, where most of the oldest records are written in the Wessex dialect spoken in the West-Saxon kingdom, which was both politically and culturally dominant at that time, whereas Modern English descends from Mercian, spoken in and around London, which became powerful in the Middle Ages. This means that there exists no uninterrupted timeline from "old" to "modern" English.
Both problems are also significant in historical texts from Sweden, on which the case studies in the next section are based, with the additional problem of two different systems of writing.
The oldest Swedish texts (circa 800-1100) are runic inscriptions. Although they may be considered a very rich source of the language of that time (more than 3000 inscriptions have been preserved), they may be difficult to interpret for two reasons. Firstly, there were only 16 runes for some 30 phonemes, and secondly, consecutive identical sounds were usually not repeated. For example, the phrase ok Guðs móðir 'and God's mother' was often carved ukusmuþir: there was only one rune for both /k/ and /g/, and this was not repeated, even though ok and Guðs are two separate words (Palm 2004: 112-113). From the 12th century, there are no Swedish sources, neither runic nor written; the first texts written in the Latin alphabet are from the 13th century. This means that there is a crucial gap in the documentation of Swedish language history. Another problem concerns the nature of the sources: the oldest manuscript texts were provincial laws (with a long oral history) and charters, written in a very different style. In the next centuries, most texts were translations (among them religious treatises, legends and courtly literature, all translated from Latin, French or German). In other words, attested differences between different texts need not (only) be chronological; they may also be due to register, style, or foreign influence (or a combination of these).

Dealing with negative evidence
Another problem with diachronic textual evidence is observed by Traugott (1989: 34):
"All claims about the order of development that are based [...] on written records and evidence from grammars and dictionaries, must be regarded with caution. As is well known, attestation is often a matter of accident. Furthermore, it does not necessarily reflect changes in the spoken language. What is significant is cumulative evidence from different but related semantic domains, and, wherever possible, from other languages, of the same order of attestation among exemplars, whatever the time lag."
Lehmann (2004: 172) similarly argues that the absence of a given form or construction does not necessarily imply that it did not exist at the time, a problem for which he coined the apt phrase "non-demonstrability of non-existence". Janda/Joseph (2003: 15) likewise note that there may be "accidental gaps in the historical record". They provide the example of Ancient Greek éor, which does not appear in written records before the fifth century AD, but must be much older than that, since it refers to a female relative of some kind and derives from PIE *swés(o)r by regular sound laws. The non-occurrence of this word in the massive body of texts from the preceding centuries is purely accidental. Unfortunately however, Janda/Joseph (2003) do not reveal how frequent this word was in documents from the fifth century (and onwards?), but it cannot have been too frequent, given that the exact meaning of the word is not even known. Hence, it may have been extremely marginal, possibly confined to a very small and non-prestigious part of the Ancient Greek speech community. Therefore I think that this is a problem which should not be overemphasized. Surely we must always be aware that non-occurrence is not tantamount to non-existence, but the relative infrequency of this phenomenon should not inhibit us from using historical data.

Case studies
In this section, I will briefly present two case studies posing two different kinds of problems one frequently encounters in historical corpus research: "quirky quotes" and "needles in the haystack". Quirky quotes are examples of constructions that are inconsistent with attested patterns of development. The question is what to do with them: dismiss them as plain errors, or try to account for them? This type of data will be discussed in section 4.1.
Needles in the haystack are of a very different kind: they are examples that are extremely difficult to trace in untagged corpora, simply because they are not orthographically salient; hence they cannot be detected by means of regular search queries. An example will be given in section 4.2.

Quirky quotes
Quirky quotes, as stated above, are data that do not conform to the expected or attested path of development. They are "the odd ones out" and can, in principle, be dealt with in two ways. We can either dismiss them as "slips of the feather", or try to explain them as changes in their own right (and if we fail, decide they were probably slips of the feather after all). In this section, I will discuss a case in which the quirky quotes eventually turned out to be highly relevant, with some serious implications for the initial hypothesis.
This case concerns the development of epistemic adverbs in the history of Swedish. These are sentence adverbs, meaning 'maybe', that originate in the univerbation of a modal verb meaning 'can' or 'may', and a main verb meaning 'happen': kanske, kanhända, måhända, törhända (Norde/Rawoens/Beijering in prep.). From the point of view of Swedish main clause syntax (Beijering 2010), these adverbs are very interesting because they may violate verb-second (V2), i.e. the syntactic rule that the finite verb always appears in second position.
In the examples below, (5)a and (5)b are V2-clauses, but in (5)c, it is the sentence adverb that appears in second position, whereas in (5)d the subject appears in second position.Example (5)e, finally, illustrates the phenomenon of insubordination (Evans 2007), in which a subordinate clause is not bound by a full matrix clause, but by an adverbial phrase.
The etymology of epistemic adverbs as deriving from an epistemic verb phrase (EpVP) is fairly uncontroversial, but note that at this point this is merely a correspondence, not a change (cf. section 2.1). A possible path of development for the adverb kanske has however been suggested by Wessén (1967). In Wessén's scenario, the development comprises five stages, which are illustrated below by Modern Swedish equivalents.
Stage I: The EpVP forms part of a full matrix clause, which is followed by a subordinate clause:
In order to test whether Wessén's scenario is reflected in historical texts, we carried out a large-scale corpus investigation into epistemic adverbs and epistemic verb phrases in the history of Swedish (Norde/Rawoens/Beijering in prep.). The corpus was about 1,668,500 words in size, comprising texts from the late 14th century to the end of the 18th century. From this corpus, we selected all instances of the sentence adverb kanske (which is the most common epistemic adverb in Swedish) as well as the infinitive forms (including their spelling variants) of the verbs meaning 'happen': ske and hända. All instances in which these main verbs combined with the modal verbs kunna 'can', må 'may' or tör 'may' were further analysed. Table 1 summarizes the total number of relevant constructions.
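A string search of the kind needed to retrieve such constructions from an untagged corpus might look roughly as follows. This is a minimal sketch, not the procedure actually used in the investigation: the function name find_candidates and the pattern are assumptions made for illustration, only the modern spellings kan, må, tör, ske and hända are covered (historical spelling variants would have to be added by hand), and at most one word is allowed between the modal and the main verb.

```python
import re

# Modern spellings only (an assumption); historical variants must be added manually.
MODALS = ["kan", "må", "tör"]
MAIN_VERBS = ["ske", "hända"]

# modal + (nothing | whitespace | one intervening word) + main verb
PATTERN = re.compile(
    r"\b(?P<modal>" + "|".join(MODALS) + r")"
    r"(?P<gap>\s*|\s+\w+\s+)"
    r"(?P<verb>" + "|".join(MAIN_VERBS) + r")\b",
    re.IGNORECASE,
)

def find_candidates(text):
    """Return (matched string, spelling type) pairs for manual inspection."""
    hits = []
    for match in PATTERN.finditer(text):
        gap = match.group("gap")
        if gap == "":
            spelling = "one word"      # e.g. kanske
        elif gap.strip() == "":
            spelling = "two words"     # e.g. kan ske
        else:
            spelling = "interrupted"   # a word stands between modal and main verb
        hits.append((match.group(0), spelling))
    return hits

print(find_candidates("så hadhe iagh nu kan ske vahrit en annan carl"))
```

Such a search only yields candidates. Whether a given hit is a matrix clause, a parenthetical or a sentence adverb cannot be read off the string and still has to be decided by reading the context.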
4 Andréasson (2002: 43) suggests that kanske in older Swedish may have had two functions: one as a fully developed sentence adverb, and one as an EpVP (orthographically identical to adverbial kanske). This phrasal kanske, then, would have been moved to a position where it could no longer be replaced by a verb phrase. To my mind however, analysing kanske as a phrase rather than an adverb does not really solve the problem of why a new non-V2 construction should arise in the first place.
Among the results of this corpus investigation were a few quirky quotes that at first seemed difficult to explain. Three of them are quoted in (12). In (12)a, kan ske is written as two words, yet it cannot be analysed as a (subjectless) matrix clause, because it is not possible to add a subject and a subordinator (compare (12)a', which is ungrammatical). It clearly functions as an adverbial, as in Wessén's Stage V above, even though the adverb was written as one word from Wessén's Stage II onwards. This might suggest that univerbation of the modal verb and the main verb had not been completed when the sentence adverbial uses arose. However, orthography was not yet standardized at that time, and it was not unusual for compounds to be written as two words. In other words, kan ske may have been a single (compound) word anyway, in spite of its spelling. A far more problematic example, however, is (12)b: in this example, univerbation cannot possibly have occurred, because the adverb wäl is inserted between the modal verb and the main verb. Example (12)c, finally, is quirky for yet another reason, because the modal verb is inflected for past tense, whereas it is assumed that the adverb kan ske arose from a construction in which the modal was in the present tense (and indeed, the vast majority of examples of epistemic VPs involve the present tense).

The examples above clearly do not fit with Wessén's scenario, but they are too frequent to be ignored and call for an explanation. The explanation Norde/Rawoens/Beijering (in prep.) suggest is that the matrix clause was not the only source construction for the sentence adverb kanske. The examples in (12), we propose, do not derive from a sentence-initial matrix clause, but from a parenthetical clause inserted in a main clause. The two source constructions are illustrated in figure 1. A is Wessén's scenario with a main clause as the source of the adverb; B is the alternative construction with a parenthetical clause as the source of the adverb. The quirky quotes, then, turn out not to be quirky at all: they are simply indicative of an alternative route to adverbhood, which might not have been discovered otherwise. Moreover, the second scenario can account for the occurrence of non-V2 constructions, such as (11)a.

A.
Det kan ske att han kommer.

Needles in the haystack
Untagged corpora can be extremely challenging for the historical linguist who wants to study morphological change. With regular software (e.g. WordSmith, Scott 2004) it is possible to search for words, strings of words or (using wildcards) parts of words, but this is usually not very useful for identifying changes in, say, inflectional morphology. Inflections are typically short, mostly monosyllabic, and often even monophonemic. Obsolescent morphology is even more problematic because it is evidently impossible to search for the absence of inflection. In this section, I will discuss a case study of changes in inflectional patterns, where these problems present themselves. The case study concerns a part of the intriguing history of the development of the s-genitive found in English, Danish, Norwegian and Swedish. Once an inflectional suffix to mark the genitive case of some masculine and neuter singular nouns, adjectives and pronouns, it is at present a once-only marker which is attached to full noun phrases. The most impressive reflections of this change are so-called 'group genitives', in which the s-genitive appears on the very right edge of an NP containing a postmodifying PP (examples (13) and (14)) or relative clause (examples (15) and (16)). Note that in group genitives, the word to which the s-genitive is attached is invariably the final one, irrespective of word class. Thus the s-genitive is attached to the object form of a personal pronoun in (14), to an adverb in (15), and to a tensed verb in (16).
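Why a plain wildcard search is of little help for a monophonemic marker can be shown with a small sketch. The function name final_s_candidates and the Modern Swedish sample sentence ('The king's house is in the town') are invented for illustration and are not taken from any corpus.

```python
import re

def final_s_candidates(text):
    """Return every word ending in -s, the only surface cue for the s-genitive."""
    return re.findall(r"\b\w+s\b", text, flags=re.IGNORECASE)

# Invented example sentence, not corpus data.
sample = "Kungens hus finns i staden"
for word in final_s_candidates(sample):
    print(word)
```

Only one of the words retrieved from the sample sentence (Kungens) is a genitive; hus is an uninflected noun and finns a verb form. In a real corpus the proportion of irrelevant hits would be far higher, which is why the string search has to be narrowed down in some other way.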
Since the s-genitive is monophonemic and not orthographically marked (unlike the English s-genitive, which is separated from its host by an apostrophe), finding examples of it is a prototypical needle-in-the-haystack task. The examples above were found in Google searches using specific words or strings of words that might be the final part of a group genitive construction. For example, many present tense forms followed by s, such as jobbars, are not homonymous with any other Swedish word form. Nevertheless, searching for jobbars in Swedish web pages yields many false positives, i.e. spelling errors.
In the absence of annotated corpora, empirical studies of the rise of the Swedish group genitive have not yet been carried out. Individual examples have been noted, the oldest example attested so far being Swen i Kleffs tompt (1452) 'Swen of Kleff's property' (Delsing 1991: 28). In a paper on the history of the Swedish group genitive (Norde 2013), I used the following method. The point of departure was the observation that group genitives involving lexicalized semantic units, such as mannen på gatan 'the man in the street' or kungen av Preussen 'the king of Prussia' (Thorell 1977: 49; Teleman et al. 1999b: 131), are the only [[NP][PP]] type of group genitive construction that is accepted in normative grammars. Moreover, in online documents group genitives appear to be preferred when the possessor is such a semantic unit. For instance, Google found 42 instances of the group genitive drottningen av Englands 'the queen of England's', as in (17)a, and only eight instances of drottningens av England 'the queen's of England', as in (17)b.
Given Delsing's (1991) observation that an [[NP][PP]] construction was the oldest group genitive he was aware of, I decided to focus on this particular type. Using sources from the late Middle Swedish and Early Modern Swedish period (covering the years 1380-1758), I generated concordances for the two prepositions that were most commonly used in this group genitive construction, to wit i 'in' (spelled <i>, <j> or <ij>) and af 'of' (<af>, <aff> or <av>), and excerpted all relevant constructions manually. These were complex NPs, consisting of a noun denoting some noble title (e.g. 'king'), optionally followed by a personal name, which forms a semantic unit with a PP consisting of a preposition plus a geographic name (e.g. 'of Denmark'). Some of these examples were group genitives, but this particular search method produced other types of [[NP][PP]] genitive constructions as well. These are exemplified in (18). In the abstract schemas, [NP] is the noble title, optionally followed by a personal name; [PP] is the prepositional phrase that modifies [NP]; X is the possessee, the head which [[NP][PP]]gen is attributive of; and subscript gen marks the position of the genitive marker(s).
It turned out that even these constructions were extremely rare in the period under consideration. As is shown in table 2, only 81 relevant constructions were attested in a corpus of 1,228,148 words. Excerpting these manually would obviously have been extremely time-consuming, but the method outlined above has the disadvantage of finding two particular construction types only. Undoubtedly, there are other group genitive constructions out there, but since their particular form is unknown, they will remain unnoticed unless one reads one's way through all texts.
Furthermore, figure 2 shows that the number of occurrences is too small to draw firm conclusions about the chronological order of the four construction types. Type 1, [[NP]genX[PP]] (as in example (18)a), is clearly the oldest pattern, as it occurs in the oldest text in the corpus, dating from the end of the 14th century. The group genitive (type 4, as in example (18)d) is the youngest, and does not occur before 1585. But it occurs only five times in the entire corpus, in the works of three authors: Per Brahe (1585), Carl Gyllenhielm (1640), and Agneta Horn (1657). It is not attested in younger texts, unlike types 1-3, which predate the group genitive. Another striking observation is that most authors in the corpus use more than one construction; Per Brahe even uses all four of them. To conclude this section, what I hope to have demonstrated with this case study is that needles in the haystack are definitely worth looking for. In spite of their relative infrequency, they may yield important information on complex changes such as the rise of the Swedish group genitive.

Conclusions
This paper started out with some notorious problems that historical linguists find themselves confronted with. Some are related to the sources themselves: the material available today is the result of "accidents of history", and native-speaker judgments are obviously not available. For these and other reasons, historical corpora do not really lend themselves to large-scale quantitative investigations. However, this is not necessarily a bad thing. The qualitative method illustrated in this paper has several advantages: it enables much more fine-grained analyses and may reveal the delicate interplay between changes at different levels (phonology, morphology, syntax, semantics, pragmatics). Finally, detailed qualitative analyses may disclose data that turn out to be crucial to a correct understanding of the changes involved: quirky quotes because they force the researcher to consider alternative pathways of

Figure 1: Source constructions of the Swedish adverb kanske 'maybe'

Figure 2: Relative frequency of construction types

Table 1: EpVPs and kanske in the corpus
'So up till now they had not seen a single enemy, perhaps they would not even get to see one'
(12) a. så hadhe iagh nu kan ske vahrit en annan carl

Table 2: Genitive constructions (Norde 2013)
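The concordance step used in the group genitive case study, in which all spelling variants of the prepositions i and af are retrieved and the hits are then excerpted by hand, can be sketched as follows. This is a minimal illustration rather than the software actually used: the names PREPOSITION_VARIANTS and concordance, the context width and the punctuation handling are assumptions; only the spelling variants themselves are taken from the text.

```python
# Spelling variants as listed in the text; everything else is illustrative.
PREPOSITION_VARIANTS = {
    "i": ["i", "j", "ij"],
    "af": ["af", "aff", "av"],
}

def concordance(text, variants, width=3):
    """Return (left context, hit, right context) triples for each variant found."""
    wanted = {variant.lower() for variant in variants}
    tokens = text.split()
    lines = []
    for idx, token in enumerate(tokens):
        # Ignore adjacent punctuation and case when comparing tokens.
        if token.strip(".,;:!?()").lower() in wanted:
            left = " ".join(tokens[max(0, idx - width):idx])
            right = " ".join(tokens[idx + 1:idx + 1 + width])
            lines.append((left, token, right))
    return lines

# The oldest attested group genitive cited above (Delsing 1991: 28).
for left, hit, right in concordance("Swen i Kleffs tompt", PREPOSITION_VARIANTS["i"]):
    print(f"{left:>20} [{hit}] {right}")
```

The concordance lines only bring each preposition into view together with its context. Deciding whether a line contains a noble title, a geographic name and a genitive marker, and where that marker sits, remains a manual task.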