Investigating Syntactic Variation World-Wide

From 2014–2018 I did my Ph.D. in quantitative linguistics. Here’s what it was about.

July 16, 2022

You’re probably reading this because I sent you here after you asked me about my Ph.D. With this in mind, I will keep this as accessible as I can. However, if you prefer a more academic account, check out my article in the Journal of English Linguistics.

So, my Ph.D. was about probabilistic grammar. You might not have heard of it. So let me give you an example.

But wait.

Before I can explain what probabilistic grammar is, you have to know what an alternation is. You’re looking at an alternation whenever there is more than one way of saying the same thing. For instance, you could pronounce the r-sound at the end of “car” or you could leave it out—both utterances still mean the same thing. This alternation is phonetic, but there are alternations on all levels of language—phonetic, lexical, syntactic, etc.

In my Ph.D., I looked at a syntactic alternation, the English genitive alternation. This is the choice between the s-genitive and the of-genitive. For example, you can say “my best friend’s car” or “the car of my best friend”—again, both mean the same.

Now back to probabilistic grammar.

Probabilistic grammar assumes that grammar is not categorical but gradual. This is not obvious because grammar books often tell us otherwise. Look at this example from English Grammar in Use. The book says that

  • “We use -’s […] mostly for people or animals” and
  • “For things […], we normally use of” (Figure 1).


Figure 1: Categorical rules about the English genitive alternation in English Grammar in Use. It frames animacy as a fixed rule.

However, if we look at the book’s cover, it says, “the world’s best-selling grammar book” and not—according to its own prescription—“the best-selling grammar book of the world” (Figure 2).


Figure 2: By using “The world’s best-selling grammar book”, the cover of English Grammar in Use contradicts its own categorical rules on genitive use.

This illustrates that the animacy rule (“people or animals” vs. “things”) is not as categorical as the book might make you think, and also that animacy alone does not fully explain the alternation. In fact, animacy is an essential factor, but there are many others that have a significant effect on genitive choice.

Here are just two examples:

  1. Constituent length. Remember the example from above, my best friend’s car. If the so-called possessor, my best friend, gets longer, the of-genitive gets more and more probable. So if it were my incredibly beautiful and smart best friend, people would be more likely than before to say the car of my incredibly beautiful and smart best friend. This tendency to place longer constituents last can be observed not just with genitives but in many linguistic situations. One compelling explanation is the Easy First Principle, which states that we tend to place easier constituents first because they are cognitively readily available.

  2. Final sibilancy. A sibilant is a hissing sound, such as “s” or “sh”. If a possessor ends in such a sound (e.g., Mercedes), the s-genitive is less likely because in constructions like My Mercedes’s headlights, this very sibilant and the “s” from the genitive marker collide. This, again, is not a genitive-specific effect. Often when two similar things are close to each other, it sounds weird. This is sometimes called the Horror Aequi Effect.

So, there are multiple significant factors (and I think I found more than ten in my analyses), and they are additive. If there is both a long constituent and a final sibilant, their effects add up. Still, the alternation is not categorical, so there is no single factor or combination of factors that guarantees a certain realization or excludes another.

This is probabilistic grammar.

But if grammar is probabilistic, how do we describe it best?

If you know a bit about statistics, you might know that categorical choices like the genitive alternation can be modeled with classification models such as random forests or logistic regression. Regression models in particular yield coefficients that describe each factor’s (animacy, constituent length, final sibilancy, etc.) individual contribution while controlling for all others. Grammar can thus be described with a formula that contains a bunch of coefficients.
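To make the idea of "a formula with a bunch of coefficients" a bit more tangible, here is a minimal sketch of what a logistic regression formula for genitive choice could look like. The coefficient values and the function name are made up for illustration and are not the values from my analyses. The sketch also assumes that the factors add up on the log-odds scale, which is how logistic regression works.

```python
import math

# Hypothetical coefficients on the log-odds scale (NOT the values from the study).
# Positive values favor the s-genitive, negative values favor the of-genitive.
COEFFICIENTS = {
    "intercept": -1.0,
    "animate_possessor": 2.0,
    "possessor_length_words": -0.4,
    "final_sibilant": -0.8,
}


def s_genitive_probability(
    animate: bool, possessor_length_words: int, final_sibilant: bool
) -> float:
    """Return the modeled probability of choosing the s-genitive."""
    # Each factor contributes its own term; the contributions simply add up.
    log_odds = (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["animate_possessor"] * animate
        + COEFFICIENTS["possessor_length_words"] * possessor_length_words
        + COEFFICIENTS["final_sibilant"] * final_sibilant
    )
    # The logistic function maps any log-odds value to a probability
    # strictly between 0 and 1, so no outcome is ever guaranteed or excluded.
    return 1.0 / (1.0 + math.exp(-log_odds))


# "my best friend" (3 words) vs. a 7-word possessor, with and without a final sibilant.
short = s_genitive_probability(True, 3, False)
long_ = s_genitive_probability(True, 7, False)
long_sibilant = s_genitive_probability(True, 7, True)
print(f"{short:.2f} {long_:.2f} {long_sibilant:.2f}")
```

With these invented numbers, a longer possessor lowers the probability of the s-genitive, and a final sibilant lowers it further, but the probability never reaches exactly 0 or 1.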

And this is an important implication of probabilistic grammar for linguistic theory: In this framework, we can not only describe grammar but also people’s knowledge of grammar as something comparable to the coefficients of a statistical model.

But back to my research.

Smart people have discovered that probabilistic grammar can vary across speech communities. In other words, there is not just one grammar, but there can be many. So, the coefficients of a grammatical model as described above can vary across communities (this could be cities, professions, countries—basically any group).

In my research, I investigated two things: (1) how stable grammar is across countries, and (2) in case there is variation, whether it is dependent on the linguistic history of the countries.

And I found that the English genitive grammar is quite stable. None of the factors (animacy, constituent length, final sibilancy, etc.) ever changes its effect direction across countries. For example, animacy always makes the s-genitive more probable. Still, the strength of the effects varies from country to country.
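In terms of model coefficients, "stable direction, varying strength" can be sketched like this. The variety names and numbers are placeholders that I invented for this illustration, and the helper functions are hypothetical, not part of my actual analysis.

```python
# Hypothetical animacy coefficients per variety (illustrative only, not study results).
ANIMACY_COEFFICIENTS = {
    "Variety A": 2.1,
    "Variety B": 1.4,
    "Variety C": 0.9,
}


def same_direction(coefficients: dict[str, float]) -> bool:
    """True if all coefficients are positive or all are negative."""
    values = list(coefficients.values())
    return all(v > 0 for v in values) or all(v < 0 for v in values)


def strength_range(coefficients: dict[str, float]) -> float:
    """Difference between the strongest and the weakest effect."""
    values = [abs(v) for v in coefficients.values()]
    return max(values) - min(values)


print(same_direction(ANIMACY_COEFFICIENTS))
print(strength_range(ANIMACY_COEFFICIENTS))
```

The first check corresponds to the bowl (the effect never flips), the second to the ball moving around inside it.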

To visualize this, I chose the image that you can see on the cover of my book (Figure 3). It shows a ball in a bowl, floating on water. This ball shows the stability of English genitive grammar. It moves around (varies), sometimes even quite erratically, but its movement (variation) is confined to the bowl.


Figure 3: My book cover with an illustration of my main finding: English genitive grammar (symbolized by the ball in the bowl on the water) varies, but its variation is confined.

Regarding my second research question (which was: if there is variation, is it dependent on socio-historic factors?), I found that the differences in the strength of the animacy constraint do indeed correlate with the type of variety that is spoken in the respective countries.

Here’s what I mean by “type of variety.” There is a crucial (albeit crude) distinction between Inner Circle and Outer Circle varieties. Inner Circle varieties are those varieties of English that are spoken in countries where English is the dominant language. This is the case in England, Ireland, or New Zealand. Outer Circle varieties, on the other hand, are spoken in countries such as India, Jamaica, or Singapore, where other languages are also very important.

I found that the animacy constraint is stronger in Inner Circle varieties than in Outer Circle varieties. As Figure 4 shows, the two variety types differ considerably. When animacy is present, the probability of s-genitive use is higher than 30% in all Inner Circle varieties, while it is below 30% in most Outer Circle varieties. Singapore is an exception here. Formally an Outer Circle variety, it is close to the group of Inner Circle varieties. However, the literature on post-colonial Englishes suggests that Singapore English is on its way to becoming an Inner Circle variety, which is very much in line with these findings.

Figure 4: This visualization of probabilistic differences in genitive choice shows that animacy has a stronger effect in Inner Circle varieties than it does in Outer Circle varieties. Clicking on “Show L1 Tendency” adds the dimension of genitive order in local languages. It reveals one of my unexpected findings.

Have you noticed the button in Figure 4? Clicking it adds another dimension: the tendency toward s-genitive order in the local languages. For the Inner Circle countries England, New Zealand, and Ireland, it is close to 0% because there is not much of a tendency in either direction. The most prevalent language spoken in these countries is English, which allows both orders. In Canada, however, there’s also French, and French uses the of-genitive order only. Considering the number of French speakers in Canada, this puts the country close to –20%, which is a tendency toward the of-genitive.

We can see that the tendency of genitive order in local languages and the strength of the animacy constraint are negatively correlated. The more the local languages tend toward s-genitive use, the less the s-genitive is triggered by animacy. (Notice that the Philippines is an outlier here; there are good reasons for this, which I will not discuss.) This is an unexpected finding because previous research found that tendencies in local languages sometimes carry over to English. I interpreted this as follows: Genitives are used in a way that maximizes the distinction from the local languages. This might serve the purpose of social distinction because English is usually associated with job opportunities and higher social status, and language users can show distinction best by using language in a way that is maximally different from the tendencies in local languages.
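If you are wondering what "negatively correlated" means in practice, here is a small sketch that computes a Pearson correlation between two lists of values. The numbers are invented placeholders, not the data behind Figure 4, and Pearson's r is simply one common choice that I am assuming here for illustration.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical values, one pair per country (illustrative only).
# First list: tendency toward s-genitive order in the local languages.
# Second list: probability of the s-genitive when the possessor is animate.
l1_tendency = [-0.2, 0.0, 0.3, 0.6]
animacy_strength = [0.45, 0.40, 0.28, 0.20]

r = pearson_r(l1_tendency, animacy_strength)
print(f"r = {r:.2f}")
```

A value of r below zero means that higher values in one list tend to go with lower values in the other, which is the pattern described above.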

So, grammar is probabilistic and might change over time, and one motivation that governs this change might be social distinction. If I had stayed in academia, I would have continued researching this possibility.

Anyway, that was it, and I hope this gave you an overview of my Ph.D.

But before I stop, I must add that this post is incomplete without referring to my supervisor Benedikt Szmrecsanyi, without whose guidance and support I could not have done this project. I also want to give credit to Stefan Gries, from whom I learned a lot about statistics and data analysis.
