Having fun with phrase structure grammars: Midsomer Murders and Beatles

This post is about phrase-structure grammars, which can be both entertaining and educational - if you're a linguistics student, this is for you. We're going to learn how to define a little set of rules for a made-up language, and then generate possible sentences in that language based on those rules. We can also use the rules to test whether a given string is grammatical in the language we've defined.

You may already be familiar with phrase structure from linguistics class, or with parsing in
programming. Regardless, this introduction is accessible to everyone - including novices.

We will first learn the basics of these little rules, and then illustrate them by generating
random plot summaries for possible episodes of the TV show Midsomer Murders (à la the Midsomer Murders Bot on Twitter), as well as Beatles lyrics.


[Image: Even Barnaby can see the templatic nature of the show.]

[Image: How many nas do we need to generate this song?]
Nearley parser
We will be using the Nearley parser, a computer program that helps parse sentences given pre-defined context-free grammar rules. Nearley was written by Kartik Chandra at Stanford and is released under the MIT license. Guillermo Webster from MIT made an online interactive environment for web browsers, where we can play with this parser, construct little context-free grammars, and see the results right away. You can play with the parser here:

http://omrelli.ug/nearley-playground/

Since the Nearley parser runs in a web browser, you won't have to install or download anything - just follow the links and copy-paste from this post. It'll be really simple and easy, and at the end we'll discuss some more real-world applications.


Please note that the interactive environment, the Nearley Playground, works best in Google Chrome. If it crashes, it may be because you made too complicated a grammar for your web browser to handle (ambiguity or recursion). If that happens, delete your cookies and restart the browser: the playground remembers what you did last time, and if whatever you did before caused a crash, it will unfortunately crash again. Best of all is to set your browser to never save cookies for this domain, or to use private browsing/incognito mode. Hopefully this won't happen, but if it does, you know what to do.


Nearley as a parser exists outside of this specific web browser environment. You can read more about it here.


Basics
We will be writing rules that take an item on the left and define its content on the right. Let's start with something really simple so that we get the hang of it, and then we can go through more details on how to write these little grammars. Rules are defined using an arrow, "->". Here's a very simple example of a set of rules that will generate just one simple sentence:
SENTENCE -> SUBJ VERB OBJECT
SUBJ -> "I "
VERB -> "love "
OBJECT -> "chocolate"
This little grammar will generate one sentence, and one only. Perhaps that’s all that needs to be said, after all?
I love chocolate
It literally only knows these three words and it only knows to string them together in this particular order. It can't even say "Chocolate I love".

This grammar says that a string in this little language needs to have something that is a SUBJ, then a VERB, and then an OBJECT (in that exact order). It then defines what those things are. Words in this system need to be within quotation marks. We can think of the things in CAPITALS as meta-categories, or labels, and the things between quotation marks as the actual lexicon. The lexicon is what will actually make up the language, and the things in CAPITALS are the rules that govern how the lexicon is combined.

In this post, we will use capital letters for labels, but that is not strictly necessary for the grammar to work. You can name things whatever you want. For example, this would generate exactly the same language and sentences:

BIGGESTBLOB -> BLAB BLOB BLEB
BLAB -> "I "
BLOB -> "love "
BLEB -> "chocolate"


Output:


I love chocolate


It’s helpful to you and to others who might read your code if you name things sensibly, but it’s not technically required by the system itself. The above little BLOB-grammar is just as formally good as the other.

If you are doing this as a linguist or linguistics student, you'll of course need to conform to whatever notational conventions your framework has or your teacher requires. Don't hand in homework with "BLOB" instead of "VP".

The items that make up the actual lexicon are known as "terminals" and the others as "non-terminals". The terminals can be thought of as the "end stations": they don't themselves refer to anything else - they are the actual final output. The non-terminals, on the other hand, never end up in the output; they just define other things. Here are the terminals and non-terminals of the examples above:

Non-terminals: SENTENCE, SUBJ, VERB & OBJECT (or BIGGESTBLOB, BLAB, BLOB & BLEB)
Terminals: "I", "love" & "chocolate"
When using a context-free grammar to generate language, the terminals are going to be words, phonemes and/or morphemes.
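For instance, here's a tiny sketch of my own (not one of the examples above) where the terminals are morphemes rather than whole words - this grammar generates just the single word "walked":

```
WORD -> STEM SUFFIX
STEM -> "walk"
SUFFIX -> "ed"
```

Here WORD, STEM and SUFFIX are the non-terminals, and "walk" and "ed" are the terminals.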


Basics of online interface
You can either tell the set of rules to generate language, or you can ask it if a particular string is correct or not (according to the rules you just defined).
  • you can write a little set of rules and items under "Basic grammar" to the left
  • you generate strings by clicking "generate" on the lower right
  • you can test if strings are correct or not by "add test" to the right
  • remember to delete cookies and reboot if it crashes.
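For example, you can paste the chocolate grammar from earlier into the "Basic grammar" box and then try some tests (note that the spaces in a test string have to match the spaces in the terminals exactly):

```
SENTENCE -> SUBJ VERB OBJECT
SUBJ -> "I "
VERB -> "love "
OBJECT -> "chocolate"
```

With this grammar loaded, the test "I love chocolate" passes, while "Chocolate I love" fails - the rules only license that one word order.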
Meta-comments in the code
We can make our little grammar a bit clearer, and separate out the rules (non-terminals) from the lexicon (terminals), using comment lines, denoted by the hash symbol. Everything on a line after the hash is ignored by the program as it runs the code; it's meta-commentary for the reader of the script and will not end up in the output.


Here is an example of a slightly longer grammar with meta-commentary. In this language, we have separated out the kinds of pronouns that can be objects from those that are subjects, and we are using | to mean "or".
#RULES
SENTENCE -> SUBJ PREDICATE OBJECT
SUBJ -> NP_S
PREDICATE-> VERB
OBJECT -> NP_O
NP_S -> PRONOUN_S | N
NP_O -> PRONOUN_O | N
#LEXICON
N -> "chocolate" | "pandas"
PRONOUN_S -> "They " | "You "
PRONOUN_O -> "me" | "you"
VERB -> "love "
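If you paste this grammar into the playground and hit "generate", you'll get strings like:

```
They love me
You love chocolate
They love pandas
```

(Notice that if N is chosen as the subject, the output will lack a space - "pandaslove you" - since spaces only exist inside the terminals that contain them. More on spaces later.)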
This is called "commenting out" lines, i.e. making it so that they will not be understood as actual code. Basically all programming languages have this. R and Python even use #, just like Nearley; LaTeX, on the other hand, uses "%". Remember not to leave too much "crap" in your comments - if you share the code or make a publication out of it, you don't want unflattering bits of commentary lying around.

Some more notation
When we list alternatives, we use "|" to mean "or". When we want to denote that something is a terminal - an actual string - we put it in quotation marks (straight quotation marks, not "smart"/curly ones). When we want to instruct the grammar to repeat an item, we use regex-style quantifiers. See the table below for the basic symbols you'll need.

Nearley cheat sheet symbols
:+  - one or more of whatever comes before. Example: "cake ":+ gives "cake ", "cake cake ", "cake cake cake ", ...
:?  - zero or one of whatever comes before. Example: "cake":? gives "" or "cake".
:*  - zero or more of whatever comes before. Example: "cake":* gives "", "cake", "cake cake cake cake", ...
|   - or. Example: "cake" | "muffin" gives "cake" or "muffin".
""  - whatever is in between is a string (a terminal). Example: "cake" gives "cake".
()  - grouping of non-terminals. Example: Line -> (H H):? with H -> "hey" | "ho" gives "", "heyho", "heyhey", ...
\n  - line break. Example: "Hello \n Hi" gives "Hello" and "Hi" on two lines.
.   - any character (but whitespace). Example: . gives "N" or "<" or "x".
#   - comment. Example: NP -> "word" #this is a comment gives "word".
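To see several of these symbols working together, here's a little sketch of my own that you can paste into the playground:

```
#one or more shouts, optionally followed by a closing line
CHANT -> SHOUT:+ CLOSING:?
SHOUT -> "hey " | "ho "
CLOSING -> "let's go"
```

This can generate, for example, "hey ", "ho hey ", or "hey ho let's go".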


Differences between traditional ways of writing phrase structure rules and Nearley
If you have learned about phrase structure rules before, chances are this notation looks a bit unfamiliar to you. That is because Nearley is more similar to programming, whereas the traditional conventions linguists use are more similar to ordinary writing. Chances are you have seen notation something like S → NP (Aux) VP, with parentheses marking optional elements.

Don't worry! The differences between the conventional linguist way and Nearley are few. Let's go through them.


The main differences lie in optionality and listing alternatives. Alternatives in Nearley are denoted by the pipe symbol "|", which means "or":

Rule: NP -> "bird" | "book"
Output 1: "bird"
Output 2: "book"

If something is optional ("occurs once or never"), Nearley marks it with ":?" - a quantifier meaning that whatever comes before it occurs 0 or 1 times. There are other quantifiers too; the cheat sheet above lists all of them.
The quantifier pertains to whatever unit comes immediately before it. If that unit is a terminal string, the string needs to be surrounded by quotation marks; if it's a non-terminal, that isn't necessary.

Example: lyrics -> line:+ with line -> "na"
Output: "na", "nana", "nananana", etc.

Example: lyrics -> "na":+
Output: "na", "nana", "nananana", etc.

Example: lyrics -> "n" "a":+
Output: "na", "naaa", "naaaaaa", etc.
These conventions - the pipe symbol "|" for "or" and the quantifiers (* + ?) - are really common for doing text searches in many programming languages. It's the same in Python, R, etc. They're good to learn if you want to do anything quantitative with text, regardless of discipline, and very handy in linguistics when you want to search through corpora, databases or texts.

The last thing that is different is that Nearley will not put spaces between things unless you explicitly tell it to. You can solve that either by putting spaces inside your terminals, or by defining a non-terminal as a space. More on this later.

[Table: summary of the differences between the conventional notation and Nearley]

[Image: the same grammar and lexicon as above, in the conventional linguist notation and in Nearley]
Spaces
Spaces will not be inserted into the output unless you specify them, either with a non-terminal or within the strings. In the examples so far, we've put spaces inside the terminals themselves so that the output becomes sensible; without them, a sentence would come out as something like "Marysleptalittlelamb". Another way of doing this is to define a terminal that is a space. In the example below, we use an underscore to signify space.

SENTENCE -> NP _ V _ NP
NP -> "Girls"|"pandas"
V -> "love"
_ -> " "

Generates:

Girls love pandas

Advanced: recursion and ambiguity
It is easiest if you make your grammars unambiguous and non-recursive. That's kinder to your browser and processor, and it will probably make for a more readable grammar. Nearley can handle recursion and ambiguity, but they can cause trouble, so it's easier to avoid them where possible, particularly in the beginning. If you want to read more about that, go here.

Examples
These are the basics you need - now we can get going creating some fun things! Did you ever play "Mad Libs" in school? It's a game where you take a template and fill it in with words, and have lots of fun (apparently). There's a website here where kids can do this; it's a trick teachers use to teach word classes to their students.

What we're going to be doing with the Nearley grammar, Midsomer Murders and the Beatles is actually not that dissimilar: we're going to keep some things fixed, and then rotate over other items to create new unique lines.

Example: generating Beatles lyrics
Let’s consider the song "Hey Jude" by the Beatles. It was written by Paul McCartney in 1968 and was at the time the longest single ever to top the British charts. The single has sold approximately eight million copies and is frequently included on professional critics' lists of the greatest songs of all time. In 2013, Billboard named it the 10th "biggest" song of all time.


[Image: hey-jude-flow-chart.png - a flowchart of the "Hey Jude" lyrics]

The song's original title was "Hey Jules", and it was intended to comfort Julian Lennon during the stress of his parents' divorce. McCartney later said, "I knew it was not going to be easy for him", and that he changed the name to "Jude" "because I thought that sounded a bit better".


The song features quite simple lyrics - someone even drew a flowchart to represent them. If we wanted to render the "na"s of this song, we'd want either one or more than one (potentially an unlimited number). This is expressed like this:
"na":+


In this example we're going to mix terminals and non-terminals. The terminals will be the fixed elements, and the non-terminals will determine what we fill the "gaps" with. We can generate any line of the song, according to this flowchart, like this:


SONGVERSE -> "Hey Jude, don’t " LINE1 ". Remember to " LINE2 " then you " LINE3 " to make it better. Better better better better better waaaaaa" NA
LINE1 -> "make it bad, take a sad song and make it better" | "be afraid, you were made to go out and get her" | "let me down, you have found her, now go and get her"
LINE2 -> "let her into your heart" | "let her under your skin"
LINE3 -> "can start" | "begin"
NA -> " na":+


Copy and paste this bit of code into the Nearley Playground, click "generate", and off we go!


Example: generating the summary of a Midsomer Murders plot
Let's do another example. This time, we're going to try to impersonate the hilarious "Midsomer Murders Bot" on Twitter. The Midsomer Murders TV show has been running since 1997, telling the stories of a seemingly never-ending stream of murders in the English countryside. The show is known for being a bit silly sometimes, and also a bit templatic. The little grammar below will produce funny plot summaries of the popular British murder mystery show, à la the bot on Twitter.


MIDSOMER_MURDERS_PLOT -> LOCAL PROFESSION FOUND DEAD PLACE SUSPICION SUSPECT ANGRY BLAMED THREAT THREATENED
LOCAL -> "A local "
PROFESSION -> "linguist" | "philosopher" | "novelist"
FOUND -> " is found "
DEAD -> "drowned" | "strangled" | "dead" | "hanged" | "battered" | "suffocated" | "shot"
PLACE -> " in the coffee shop." |" in the swimming pool." | " after band practice."| " behind the primary school."
SUSPICION -> " Suspicion falls on the village "
SUSPECT -> "baker" | "pastor" | "mailman" | "florist" | "nerd" | "twins"
ANGRY -> ", angry that the "
BLAMED -> "new wind farm" | "pig" | "pub" | "decline in newspaper reading" | "metric system"
THREAT -> " might threaten "
THREATENED -> "the village fabric." | "the Old Inn." | "the cow farm." | "the annual Full Moon party." | "what little sexual tension the village has left."
This gives us, for example:


"A local linguist is found battered in the coffee shop. Suspicion falls on the village nerd, angry that the pub might threaten what little sexual tension the village has left."
or
"A local novelist is found shot after band practice. Suspicion falls on the village mailman, angry that the pub might threaten the Old Inn."
or
"A local philosopher is found dead after band practice. Suspicion falls on the village florist, angry that the decline in newspaper reading might threaten the village fabric."
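You can also combine this with the quantifiers from the cheat sheet. For instance, here's a hypothetical tweak of my own (not from the bot) that gives the victim an optional epithet by redefining PROFESSION:

```
PROFESSION -> EPITHET:? JOB
EPITHET -> "retired " | "disgraced " | "beloved "
JOB -> "linguist" | "philosopher" | "novelist"
```

Now the plots will sometimes open with "A local disgraced novelist is found...", and sometimes with the plain professions from before.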
Just in case you think I'm being silly, here are some more scenes from the show:



Yes, this is a real show and it is brilliant.




Final notes
Finally, there are just a few more things that are good to know before continuing.


  • The web interface where you write Nearley gives non-terminals and terminals different text colors. This is just to be helpful. Note that if you put a quantifier after a non-terminal, it may get the same color as a terminal.
  • Terminals are strings, and should hence be surrounded by "". Make sure you get the right kind of quotation marks - sometimes they get changed into so-called "smart"/curly quotation marks *shudders*.
  • Rules need to be written top-down. The first rule is the top of the grammar, and non-terminals that refer to other things need to occur before the things they refer to - sort of like a hierarchy from the most abstract units all the way down to the most concrete. In our earlier example, the rules need to be in this order:

  1. SENTENCE -> SUBJ VERB OBJECT
  2. SUBJ -> "I "
  3. VERB -> "love "
  4. OBJECT -> "chocolate"

This order wouldn’t work:
  1. SUBJ -> "I "
  2. VERB -> "love "
  3. OBJECT -> "chocolate"
  4. SENTENCE -> SUBJ VERB OBJECT

The things we have covered now are the basics you need to get started - you're now able to write some grammars of your own and fool around. Thanks to Kartik Chandra for writing the parser, and to Guillermo Webster for making the playground. Next time, we're going to continue with real-world examples from actual languages: Samoan and Arabana!


Test
You said you wanted a test, did you? OK, sure. That's unusual... but luckily I had one prepared!


Consider this little grammar and lexicon:
#Grammar
DISNEY_MOVIE_TITLE -> NP CONJ NP PREP PLACE
NP -> ADJ N


#Lexicon
ADJ -> "Pretty " | "Little " | "Sweet " | "Sad " | "Honest "
N -> "Prince" | "Princess" | "Frog" | "Dog" | "Pumpkin"
PREP -> " in" | " outside of" | " behind" | " near" | " at"
CONJ -> " or " | " and "
PLACE -> " Colorado" | " Sea World" | " Disneyland" | " Target"

Are the strings below grammatical or not, according to the rules above?

A. Pretty Prince or Honest Frog in Colorado
B. Honest Frog near Sea world
C. Sweet and Sad Dog at Target
D. Honest Frog and Sweet Pumpkin near Sea World
E. Prince or Frog in Colorado
F. Sweet Dog and Sad Dog at Target
G. Frog in Colorado

Answers:
A = T, B = F, C = F, D = T, E = F, F = T, G = F.

EDIT: That was fun, making rules and a lexicon. But you might want to draw trees of your sentences too, no? Unfortunately, I don't know of an app that does both. Gothenburg University used to have one, but it is no longer active. But! We can go here and draw specific trees nicely. That's a bit helpful, if not a full solution.

