Parsing Tutorial Part One

Prerequisites: You should have a basic working knowledge of JavaScript and HTML and have read about getting started.

All code listings are included as scripts for this web page, which you may examine and interact with through your browser's developer tools.

The ISHML parser is a recursive descent parser with backtracking, which is capable of returning multiple interpretations of an input text. This tutorial provides an introduction to parsing with ISHML.

At the heart of it, parsing is about breaking a larger piece of text into smaller pieces and organizing those pieces so that they can be used by an application. The smallest possible unit of meaningful text is called a lexeme. A lexeme is often a single word, but might be as small as a single character, or it might be a longer string containing spaces. A lexeme along with its definition is called a token. A syntax tree, also sometimes called a parse tree, organizes the tokens into a hierarchical structure according to a set of rules called a grammar. If this sounds like diagramming a sentence, you have the right idea.

Analyzing text requires both syntactic analysis and semantic analysis. Syntactic analysis applies grammar rules to generate a syntax tree. Semantic analysis modifies the branches of the syntax tree to change results that are syntactically correct but not meaningful into something meaningful, or to discard non-meaningful alternatives.

Now that we have a general notion of parsing, let's look at parsing in the context of the ISHML API. The API exposes its methods and properties via a global variable, ishml. This tutorial will examine four ishml objects: .Lexicon, .Rule, .Parser, and .Interpretation.

The Lexicon

Let's start by defining a collection of tokens to match against the input text. This is done by creating a lexicon and registering some tokens. Think of the lexicon as a dictionary. Each entry in a dictionary consists of a word and one or more definitions. Likewise, each entry in the lexicon is a lexeme paired with one or more definitions. Just as synonyms are allowed in a dictionary, multiple lexemes may share the same definition.

//Code Listing 1: Defining a lexicon; registering tokens.
var lexicon = ishml.Lexicon()
lexicon
    .register("the", "a", "an").as({ part: "article" })
    .register("take", "steal", "grab")
        .as({ key: "take", part: "verb", prepositions: ["to", "from"] })
    .register("drop", "leave").as({ key: "drop", part: "verb", prepositions: [] })
    .register("ring").as({ key: "ring", part: "noun", role: "thing" })
    .register("slipper").as({ key: "slipper", part: "noun", role: "thing" })
    .register("diamond").as({ key: "ring", part: "adjective", role: "thing" })
    .register("diamond jim").as({ key: "jim", part: "noun", role: "npc" })
    .register("jim").as({ key: "james", part: "noun", role: "npc" })
    .register("ruby").as({ key: "ring", part: "adjective", role: "thing" })
    .register("ruby").as({ key: "ruby", part: "noun", role: "thing" })
    .register("ruby").as({ key: "slipper", part: "adjective", role: "thing" })
    .register("glass").as({ key: "slipper", part: "adjective", role: "thing" })
    .register("glass").as({ key: "tumbler", part: "noun", role: "thing" })
    .register("looking glass").as({ key: "mirror", part: "noun", role: "thing" })
    .register("good looking").as({ key: "tumbler", part: "adjective", role: "thing" })
    .register("good").as({ key: "mirror", part: "adjective", role: "thing" })
    .register("tumbler").as({ key: "tumbler", part: "noun", role: "thing" })
    .register("ruby").as({ key: "miss_ruby", part: "noun", role: "npc" })
    .register("sam").as({ key: "sam", part: "noun", role: "npc" })
    .register("from").as({ key: "from", part: "preposition" })
    .register("to").as({ key: "to", part: "preposition" })

Study Code Listing 1. The first line creates an instance of ishml.Lexicon. Notice that the new operator is omitted. The new operator is optional for most ISHML constructors.

To add a token to the lexicon, use the .register() method. The .register() method takes one or more lexemes as its arguments and returns the .as() function. The argument passed to .as() is the definition of the lexeme(s) and may be anything: a simple value or a complex object. The .as() function returns the lexicon instance so that you can chain further calls to .register().

The definition should be thought of as an arbitrary payload to be retrieved when the lexeme is matched. The definitions in Code Listing 1 are highly simplified. In practice, a definition will typically hold one or more references to objects and functions from your application.
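
For example, a verb's definition might carry a reference to the handler function that will eventually carry out the command, and a noun's definition might point to the game object it names. The following is a minimal sketch, separate from Code Listing 1; the takeItem function and rubySlipper object are hypothetical stand-ins for objects in your own application.

//A sketch only-- takeItem and rubySlipper are hypothetical application objects.
var takeItem = (actor, item) => { /* move item into actor's inventory */ }
var rubySlipper = { name: "ruby slipper", location: "ballroom" }

var gameLexicon = ishml.Lexicon()
gameLexicon
    .register("take", "steal", "grab").as({ key: "take", part: "verb", handler: takeItem })
    .register("slipper").as({ key: "slipper", part: "noun", referent: rubySlipper })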

Knowing how to register tokens is all you need to know about the lexicon for most parsing applications. However, the API documentation covers many additional features for more advanced uses of the lexicon.

Referring to Code Listing 1, answer the following questions:

  1. What is the definition for "take"?
  2. What other lexemes share the same definition as "take"?
  3. How many definitions does "ruby" have?
  4. Open your web browser's developer tools. A lexicon identical to Code Listing 1 was created when this page was loaded. Register a token in the lexicon by entering a statement in the command console. What is returned? Why? What are the requirements for the argument passed in .as()?

Answers:

  1. The definition of "take" is { key: "take", part: "verb", prepositions: ["to", "from"] }.
  2. "steal" and "grab" share the same definition as "take".
  3. "ruby" has four definitions (ruby, ring, slipper, and Ms Ruby.)
  4. Entering lexicon.register("banana").as({food:"fruit",name:"banana"}) returns the lexicon so that methods may be chained. The argument of .as() is the definition of the token and may be a simple value or a complex object. It is the arbitrary payload of the token.
//How can we structure the input to look like this?
command //root of syntax tree (JavaScript object)
command.verb //token for "take"
command.nounPhrase //non-terminal
command.nounPhrase.article // token for "the"
command.nounPhrase.adjectives // token for "ruby"
command.nounPhrase.noun // token for "slipper"
Figure 1— Command: Verb, NounPhrase
Figure 2— Noun Phrase: Article, Adjective, Noun

The Grammar

We created tokens in the lexicon, but we also need to define the structure of the syntax tree to be returned after parsing. In ISHML, a plain JavaScript object represents the root of the syntax tree. The object's properties are the child nodes of the tree. Child nodes may have children of their own. In other words, the value of a property may itself be a plain JavaScript object if the property represents a branch of the tree (a non-terminal node). All branches of the tree eventually end with the actual token(s) in the terminal nodes.

Let's design a syntax tree for a simple command like "take the ruby slipper." First, we snip that command into a verb followed by a noun phrase. Then, we snip the noun phrase into article, adjectives, and noun.
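
Sketched as a plain JavaScript object literal, the tree we are aiming for looks roughly like the following. The bare strings are placeholders; after parsing, the terminal properties will hold tokens rather than strings.

//A sketch of the desired syntax tree for "take the ruby slipper".
var desiredTree =
    {
        verb: "take",                 //terminal node
        nounPhrase:                   //non-terminal node (a branch)
            {
                article: "the",       //terminal node
                adjectives: ["ruby"], //terminal node; an array, since several adjectives may appear
                noun: "slipper"       //terminal node
            }
    }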

Now that we know what we are aiming for, we can design the grammar. A grammar is a set of nested rules that describes the syntax tree to be generated during parsing. The structure of the grammar mirrors the structure of the syntax tree. Rules are, in spirit, a JavaScript adaptation of EBNF notation.

//Code Listing 2

//Create a set of nested rules which mirror the desired syntax tree.
var command = ishml.Rule()
command.snip("verb").snip("nounPhrase")
command.nounPhrase.snip("article").snip("adjectives").snip("noun")

//Configure behavior of some of the rules with .configure().

command.verb.configure({ filter: (definition) => definition.part === "verb" })

command.nounPhrase.article
    .configure({ minimum: 0, filter: (definition) => definition.part === "article" })

command.nounPhrase.adjectives
    .configure(
        {
            minimum: 0, maximum: Infinity,
            filter: (definition) => definition.part === "adjective"
        })

//Alternatively, the rule's options may be set directly.
command.nounPhrase.noun.filter = (definition) => definition.part === "noun"

Study Code Listing 2. It defines a variable, command, as an ishml.Rule. This is the root rule. The .snip() method creates two sub-rules, verb and nounPhrase. These are also instances of ishml.Rule. The command.nounPhrase rule is further subdivided into article, adjectives, and noun rules.

Now that we have a set of nested rules that mirrors the basic structure of the syntax tree we want, we can configure the rules. By default, rules return exactly one instance of a branch or token for a given interpretation. In the case of tokens, the default is to return the next token matching the head of the input text.

The default behavior of a rule may be modified with the .configure() method. The .configure() method takes an options object as its argument. The options object is a plain JavaScript object with properties that are the same as the non-enumerable properties of ishml.Rule.

We'll look at just a few of the available options; check the API documentation for the rest. Let's start with filter. Recall that a single lexeme may have multiple definitions in the lexicon. When a token is retrieved from the lexicon, all of its definitions are exposed as an array in the token's .definitions property. We use the filter option to specify a function that returns true for valid definitions and false for invalid definitions, filtering out the definitions that do not apply in the context of the grammar rule. If the filter returns false for every definition of the token, the token is undefined in the context of the rule and the current interpretation is not valid.
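
As a concrete illustration, consider the lexeme "glass" from Code Listing 1, which carries both an adjective definition and a noun definition. Assuming .configure() simply sets the corresponding properties on the rule, as the direct assignment at the end of Code Listing 2 suggests, the filters can be exercised by hand:

//"glass" has two definitions in the lexicon (Code Listing 1).
var glassAsAdjective = { key: "slipper", part: "adjective", role: "thing" }
var glassAsNoun = { key: "tumbler", part: "noun", role: "thing" }

//The noun rule keeps only the noun definition...
command.nounPhrase.noun.filter(glassAsNoun)            //true
command.nounPhrase.noun.filter(glassAsAdjective)       //false

//...while the adjectives rule keeps only the adjective definition.
command.nounPhrase.adjectives.filter(glassAsAdjective) //true
command.nounPhrase.adjectives.filter(glassAsNoun)      //false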

The minimum option specifies the minimum number of times to repeat the rule when parsing the input. It defaults to 1. Set minimum to 0 to make the rule optional. The maximum option specifies the maximum number of times to repeat a rule. It also defaults to 1. Set it to Infinity to repeat the rule an indefinite number of times. If the parser is able to repeat a rule any number of times between the minimum and the maximum, inclusive, the rule is deemed successful and the interpretation is accepted as valid.

Referring to Code Listing 2, answer the following questions:

  1. Which rule is the root rule? Why?
  2. Which rules generate terminal nodes and which generate non-terminal nodes? Why?
  3. How would you write a statement to make the nounPhrase rule optional?
  4. How would you write a statement to allow the nounPhrase to appear an indefinite number of times?
  5. Why is it preferable to use .configure() to set a rule's options, rather than setting them directly?
  6. How would you write a filter function to limit noun rule matches to just "things?"
  7. Can you write a filter for the nounPhrase rule? Why or why not?

Answers:

  1. The command rule is the root rule. It was created with ishml.Rule(), not .snip(), and that's a pretty good sign that we intend it to be used as a root rule. The more accurate answer is that the root rule can be any rule that we use to configure the parser, but we haven't talked about the parser yet...
  2. The command.nounPhrase and also command rules are non-terminal rules because they have sub-rules defined for them with .snip(). The other rules are not "snipped" and therefore are terminal rules. The output of a non-terminal rule is a branch of the syntax tree. The output of a terminal rule is one or more tokens.
  3. command.nounPhrase.configure({ minimum: 0 }) Or, command.nounPhrase.minimum = 0
  4. command.nounPhrase.maximum = Infinity Or, to make the nounPhrase optional as well: command.nounPhrase.configure({ minimum: 0, maximum: Infinity })
  5. Either way works, but using .configure() lets you set multiple options in one statement. It also returns the instance of ishml.Rule so you can use method chaining.
  6. command.nounPhrase.noun.filter = (definition) => definition.part === "noun" && definition.role === "thing"
  7. No, filters only apply to tokens in terminal nodes. The command.nounPhrase rule defines a non-terminal node. Part two of this tutorial covers setting a rule's semantics option for validation of non-terminal nodes.

The Parser

//Code Listing 3

//Create a parser 
var parser = ishml.Parser({ lexicon: lexicon, grammar: command })

//Code Listing 4

var example1 = parser.analyze("take the ruby slipper")

var example2 = parser.analyze("Take ruby slipper")

var example3 = parser.analyze("Take slipper")

var example4 = parser.analyze("Take ruby slipper take")

var example5 = parser.analyze("Take the really pretty ruby slipper")

Study Code Listing 3. Now that we've gone through the hard work of making a lexicon and a grammar, we are ready to define the parser. The constructor for ishml.Parser takes an options argument. The lexicon from Code Listing 1 is used for the lexicon option and the command rule from Code Listing 2 is used for the grammar option. We are now finally ready to do some parsing.

Code Listing 4 shows the parser in action. The .analyze() method applies the parser's grammar and lexicon to the input text and returns the result of the parsing.

The results are returned as a plain JavaScript object. If the input text was successfully parsed, the .success property has a value of true. The .interpretations property contains an array of ishml.Interpretation. The .gist property of each interpretation contains the syntax tree generated from parsing. The structure of .gist follows the structure of the grammar rules used to create it. The nodes of the syntax tree are properties named after the sub-rules that define them. These properties may in turn have other properties defined on them, all the way down to the terminal nodes of the tree. A terminal property contains the token matched by its sub-rule. When a sub-rule allows more than one token to match (maximum greater than 1), the terminal property holds an array of tokens instead. The .remainder property of an interpretation is an empty string, because a successful parsing consumes all the text.
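
For instance, a successful result can be walked like any ordinary JavaScript object. The following sketch uses example1 from Code Listing 4 and relies only on the .definitions array described earlier; for these particular tokens that array holds a single surviving definition.

//A sketch of consuming a successful parse (example1 from Code Listing 4).
if (example1.success)
{
    var gist = example1.interpretations[0].gist //root of the syntax tree
    var verbToken = gist.verb                   //token matched by the verb rule
    var nounToken = gist.nounPhrase.noun        //token matched by the noun rule
    console.log(verbToken.definitions[0].key)   //"take"
    console.log(nounToken.definitions[0].key)   //"slipper"
}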

If the input text fails parsing, the .success property has a value of false. Depending on how the root rule was configured, you may also receive some partial interpretations of the input when parsing fails. For more details see part two of this tutorial, which covers advanced error handling techniques.
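
A basic failure check might look like the sketch below, which uses example4 from Code Listing 4 and assumes partial interpretations were offered.

//A sketch of simple failure handling (example4 from Code Listing 4).
if (!example4.success)
{
    if (example4.interpretations.length > 0)
    {
        //Interpretations are sorted from longest to shortest remainder,
        //so the last one consumed the most input.
        var best = example4.interpretations[example4.interpretations.length - 1]
        console.log("I didn't understand: " + best.remainder)
    }
    else
    {
        console.log("I didn't understand that at all.")
    }
}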

Open your browser's developer tools, refer to Code Listing 4, and try the following exercises:

  1. Enter console.log(example1) in the command console. Did example1 parse successfully? How many interpretations were returned?
    1. Why does example1.interpretations[0].remainder contain an empty string?
    2. What is example1.interpretations[0].gist?
      1. Enter console.log(example1.interpretations[0].gist). In what order are the properties of .gist listed?
      2. Enter Object.keys(example1.interpretations[0].gist). In what order are the properties of .gist listed?
    3. Why is the value of example1.interpretations[0].gist.nounPhrase.adjectives an array containing a token instead of just the token itself? (Hint: review Code Listing 2.)
  2. Enter console.log(example2) in the command console. Did example2 parse successfully?
    1. Why is it valid to omit the word "the" in the input text?
    2. How is the syntax tree affected by the missing article in the input text?
  3. Enter console.log(example3) in the command console. Did example3 parse successfully? Why or why not?
  4. Enter console.log(example4) in the command console. Did example4 parse successfully? Why or why not?
  5. Enter console.log(example5) in the command console. Did example5 parse successfully? Why or why not?

Answers:

  1. Yes, the value of example1.success is true. Two interpretations were returned in the array, example1.interpretations. In the first interpretation ruby was defined as referencing the slipper. In the second interpretation it was defined as referencing the ring. From a syntactic point of view both interpretations are valid, but from a semantic point of view only the first interpretation defines ruby correctly. In part two we'll learn how to write a semantics function to narrow the interpretations to the correct definition.
    1. A successful interpretation consumes all input text.
    2. The syntax tree is contained in example1.interpretations[0].gist. The .gist property of the interpretation is the root node of the syntax tree.
      1. Depending on the version of the browser you are using, they may be arranged alphabetically.
      2. All modern browsers iterate an object's own enumerable properties starting with integer keys in numerical order, followed by string keys in the order in which they were created. Oddly, console.log(), as implemented in some browsers, still insists on alphabetizing object properties for your convenience, which is confusing but harmless.
    3. The maximum option for the .adjectives rule was set to more than 1.
  2. Yes, the value of example2.success is true.
    1. The minimum option for the article rule was set to 0.
    2. There is no article property in the syntax tree. example2.interpretations[0].gist.nounPhrase.article is undefined.
  3. Yes, the value of example3.success is true. Both the article and adjectives are optional for the noun phrase.
  4. No, the value of example4.success is false.
    1. Two partial interpretations were generated with ruby interpreted as an adjective ("take ruby slipper") and two with ruby interpreted as a noun ("take ruby"). These interpretations fit the grammar, but they only partially consume the input text.
    2. The leftover input text is in the interpretation's .remainder property, example4.interpretations[0].remainder. When parsing fails, all attempted interpretations are offered, with the array sorted from longest to shortest remainder.
  5. No, the value of example5.success is false. The input text does not match the command rule. The text "really pretty" does not match the adjectives rule or the noun rule. Therefore, the nounPhrase rule fails, which causes the command rule to fail. By the way, it is possible to match text not found in the lexicon by using the regex option. Regular expressions are covered in part two of this tutorial.

Continue to part two to learn about more complex grammars, including choice rules and semantic analysis.