Parsing Tutorial Part One

Prerequisites: You should have a basic working knowledge of JavaScript and HTML and have read about getting started.

The ISHML parser is a recursive descent parser with backtracking, capable of returning multiple interpretations of an input text. This tutorial provides an introduction to parsing with ISHML.

At its heart, parsing is about breaking a larger piece of text into smaller pieces of text and organizing those small pieces in some manner so that they can be used by an application. The smallest possible unit of meaningful text is called a lexeme. A lexeme is often a single word, but might be as small as a single character, or might be a longer string containing spaces. A lexeme along with its definition is called a token. A syntax tree, also sometimes called a parse tree, organizes the tokens into a hierarchical structure according to a set of rules called a grammar. If this sounds like diagramming a sentence, you have the right idea.

Analyzing text is a multi-step process. The first step, called lexing or tokenization, breaks up the input text into an array of tokens. The second step, syntactic analysis, applies grammar rules to the tokenization, which results in a syntax tree. The third step, semantic analysis, modifies the branches of the syntax tree to change results that are syntactically correct, but not meaningful, into something meaningful, or to discard non-meaningful alternatives.

Now that we have a general notion of parsing, let's look at parsing in the context of the ISHML API. The API exposes its methods and properties via a global variable, ISHML. This tutorial will examine four ISHML objects: .Lexicon, .Rule, .Parser, and .Interpretation.

The Lexicon

Note: All code listings are included as scripts for this web page which you may examine and interact with through your browser's developer tools.

Let's start by defining a collection of tokens to match against the input text. This is done by creating a lexicon and registering some tokens. Think of the lexicon as a dictionary. Each entry in a dictionary consists of a word and one or more definitions; each entry in the lexicon is a lexeme paired with one or more definitions. Just as synonyms are allowed in a dictionary, multiple lexemes may share the same definition.

//Code Listing 1-- Defining a lexicon; registering tokens.
var lexicon=ISHML.Lexicon()
lexicon
    .register("the", "a", "an").as({ part: "article" })
    .register("take", "steal", "grab")
        .as({ key: "take", part: "verb", prepositions: ["to", "from"] })
    .register("drop", "leave").as({ key: "drop", part: "verb", prepositions: [] })
    .register("ring").as({ key: "ring", part: "noun", role: "thing" })
    .register("slipper").as({ key: "slipper", part: "noun", role: "thing" })
    .register("diamond").as({ key: "ring", part: "adjective", role: "thing" })
    .register("diamond jim").as({ key: "jim", part: "noun", role: "npc" })
    .register("jim").as({ key: "james", part: "noun", role: "npc" })
    .register("ruby").as({ key: "ring", part: "adjective", role: "thing" })
    .register("ruby").as({ key: "ruby", part: "noun", role: "thing" })
    .register("ruby").as({ key: "slipper", part: "adjective", role: "thing" })
    .register("glass").as({ key: "slipper", part: "adjective", role: "thing" })
    .register("glass").as({ key: "tumbler", part: "noun", role: "thing" })
    .register("looking glass").as({ key: "mirror", part: "noun", role: "thing" })
    .register("good looking").as({ key: "tumbler", part: "adjective", role: "thing" })
    .register("good").as({ key: "mirror", part: "adjective", role: "thing" })
    .register("tumbler").as({ key: "tumbler", part: "noun", role: "thing" })
    .register("ruby").as({ key: "miss_ruby", part: "noun", role: "npc" })
    .register("sam").as({ key: "sam", part: "noun", role: "npc" })
    .register("from").as({ key: "from", part: "preposition" })
    .register("to").as({ key: "to", part: "preposition" })
            

Study Code Listing 1. The first line creates an instance of ISHML.Lexicon. Notice that the new operator is omitted; it is optional, because ISHML constructors always return a new object instance when called.
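For example, these two statements produce equivalent results:

//The new operator is optional with ISHML constructors.
var lexicon1 = ISHML.Lexicon()
var lexicon2 = new ISHML.Lexicon()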

To add a token to the lexicon, use the .register() method. The .register() method takes one or more lexemes as its arguments and returns the .as() function. The .as() function's argument is the definition of the lexeme(s) and may be anything: a simple value or a complex object. The .as() function returns the instance of the lexicon so that you can chain the .register() method multiple times.

The definition should be thought of as an arbitrary payload to be retrieved when the lexeme is matched. The definitions in Code Listing 1 are highly simplified. In practice, a definition will typically hold one or more references to objects and functions from your application.
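For instance, a definition might bundle a callback along with the grammatical data. In this sketch, the handler property and the actor object are hypothetical application code, not part of ISHML:

//A hypothetical definition carrying application behavior.
//handler and actor.inventory are illustrative, not part of ISHML.
lexicon.register("lamp", "lantern").as({
    key: "lamp",
    part: "noun",
    role: "thing",
    handler: function (actor) { actor.inventory.push("lamp") }
})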

Knowing how to register tokens is all you need to know about the lexicon for most parsing applications. However, the API documentation describes many additional features for other, more advanced uses of the lexicon.

Referring to Code Listing 1, answer the following questions:

  1. What is the definition for "take"?
  2. What other lexemes share the same definition as "take"?
  3. How many definitions does "ruby" have?
  4. Open your web browser's developer tools. A lexicon identical to Code Listing 1 was created when this page was loaded. Register a token in the lexicon by entering a statement in the command console. What is returned? Why? What are the requirements for the argument passed in .as()?
  1. The definition of "take" is { key: "take", part: "verb", prepositions: ["to", "from"] }.
  2. "steal" and "grab" share the same definition as "take".
  3. "ruby" has four definitions (ruby, ring, slipper, and Ms Ruby.)
  4. Entering lexicon.register("banana").as({food:"fruit",name:"banana"}) returns the lexicon so that methods may be chained. The argument of .as() is the definition of the token and may be a simple value or a complex object. It is the arbitrary payload of the token.
//How can we structure the input to look like this?
command //root of syntax tree (JavaScript object)
command.verb //token for "take"
command.nounPhrase //non-terminal
command.nounPhrase.article // token for "the"
command.nounPhrase.adjectives // token for "ruby"
command.nounPhrase.noun // token for "slipper"
Figure 1— Command: Verb + NounPhrase
Figure 2— Noun Phrase: Article + Adjective + Noun

The Grammar

We created tokens in the lexicon, but we also need to define the structure of the syntax tree to be returned after parsing. In ISHML, a plain JavaScript object represents the root of the syntax tree. The object's properties are the child nodes of the tree. Child nodes may have children of their own. In other words, the value of a property may also be a plain JavaScript object if the property represents a branch of the tree (a non-terminal node). All branches of the tree eventually end with the actual token(s) in the terminal nodes.

Let's design a syntax tree for a simple command like "take the ruby slipper." First, we snip that command into a verb followed by a noun phrase. Then, we snip the noun phrase into article, adjectives, and noun.

Now that we know what we are aiming for, we can design the grammar. A grammar is a set of nested rules that describes the syntax tree to be generated during parsing. The structure of the grammar mirrors the structure of the syntax tree. Rules are, in spirit, a JavaScript adaptation of EBNF notation.

//Code Listing 2

//Create a set of nested rules which mirror the wanted syntax tree.
var command = ISHML.Rule()
command.snip("verb").snip("nounPhrase")
command.nounPhrase.snip("article").snip("adjectives").snip("noun")

//Configure behavior of some of the rules with .configure().

command.verb.configure({ filter: (definition) => definition.part === "verb" })

command.nounPhrase.article
    .configure({ minimum: 0, filter: (definition) => definition.part === "article" })

command.nounPhrase.adjectives
    .configure({
        minimum: 0, maximum: Infinity,
        filter: (definition) => definition.part === "adjective"
    })

//Alternatively, a rule's options may be set directly.
command.nounPhrase.noun.filter=(definition) => definition.part === "noun"
            

Study Code Listing 2. It defines a variable, command, as an ISHML.Rule. This is the root rule. The .snip() method creates two sub-rules, verb and nounPhrase. These are also instances of ISHML.Rule. The command.nounPhrase rule further sub-divides into article, adjectives, and noun rules.

Now that we have a set of nested rules that mirrors the basic structure of the syntax tree we want, we can configure the rules. By default, rules return exactly one instance of a branch or token for a given interpretation. In the case of tokens, the default is to return the next token in the tokenization (token list), regardless of how that token is defined.

The default behavior of a rule may be modified with the .configure() method. The .configure() method takes an options object as its argument. It is a plain JavaScript object with properties that are the same as the non-enumerable properties of ISHML.Rule.

We'll look at just a few of the available options. Check the API for more. Let's start with filter. Recall that a single lexeme may have multiple definitions in the lexicon. When a token is retrieved from the lexicon, the definitions are exposed as an array in the .definitions property of the token. We use the filter option to specify a function that will return true for valid definitions and false for invalid definitions. This will filter out the definitions that do not apply in the context of the grammar rule. If the filter returns false for all the definitions of the token, the token is undefined in the context of the rule and the current interpretation is not valid.
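Consider the lexeme "ruby" from Code Listing 1, which has four definitions. Against the noun rule's filter from Code Listing 2, they would fare as follows:

//How "ruby"'s four definitions fare against the noun rule's filter,
//(definition) => definition.part === "noun":
//  {key: "ring", part: "adjective", role: "thing"}    -> false, discarded
//  {key: "ruby", part: "noun", role: "thing"}         -> true, kept
//  {key: "slipper", part: "adjective", role: "thing"} -> false, discarded
//  {key: "miss_ruby", part: "noun", role: "npc"}      -> true, kept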

The minimum option specifies the minimum number of times to repeat the rule when parsing the input. It defaults to 1. Set it to 0 to make the rule optional. The maximum option specifies the maximum number of times to repeat the rule. It also defaults to 1. Set it to Infinity to repeat the rule an indefinite number of times. If the parser is able to repeat a rule any number of times between the minimum and maximum, inclusive, the rule is deemed successful and the interpretation is accepted as valid.
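As a quick reference, where rule stands for any instance of ISHML.Rule, the common repetition patterns look like this:

//Common repetition patterns (rule stands for any ISHML.Rule instance).
rule.configure({minimum: 1, maximum: 1})         //exactly one (the default)
rule.configure({minimum: 0})                     //optional: zero or one
rule.configure({maximum: Infinity})              //one or more
rule.configure({minimum: 0, maximum: Infinity})  //zero or more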

Referring to Code Listing 2, answer the following questions:

  1. Which rule is the root rule? Why?
  2. Which rules generate terminal nodes and which generate non-terminal nodes? Why?
  3. How would you write a statement to make the nounPhrase rule optional?
  4. How would you write a statement to allow the nounPhrase to appear an indefinite number of times?
  5. Why is it preferable to use .configure() to set a rule's options, rather than setting them directly?
  6. How would you write a filter function to limit noun rule matches to just "things?"
  7. Can you write a filter for the nounPhrase rule? Why or why not?
  1. The command rule is the root rule. It was created with ISHML.Rule(), not .snip(), and that's a pretty good sign that we intend it to be the root rule. The more accurate answer is that the root rule can be any rule that we use to configure the parser, but we haven't talked about that yet...
  2. The command.nounPhrase and also command rules are non-terminal rules because they have sub-rules defined for them with .snip(). The other rules are not "snipped" and therefore are terminal rules. The output of a non-terminal rule is a branch of the syntax tree. The output of a terminal rule is one or more tokens.
  3. command.nounPhrase.configure({minimum:0}) Or, command.nounPhrase.minimum=0
  4. command.nounPhrase.maximum=Infinity Or, to make the nounPhrase also optional: command.nounPhrase.configure({minimum:0,maximum:Infinity})
  5. Either way works, but using .configure() lets you set multiple options in one statement. It also returns the instance of ISHML.Rule so you can use method chaining.
  6. command.nounPhrase.noun.filter=(definition) => definition.part === "noun" && definition.role === "thing"
  7. No, filters only apply to tokens in terminal nodes. The command.nounPhrase rule defines a non-terminal node. Part two of this tutorial covers setting a rule's semantics option for validation of non-terminal nodes.

The Parser

//Code Listing 3

//Create a parser 
var parser=ISHML.Parser({lexicon:lexicon,grammar:command})
//Code Listing 4

var example1 = parser.analyze("take the ruby slipper")

var example2 = parser.analyze("Take ruby slipper.",{separator:/[.\s]/})

var example3 = parser.analyze("Take take ruby slipper.",{separator:/[.\s]/})

var example4 = parser.analyze("Take ruby slipper take.",{separator:/[.\s]/})

var example5 = parser.analyze("Take the really pretty ruby slipper.",{separator:/[.\s]/})

Study Code Listing 3. Now that we've gone through the hard work of making a lexicon and a grammar, we are ready to define the parser. The constructor for ISHML.Parser takes an options argument. The lexicon from Code Listing 1 is used for the lexicon option and the command rule from Code Listing 2 is used for the grammar option. We are now finally ready to do some parsing.

Code Listing 4 shows the parser in action. The .analyze() method applies the grammar rule and the lexicon to the input text and returns the result of the parsing. It takes the input text as the first argument. The second argument is an options object.

The input text is first tokenized using the lexicon specified when the parser was constructed. Depending on the lexemes in the lexicon and the content of the input text, tokenizing may result in more than one valid tokenization. That's OK. The ISHML parser is designed to handle multiple alternative tokenizations. After tokenization, the grammar rules are applied and the results are returned as a plain JavaScript object. The properties of this object depend on the results of tokenization and parsing.
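To see how multiple tokenizations arise, recall that Code Listing 1 registers "good", "good looking", "glass", and "looking glass". An input containing "good looking glass" can therefore be split two ways, and both are carried forward into syntactic analysis:

//"take the good looking glass" tokenizes two ways with the lexicon
//from Code Listing 1:
//  take + the + "good" (adjective) + "looking glass" (noun)
//  take + the + "good looking" (adjective) + "glass" (noun)
//Both fit the command grammar, so each would likely yield an interpretation.
var ambiguous = parser.analyze("take the good looking glass")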

If the input text was successfully tokenized and parsed, the .success property has a value of true. The .interpretations property contains an array of ISHML.Interpretation. The .gist property of each interpretation contains the syntax tree generated from parsing. The structure of .gist follows the structure of the grammar rules used to create it. The nodes of the syntax tree are properties named after the sub-rules that define them. These properties may in turn have other properties defined on them, all the way down to the terminal nodes of the tree. A terminal property contains the matching token for its sub-rule. When a sub-rule allows more than one token to match, the value stored in the terminal property is an array of tokens. The .remainder property of each interpretation holds an empty token list, because a successful parsing consumes all the tokens in the tokenization.
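Putting this together, here is a sketch of how an application might walk the syntax tree from Example 1:

//Reading the syntax tree of Example 1.
if (example1.success) {
    var gist = example1.interpretations[0].gist
    gist.verb                  //token for "take"
    gist.nounPhrase.article    //token for "the"
    gist.nounPhrase.adjectives //array of tokens-- here, just "ruby"
    gist.nounPhrase.noun       //token for "slipper"
}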

If the input text failed tokenization or parsing, the .success property has a value of false. If tokenization failed, the .tokenizations property contains an array of partial tokenizations, which can be used to give feedback on why and where tokenization failed. The .interpretations property is undefined.

If tokenization was successful, but parsing was not, the .tokenizations property is undefined and the .interpretations property contains an array of partial interpretations, which can be used to give feedback on why and where parsing failed. A partial interpretation is an interpretation that parses successfully against the grammar, but does not consume all the tokens in the tokenization.
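These three outcomes suggest a simple dispatch pattern. In this sketch, input is assumed to hold the text to be parsed:

//A sketch of dispatching on the result of .analyze().
var result = parser.analyze(input)
if (result.success) {
    //Use result.interpretations[0].gist and so on.
}
else if (result.tokenizations) {
    //Lexical failure: inspect the partial tokenizations for feedback.
}
else {
    //Syntactic failure: inspect the partial interpretations and
    //their .remainder properties for feedback.
}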

There are many options for .analyze(). We'll only cover a few of them. See the API for the complete list.

Example 1 shows the most basic use. By default, the tokenizer expects whitespace between lexemes and ignores it. In Example 2, the separator option has been changed to ignore both whitespace and periods. Any RegExp may be used as a separator. For no separation at all, set the separator to the empty string.

Example 3 is lexically correct, but syntactically incorrect due to the doubled verb. No interpretation can be generated. Example 4, with the second instance of "take" moved to the end, results in a partial interpretation since it is syntactically correct if the last token is not considered.

Example 5 is lexically incorrect because "really pretty" is not registered in the lexicon. The result is a partial tokenization.

Open your browser's developer tools, refer to Code Listing 4, and try the following exercises:

  1. Enter console.log(example1) in the command console. Did example1 parse successfully? How many interpretations were returned?
    1. Why is example1.interpretations[0].remainder.tokens an empty array?
    2. What is example1.interpretations[0].gist?
      1. Enter console.log(example1.interpretations[0].gist). In what order are the properties of .gist listed?
      2. Enter Object.keys(example1.interpretations[0].gist). In what order are the properties of .gist listed?
    3. Why is the value of example1.interpretations[0].gist.nounPhrase.adjectives an array containing a token instead of just the token itself? (Hint: review Code Listing 2.)
  2. Enter console.log(example2) in the command console. Did example2 parse successfully? How many interpretations were returned?
    1. Why is the period character in the input text accepted as valid?
    2. Why is it valid to omit the word "the" in the input text?
    3. How is the syntax tree affected by the missing article in the input text?
  3. Enter console.log(example3) in the command console. Did example3 parse successfully? Why or why not?
    1. Did the input text generate any interpretations?
    2. Did the input text successfully tokenize?
  4. Enter console.log(example4) in the command console. Did example4 parse successfully? Why or why not?
    1. How many interpretations were generated? Why?
    2. In the case of a partial interpretation, are the leftover tokens available?
  5. Enter console.log(example5) in the command console. Did example5 parse successfully? Why or why not?
  1. Yes, the value of example1.success is true. One interpretation was returned in the array example1.interpretations.
    1. A successful interpretation consumes all the tokens in the tokenization.
    2. The syntax tree is contained in example1.interpretations[0].gist. The .gist property of the interpretation is the root node of the syntax tree.
      1. Alphabetically.
      2. All modern browsers iterate an object's own enumerable properties starting with integer keys in numerical order, followed by the remaining properties in the order in which they were created. Oddly, console.log(), as implemented in browsers, still insists on alphabetizing object properties for your convenience, which is confusing, but harmless.
    3. The maximum option for the .adjectives rule was set to more than 1.
  2. Yes, the value of example2.success is true. One interpretation was returned in the array example2.interpretations.
    1. The period was part of the regular expression assigned to the separator option when parser.analyze() was called.
    2. The minimum option for the article rule was set to 0.
    3. There is no article property in the syntax tree. example2.interpretations[0].gist.nounPhrase.article is undefined.
  3. No, the value of example3.success is false.
    1. No, the value of example3.interpretations is undefined.
    2. Yes, example3.tokenizations contains an array (length 1) of tokenizations. The list of tokens is example3.tokenizations[0].tokens.
  4. No, the value of example4.success is false.
    1. Two partial interpretations were generated: "take ruby slipper", where ruby is an adjective, and "take ruby", where ruby is a noun. These interpretations fit the grammar, but they only partially consume the list of tokens in the tokenization.
    2. The leftover tokens remaining are available as a tokenization in an interpretation's .remainder property. For example, example4.interpretations[0].remainder.
  5. No, the value of example5.success is false. The attempts at tokenization are available in example5.tokenizations. For instance, example5.tokenizations[0].tokens contains an array of tokens for the partially tokenized input text and example5.tokenizations[0].remainder contains the portion of the input text that could not be tokenized. By the way, the smooth and fuzzy options could have been used to handle lexemes not found in the lexicon. The smooth option skips over lexemes not found in the lexicon. The fuzzy option creates a token with a definition of "fuzz". A sketch of these options appears below.
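The exact shape of the smooth and fuzzy options is covered in the API. Assuming they are boolean flags on the options object passed to .analyze(), usage might look like this:

//Assumed usage of the smooth and fuzzy options-- check the API
//for the exact shape before relying on this sketch.
parser.analyze("Take the really pretty ruby slipper.", {separator: /[.\s]/, smooth: true}) //skip unknown lexemes
parser.analyze("Take the really pretty ruby slipper.", {separator: /[.\s]/, fuzzy: true})  //unknowns become "fuzz" tokens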

Continue to part two to learn about more complex grammars, including choice rules and semantic analysis.