Prerequisites: You should have a basic working knowledge of JavaScript and HTML and have read about getting started.
All code listings are included as scripts for this web page which you may examine and interact with through your browser's developer tools.
The ISHML parser is a recursive descent parser with backtracking, which is capable of returning multiple interpretations of an input text. This tutorial provides an introduction to parsing with ISHML.
At the heart of it, parsing is about breaking up a larger piece of text into smaller pieces of text and organizing those small pieces in some manner so that they can be used by an application. The smallest possible unit of meaningful text is called a lexeme. A lexeme is often a single word, but might be as small as a single character, or it might be a longer string containing spaces. A lexeme along with its definition is called a token. A syntax tree, also sometimes called a parse tree, orgainzes the tokens into a hierarchical structure according to a set of rules called a grammar. If this sounds like diagramming a sentence, you have the right idea.
Analyzing text requires both syntactic analysis and semantic analysis. Syntactic analysis applies grammar rules to generate a syntax tree. Semantic analysis, modifies the branches of the syntax tree to change results that are syntactically correct, but not meaningful, into something meaningful, or to discard non-meaningful alternatives.
Now that we have a general notion of parsing, let's look at parsing in the context of the ISHML API. The API exposes its methods and properties via a global variable, ishml
. This tutorial will examine four ishml
objects: .Lexicon
, .Rule
, .Parser
, and .Interpretation
.
Let's start by defining a collection of tokens to match against the input text. This is done be creating a lexicon and registering some tokens. Think of the lexicon as a dictionary. Each entry in a dictionary consists of a word and one or more definitions. Each entry in the lexicon is a lexeme paired with one or more definitions. Just as synonyms are allowed in a dictionary, multiple lexemes may share the same definition.
//Code Listing 1-- Definining a lexicon; registering tokens. var lexicon = ishml.Lexicon() lexicon .register("the", "a", "an").as({ part: "article" }) .register("take", "steal", "grab") .as({ key: "take", part: "verb", prepositions: ["to", "from"] }) .register("drop", "leave").as({ key: "drop", part: "verb", prepositions: [] }) .register("ring").as({ key: "ring", part: "noun", role: "thing" }) .register("slipper").as({ key: "slipper", part: "noun", role: "thing" }) .register("diamond").as({ key: "ring", part: "adjective", role: "thing" }) .register("diamond jim").as({ key: "jim", part: "noun", role: "npc" }) .register("jim").as({ key: "james", part: "noun", role: "npc" }) .register("ruby").as({ key: "ring", part: "adjective", role: "thing" }) .register("ruby").as({ key: "ruby", part: "noun", role: "thing" }) .register("ruby").as({ key: "slipper", part: "adjective", role: "thing" }) .register("glass").as({ key: "slipper", part: "adjective", role: "thing" }) .register("glass").as({ key: "tumbler", part: "noun", role: "thing" }) .register("looking glass").as({ key: "mirror", part: "noun", role: "thing" }) .register("good looking").as({ key: "tumbler", part: "adjective", role: "thing" }) .register("good").as({ key: "mirror", part: "adjective", role: "thing" }) .register("tumbler").as({ key: "tumbler", part: "noun", role: "thing" }) .register("ruby").as({ key: "miss_ruby", part: "noun", role: "npc" }) .register("sam").as({ key: "sam", part: "noun", role: "npc" }) .register("from").as({ key: "from", part: "preposition" }) .register("to").as({ key: "to", part: "preposition" })
Study Code Listing 1. The first line creates an instance of ishml.Lexicon
. Notice that the new
operator is omitted. The new operator is optional for most ISHML constructors.
To add a token to the lexicon, the .register()
method is used. The .register()
method takes one or more lexemes as its arguments and returns the .as()
function. The .as()
function argument is the definition of the lexeme(s) and may be anything— a simple value or complex object. The .as()
function returns the instance of the lexicon so that you can chain the .register()
method multiple times.
The definition should be thought of as an arbitrary payload to be retrieved when the lexeme is matched. The definitions in Code Listing 1 are highly simplified. In practice, a definition will typically hold one or more references to objects and functions from your application.
Knowing how to register tokens is all you need to know about the lexicon for most parsing applications. However, the API discusses many additional features for other, more advanced uses of the lexicon.
Referring to Code Listing 1, answer the following questions:
.as()
?{key:"take",part:"verb"}
lexicon.register("banana").as({food:"fruit",name:"banana"})
returns the lexicon so that methods may be chained. The argument of .as()
is the definition of the token and may be a simple value or a complex object. It is the arbitrary payload of the token.//How can we struture the input to look like this? command //root of syntax tree (JavaScript object) command.verb //token for "take" command.nounPhrase //non-terminal commmand.nounPhrase.article // token for "the" commmand.nounPhrase.adjectives // token for "ruby" commmand.nounPhrase.noun // token for "slipper"
We created tokens in the lexicon, but we also need to define the structure of the syntax tree to be returned after parsing. In ISHML, a plain JavaScript object represents the root of the syntax tree. The object's properties are the child nodes of the tree. Child nodes may have children of their own. In other words, the value of a property may also be a plain JavaScript object if the property represents a branch of the tree (a non-terminal node.) All branches of the tree eventually end with the actual token(s) in the terminal nodes.
Let's design a syntax tree for a simple command like "take the ruby slipper." First, we snip that command into a verb followed by a noun phrase. Then, we snip the noun phrase into article, adjectives, and noun.
Now that we know what we are aiming for, we can design the grammar. A grammar is a set of nested rules that describes the syntax tree to be generated during parsing. The structure of the grammar mirrors the structure of the syntax tree. Rules are, in spirit, a JavaScript adaptation of EBNF notation.
//Code Listing 2 //Create a set of nested rules which mirror the wanted syntax tree. var command = ishml.Rule() command.snip("verb").snip("nounPhrase") command.nounPhrase.snip("article").snip("adjectives").snip("noun") //Configure behavior of some of the rules with .configure(). command.verb.configure({ filter: (definition) => definition.part === "verb" }) command.nounPhrase.article .configure({ minimum: 0, filter: (definition) => definition.part === "article" }) command.nounPhrase.adjectives .configure( { minimum: 0, maximum: Infinity, filter: (definition) => definition.part === "adjective" }) //alternatively the rule's options may be set directly. command.nounPhrase.noun.filter = (definition) => definition.part === "noun"
Study Code Listing 2. It defines a variable, command
, as an ishml.Rule
. This is the root rule. The .snip()
method creates two sub-rules, verb
and nounPhrase
. These are also instances of ishml.Rule
. The command.nounPhrase
rule further sub-divides into article
, adjective
, and noun
rules.
Now that we have a set of nested rules that mirrors the basic structure of the syntax tree we want, we can configure the rules. By default, rules return exactly one instance of a branch or token for a given interpretation. In the case of tokens, the default is to return the next token matching the head of the input text.
The default behavior of a rule may be modified with the .configure()
method. The .configure()
method takes an options object as its argument. The options object is a plain JavaScript object with properties that are the same as the non-enumerable properties of ishml.Rule.
We'll look at just a few of the available options. Check the API for more. Let's start with filter
. Recall that a single lexeme may have multiple definitions in the lexicon. When a token is retrieved from the lexicon, all the definitions for it are exposed as an array in the .definitions
property of the token. We use the filter
option to specify a function that will return true
for valid definitions and false
for invalid definitions. This will filter out the definitions that do not apply in the context of the grammar rule. If the filter returns false
for all the definitions of the token, the token is undefined in the context of the rule and the current interpretation is not valid.
The minimum
option specifies the mininum number of times to repeat the rule when parsing the input. It defaults to 1. Set minimum
to 0 to make the rule optional. The maximum
option specifies the maximum number of times to repeat a rule. It defaults to 1. Set it to Infinity
to repeat the rule an indefinite number of times. If the parser is able to repeat a rule any number of times between the minimum
and maximum
inclusively the rule is deemed successful and the interpretation is accepted as valid.
Referring to Code Listing 2, answer the following questions:
.configure
to set the rules options, rather than setting them directly?
command
rule is a root rule. It was created with ishml.Rule()
, not .snip()
and that's a pretty good sign that we intead it to be used as a root rule. The more accurate answer is that root rule can be any rule that we use to configure the parser, but we haven't talked about the parser yet...command.nounPhrase
and also command
rules are non-terminal rules because they have sub-rules defined for them with .snip()
. The other rules are not "snipped" and therefore are terminal rules. The output of a non-terminal rule is a branch of the syntax tree. The output of a terminal rule is one or more tokens.command.nounPhrase.configure{minimum:0}
Or, command.nounPhrase.minimum=0
command.nounPhrase.maximum=Infinity
Or, to make the nounPhrase also optional: command.nounPhrase.configure{minimum:0,maximum:Infinity}
.configure()
lets you set multiple options in one statement. It also returns the instance of ishml.Rule
so you can use method chaining.command.nounPhrase.noun.filter=(definition) => definition.part === "noun" && definition.role="thing"
command.nounPhrase
rule defines a non-terminal node. Part two of this tutorial covers setting a rule's semantics option for validation of non-terminal nodes.//Code Listing 3 //Create a parser var parser = ishml.Parser({ lexicon: lexicon, grammar: command }))
//Code Listing 4 var example1 = parser.analyze("take the ruby slipper") var example2 = parser.analyze("Take ruby slipper") var example3 = parser.analyze("Take slipper") var example4 = parser.analyze("Take ruby slipper take") var example5 = parser.analyze("Take the really pretty ruby slipper")
Study Code Listing 3. Now that we've gone through the hard work of making a lexicon and a grammar, we are ready to define the parser. The constructor for ishml.Parser
takes an options argument. The lexicon from Code Listing 1 is used for the lexicon
option and the command rule from Code Listing 2 is used for the grammar
option. We are now finally ready to do some parsing.
Code Listing 4 shows the parser in action. The .analyze()
method applies the parser's grammar and lexicon to the input text and returns the result of the parsing.
The results are returned as a plain JavaScript object. If the input text was successfully parsed, the .success
property has a value of true
. The .interpretations
property contains an array of ishml.Interpretation
. The .gist
property of each interpretation contains the syntax tree generated from parsing. The structure of .gist
follows the structure of the grammar rules used to create it. The nodes of the syntax tree are properties named after the sub-rules that define them. These properties may in turn have other properties defined on them all the way down the syntax tree to the terminal nodes of the tree. A terminal property contains the matching token for the sub-rule. In the case where a sub-rule defines the maximum number of tokens to match to be more than one, the value stored in the terminal property is an array of tokens. The .remainder
property of an interpretation is an empty string because a successful parsing consumes all the text.
If the input text fails parsing, the .success
property has a value of false
. Depending on how the root rule was configured, you may also receive some partial interpretations of the input when parsing fails. For more details see part two of this tutorial, which covers advanced error handling techniques.
Open your browser's developer tools, refer to Code Listing 4, and try the following exercises:
console.log(example1)
in the command console. Did example1
parse successfully? How many interpretations were returned?
example1.interpretations[0].remainder
contain an empty string?example1.interpretations[0].gist
?
console.log(example1.interpretations[0].gist)
. In what order are the properties of .gist
listed?Object.keys(example1.interpretations[0].gist)
. In what order are the properties of .gist
listed?example1.interpretations[0].gist.nounPhrase.adjectives
an array containing a token instead of just the token itself? (Hint: review Code Listing 2.)console.log(example2)
in the command console. Did example2
parse successfully?
console.log(example3)
in the command console. Did example3
parse successfully? Why or why not?
console.log(example4)
in the command console. Did example4
parse successfully? Why
or why not?
console.log(example5)
in the command console. Did example5
parse successfully? Why
or why not?
example1.success
is true
. Two interpretations were returned in the array, example1.interpretations
. In the first interpretation rubywas defined as referencing the slipper. In the second interpretation it was defined as referencing the ring. From a syntactic point of view both interpretations are valid, but from a semantic point of view only the first interpretation defines ruby correctly. In part two we'll learn how to write a semantics function to narrow the interpretations to the correct definition.
example1.interpretations[0].gist
. The .gist
property of the interpretation is the root node of the syntax tree.
console.log()
, as implemented in some browsers, still insists on alphabetizing object properties for your convenience, which is confusing, but harmless.)maximum
option for the .adjectives
rule was set to more than 1.example2.success
is true
.
minimum
option for the article rule was set to 0.example3.interpretations[0].gist.article
is undefined
.example3.success
is true
. Both the article and adjectives are optional for the noun phrase.
example4.success
is false
.
take ruby slipper,and two where ruby is a a noun, "take ruby." These interpretations fit the grammar, but they only partially consumme the input text.
.remainder
property, example4.interpretations[0].remainder
. When parsing fails, all attempted interpretations are offered, with the array sorted from longest to shortest remainder.example5.success
is false
. The input text does not match the command
rule. The text really prettydoes not match the
adjectives
rule or the noun
rule. Therefore, the nounPhrase
rule fails, which causes the command
rule to fail. By the way, it is possible to match text not found in the lexicon by using the regex
option. Regular expressions are cover in part two of this tutorial.Continue to part two to learn about more complex grammars, including choice rules and semantic analysis.