Prerequisites: You should have read Parsing Tutorial Part Two.

Part three of the parsing tutorial focuses on recursive rules. The code listings are also embedded as a script for this page. Open your web browser's developer tools to try the parsing examples.

//Code Listing 1 var lexicon = ishml.Lexicon().register("baa").as({part:"bleat" }) var bleat=ishml.Rule().configure({filter:(definition)=>definition.part==="bleat"}) var goat=ishml.Rule().configure({mode:ishml.enum.mode.apt}) //Correct .snip(1) .snip(2) goat[1].snip("bleat",bleat.clone()).snip("goat",goat) goat[2].snip("bleat",bleat.clone()) var kid=goat.clone().configure({mode:ishml.enum.mode.any}) //Inefficient var sheep=ishml.Rule().configure({mode:ishml.enum.mode.apt}) //Wrong .snip(1) .snip(2) sheep[1].snip("bleat",bleat.clone()) sheep[2].snip("bleat",bleat.clone()).snip("sheep",sheep) var wolf=ishml.Rule() //Wrong-- stack overflow .snip(1) .snip(2) wolf[1].snip("sheep",wolf).snip("bleat",bleat.clone()) wolf[2].snip("bleat",bleat.clone()) var lambs=ishml.Rule().snip("bleat") //Correct lambs.bleat.configure({maximum:Infinity})

In the grammar presented in part two, we used `.clone()`

to make deep copies of the `nounPhrase`

rule so that we could modify the copies however we wanted. Had we not cloned the rule, the property settings for the subject, direct object, indirect object would have each overwritten each other and created bugs in the program because all of the rules point to the one and only `nounPhrase`

rule.

Cloning rules is generally safer. However, there are grammars that rely on recursively defined rules, where cloning would not have the desired outcome.

Refer to Code Listing 1. Entering `ishml.Parser({lexicon:lexicon,grammar:goat}).analyze("baa baa baa")`

in the console produces a syntax trees with `goat`

properties nested inside each other and one `bleat`

token at each level. This is achieved by recursively defning the `goat`

rule. One of the sub-rules of `goat[1]`

is the `goat`

rule itself.

The `goat`

rule is right recursive. Rules are right recursive if there is at least one token that is consumed before the rule recurses. In ISHML, sub-rule sequences are always processed from left to right. That is, the sub-rules are processed in the order in which they were created (with `.snip`

. In `goat[1]`

the `bleat`

sub-rule consumes a token prior to the goat sub-rule, which recurses back to the original `goat`

rule.

The `wolf`

rule is left recursive. Entering `ishml.Parser({lexicon:lexicon,grammar:wolf}).analyze("baa baa baa")`

in the console produces a stack overflow! The `wolf[1]`

sub-rule recursively calls the `wolf`

rule before consuming any tokens, which in turn runs `wolf[1]`

again, creating an infinite loop. Left recursion must be eliminated from all ISHML grammars!

Recursion can be very powerful, but it can also make the rules harder to understand. Depending on your needs, consider switching to a simple array of repeating elements as shown in the `lambs`

rule.

The goat rula also introduces a new mode: `ishml.enum.mode.apt`

. This is another option for choice rules, very similar to `ishml.enum.mode.any`

. The main difference is that `.apt`

rules stop evaluating choices after the first successful match. In the case of the `goat`

rule, if the input text consists of a *baa* followed by more *baa*(s) the `goat[1]`

rule matches the input and `goat[2]`

isn't evaluated. Eventually, when all the other tokens have been consumed and there is only one *baa* remaining, `goat[1]`

will fail and `goat[2]`

will process the last *baa*.

The `kid`

rule is similar to the goat rule except that the `mode`

has been changes to `.any`

. This rule will structure the ouput similar to the `goat`

rule, but it is less efficient, because both alternatives, `kid[1]`

and `kid[2]`

, are always evaluated, even if the `kid[1]`

rule is successful. If your grammar and lexicon are unambigous, you should prefer `.apt`

to `.any`

The `sheep`

rule contains a mistake. It looks a lot like the `goat`

rule, but `sheep[1]`

and `sheep[2]`

rules are in reverse order compared to `goat[1]`

and `goat[2]`

. Because `mode`

is set to `.apt`

, the rule consumes one token when `sheep[1]`

executes and then stops, resulting in a partial interpretation.

//Code Listing 2 var lexicon=ishml.Lexicon() --reusing lexicon from code listing 1 above. lexicon .register("(").as({part:"begin"}) .register(")").as({part:"end"}) .register("+").as({part:"termOp", operation:(a,b)=>a+b}) .register("-").as({part:"termOp",operation:(a,b)=>a-b}) .register("*").as({part:"factorOp",operation:(a,b)=>a*b}) .register("/").as({part:"factorOp",operation:(a,b)=>a/b}) .register("^","**").as({part:"powerOp",operation:(a,b)=>a**b}) var expression=ishml.Rule().configure({lax:true}) var term=ishml.Rule().configure({lax:true}) var power=ishml.Rule().configure({lax:true}) var group=ishml.Rule().configure({lax:true}) var number=ishml.Rule().configure({lax:true}) expression.snip("term",term).snip("operations") expression.operations.snip("operator").snip("operand",term) .configure({minimum:0, maximum:Infinity,greedy:true}) expression.operations.operator.configure({filter:(definition)=>definition.part==="termOp",longest:true}) term.snip("power",power).snip("operations") term.operations.snip("operator").snip("operand",power) .configure({minimum:0, maximum:Infinity, greedy:true}) term.operations.operator.configure({filter:(definition)=>definition.part==="factorOp",longest:true}) power.snip("group",group).snip("operations") power.operations.snip("operator").snip("operand",power) .configure({minimum:0, greedy:true}) power.operations.operator.configure({filter:(definition)=>definition.part==="powerOp",longest:true}) group.configure({mode:ishml.enum.mode.apt}) .snip(1) .snip(2,number) group[1].snip("begin").snip("expression",expression).snip("end") group[1].begin.configure({keep:false,filter:(definition)=>definition.part==="begin"}) group[1].end.configure({keep:false,filter:(definition)=>definition.part==="end"}) number.configure({regex:/^-?([0-9]*[.])?[0-9]+/}) .semantics=(interpretation)=> { interpretation.gist=Number(interpretation.gist.lexeme) return interpretation } getOperation=(interpretation)=> { interpretation.gist=interpretation.gist.definitions[0].operation return interpretation } calculate=(interpretation)=> { var {gist} = interpretation var {expression, term, power, group, operand,operations}=gist var result=expression || term || power || group ||operand|| gist if (operations) { if (operations instanceof Array) { operations.forEach(function(operation) { result=operation.operator(result,operation.operand) }) } else { result=operations.operator(result,operations.operand) } } interpretation.gist=result return interpretation } expression.operations.operator.semantics=getOperation term.operations.operator.semantics=getOperation power.operations.operator.semantics=getOperation expression.semantics=calculate term.semantics=calculate group.semantics=calculate power.semantics=calculate var parser=ishml.Parser({lexicon:lexicon,grammar:expression}) console.log(parser.analyze("18/6/3")) // 1 console.log(parser.analyze("(1+2)*3+5")) // 14 console.log(parser.analyze("(2^3-3)*4")) // 20

A classic example of recursion is the grammar for basic arithmetic expressions shown in Code Listing 2. It parses and evaluates expressions involving addition, subtraction, multiplication, divsion, and exponentiation. For the sake of brevity the grammar does not support the unary negation operator.

To review basic arithmetic, recall that addition, subtraction, multiplication, and division operations are performed from left to right (left associative operations) with multiplication and division taking precedence over addition and subtraction. Exponentiation is performed right to left (right associative) and has precedence over multiplication and division. The order of these operations may also be overridden by the use of parentheses.

The lexicon defines the operators. Notice that functions to perform the operations are stored as part of each operator's definition. This will simplify the semantic processing to evaluate the expression.

Due to the recursive nature of this grammar, the `expression`

, `term`

, `power`

, and `group`

rules reference each other. It is perfectly acceptable to use a rule in another rule before it is configured. In the case of this grammar, it is absolutely necessary.

In the `group`

rule we alos see `.apt`

mode as opposed to `.any`

mode. If we're able to evaluate as an expression wrapped in parentheses, there is no point in considering evaluating as a number.

The `expression`

rule is the top level rule for the grammar and breaks down the input text into a series of terms that can be added or subtracted. The `term`

rule further decomposes each term into a series of powers that can be multiplied or divided.

The `power`

rule decomposes each power into groups that can be exponentiated. The `group`

rule decomposes a group into either an expression to be recursively evaluated by the `expression`

rule or a number. The `power`

rule also uses recursion to nest the power nodes in the syntax tree. The `semantics`

function attached to these nodes works from the leaves up, carrying the result up the tree in right associative order. For the left associative operations, the `.operations`

array is iterated over, applying the `semantics`

function on the operands and accumulating the result.

The `number`

rule uses a regular expression instead of the lexicon to match numbers, by defining the `regex`

property. Note the use of ^

to anchor the search to the head of the input text. Since the lexicon is not used for this rule, there is no token definition to be retrieved. Instead, a token is constructed. The token's lexeme is set to the string returned by the regular expression match. The token's definition is simply set to `{fuzzy:true}`

.

Let's look at a few of the options that were used in the calculator grammar to change the default behavior of the rules.

This grammar allows white space, but does not require it. 1 + 1

is acceptable, but so is 1+1

. Setting the `lax`

option to `true`

makes separators between lexemes optional.

By default, the parser expects ambiguity; all combinations of repeating items are consider from minimum to maximum and all potential choices are evaluated; all potential matches against the lexicon from shortest to longest are considered. This is unnecessary for a calculator. The result of an arithmetic expression is unambigous; there is one and only one answer. The `greedy`

option causes the parser to only consider the longest series whem `maximum`

is set to more than one. The `longest`

option causes only longest possible match to be considered when retrieving tokens from the lexicon. For example, in this grammar, ""**" may be used for exponentiation. The parser doesn't need to consider "**" as two multiplication symbols in a row. Setting `longest`

to `true`

will avoid the parser even considering the muliplication token when it appears twice in row. Instead it will only consider the longer token for exponentiation. The `greedy`

and `longest`

options allow the parser to operate more efficiently, but should only be used where the grammar is unambiguous.

The `keep`

option may be set to false to discard some branches and nodes. In this grammar the left and right parentheses are discarded. While they do identify the start and end of a group, the parentheses are not actually needed for the calculation.

This has been a brief tour of parsing using the ISHML API. We did not cover all the configure options for rules and parsing. See the API Reference for complete documentation. The lexicon and grammar discussed in this tutorial is attached as a script for this web page. Open your browser's developer tools and experiment.