Prerequisites: You should have read Parsing Tutorial Part Two.
Part three of the parsing tutorial focuses on recursive rules. The code listings are also embedded as a script for this page. Open your web browser's developer tools to try the parsing examples.
//Code Listing 1
var lexicon = ishml.Lexicon().register("baa").as({part:"bleat"})

var bleat=ishml.Rule().configure({filter:(definition)=>definition.part==="bleat"})

var goat=ishml.Rule().configure({mode:ishml.enum.mode.apt}) //Correct
    .snip(1)
    .snip(2)
goat[1].snip("bleat",bleat.clone()).snip("goat",goat)
goat[2].snip("bleat",bleat.clone())

var kid=goat.clone().configure({mode:ishml.enum.mode.any}) //Inefficient

var sheep=ishml.Rule().configure({mode:ishml.enum.mode.apt}) //Wrong
    .snip(1)
    .snip(2)
sheep[1].snip("bleat",bleat.clone())
sheep[2].snip("bleat",bleat.clone()).snip("sheep",sheep)

var wolf=ishml.Rule() //Wrong -- stack overflow
    .snip(1)
    .snip(2)
wolf[1].snip("sheep",wolf).snip("bleat",bleat.clone())
wolf[2].snip("bleat",bleat.clone())

var lambs=ishml.Rule().snip("bleat") //Correct
lambs.bleat.configure({maximum:Infinity})
In the grammar presented in part two, we used .clone() to make deep copies of the nounPhrase rule so that we could modify the copies however we wanted. Had we not cloned the rule, the property settings for the subject, direct object, and indirect object would have overwritten each other and created bugs, because all three would point to the one and only nounPhrase rule.
Cloning rules is generally the safer choice. However, some grammars rely on recursively defined rules, and there cloning would not have the desired outcome.
Refer to Code Listing 1. Entering ishml.Parser({lexicon:lexicon,grammar:goat}).analyze("baa baa baa") in the console produces a syntax tree with goat properties nested inside each other and one bleat token at each level. This is achieved by recursively defining the goat rule: one of the sub-rules of goat[1] is the goat rule itself.
The goat rule is right recursive. A rule is right recursive if at least one token is consumed before the rule recurses. In ISHML, sub-rule sequences are always processed from left to right; that is, the sub-rules are processed in the order in which they were created (with .snip). In goat[1], the bleat sub-rule consumes a token prior to the goat sub-rule, which recurses back to the original goat rule.
The wolf rule is left recursive. Entering ishml.Parser({lexicon:lexicon,grammar:wolf}).analyze("baa baa baa") in the console produces a stack overflow! The wolf[1] sub-rule recursively calls the wolf rule before consuming any tokens, which in turn runs wolf[1] again, recursing without end. Left recursion must be eliminated from all ISHML grammars!
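The difference between the two kinds of recursion is easy to see in a hand-rolled recognizer written in plain JavaScript. This is a sketch of the concept only, not ISHML internals, and the function names are invented for illustration:

```javascript
// Right-recursive recognizer: goat -> "baa" goat | "baa"
// Each call consumes a token *before* recursing, so the position
// always advances and the recursion is guaranteed to bottom out.
function goat(tokens, pos = 0) {
  if (tokens[pos] !== "baa") return null   // consume a token first...
  const rest = goat(tokens, pos + 1)       // ...then recurse on what is left
  return rest !== null ? rest : pos + 1    // position after the match
}

// Left-recursive grammar: wolf -> wolf "baa" | "baa"
// A direct translation would call wolf(tokens, pos) before consuming
// anything, so pos never advances and the stack overflows:
//   function wolf(tokens, pos = 0) { return wolf(tokens, pos) ... }

console.log(goat(["baa", "baa", "baa"]))   // 3 (all three tokens consumed)
console.log(goat(["moo"]))                 // null (no match)
```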
Recursion can be very powerful, but it can also make the rules harder to understand. Depending on your needs, consider switching to a simple array of repeating elements as shown in the lambs rule.
The goat rule also introduces a new mode: ishml.enum.mode.apt. This is another option for choice rules, very similar to ishml.enum.mode.any. The main difference is that .apt rules stop evaluating choices after the first successful match. In the case of the goat rule, if the input text consists of a baa followed by more baa(s), the goat[1] rule matches the input and goat[2] isn't evaluated. Eventually, when all the other tokens have been consumed and only one baa remains, goat[1] will fail and goat[2] will process the last baa.
The kid rule is similar to the goat rule except that the mode has been changed to .any. This rule structures the output like the goat rule, but it is less efficient, because both alternatives, kid[1] and kid[2], are always evaluated, even if the kid[1] rule is successful. If your grammar and lexicon are unambiguous, you should prefer .apt to .any.
The sheep rule contains a mistake. It looks a lot like the goat rule, but the sheep[1] and sheep[2] sub-rules are in the reverse order compared to goat[1] and goat[2]. Because mode is set to .apt, the rule consumes one token when sheep[1] executes and then stops, resulting in a partial interpretation.
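To see why the ordering matters in a first-match-wins choice, here is a plain-JavaScript sketch of the two orderings. It mimics .apt mode; the helper functions are invented for illustration and are not part of ISHML:

```javascript
// First-match-wins choice over a token array, mimicking .apt mode.
function match(tokens, pos, alternatives) {
  for (const alt of alternatives) {
    const end = alt(tokens, pos)
    if (end !== null) return end        // stop at the first success
  }
  return null
}

// Alternative "single": exactly one baa token.
const single = (tokens, pos) => tokens[pos] === "baa" ? pos + 1 : null

// Alternative "chain": one baa token followed by a recursive match.
const chain = (tokens, pos) => {
  const after = single(tokens, pos)
  return after === null ? null : goatLike(tokens, after)
}

// goat ordering: try the recursive alternative first -> consumes everything.
const goatLike = (tokens, pos) => match(tokens, pos, [chain, single])
// sheep ordering: the single-token alternative wins first -> partial parse.
const sheepLike = (tokens, pos) => match(tokens, pos, [single, chain])

const herd = ["baa", "baa", "baa"]
console.log(goatLike(herd, 0))   // 3 -- all three tokens consumed
console.log(sheepLike(herd, 0))  // 1 -- stops after the first token
```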
//Code Listing 2
var lexicon=ishml.Lexicon() //reusing lexicon from code listing 1 above.
lexicon
  .register("(").as({part:"begin"})
  .register(")").as({part:"end"})
  .register("+").as({part:"termOp", operation:(a,b)=>a+b})
  .register("-").as({part:"termOp",operation:(a,b)=>a-b})
  .register("*").as({part:"factorOp",operation:(a,b)=>a*b})
  .register("/").as({part:"factorOp",operation:(a,b)=>a/b})
  .register("^","**").as({part:"powerOp",operation:(a,b)=>a**b})

var expression=ishml.Rule().configure({separator:/^\s*/})
var term=ishml.Rule().configure({separator:/^\s*/})
var power=ishml.Rule().configure({separator:/^\s*/})
var group=ishml.Rule().configure({separator:/^\s*/})
var number=ishml.Rule().configure({separator:/^\s*/})

expression.snip("term",term).snip("operations")
expression.operations.snip("operator").snip("operand",term)
  .configure({minimum:0, maximum:Infinity, greedy:true})
expression.operations.operator.configure({filter:(definition)=>definition.part==="termOp",longest:true})

term.snip("power",power).snip("operations")
term.operations.snip("operator").snip("operand",power)
  .configure({minimum:0, maximum:Infinity, greedy:true})
term.operations.operator.configure({filter:(definition)=>definition.part==="factorOp",longest:true})

power.snip("group",group).snip("operations")
power.operations.snip("operator").snip("operand",power)
  .configure({minimum:0, greedy:true})
power.operations.operator.configure({filter:(definition)=>definition.part==="powerOp",longest:true})

group.configure({mode:ishml.enum.mode.apt})
  .snip(1)
  .snip(2,number)
group[1].snip("begin").snip("expression",expression).snip("end")
group[1].begin.configure({keep:false,filter:(definition)=>definition.part==="begin"})
group[1].end.configure({keep:false,filter:(definition)=>definition.part==="end"})

number.configure({regex:/^-?([0-9]*[.])?[0-9]+/})
  .semantics=(interpretation)=>
  {
    interpretation.gist=Number(interpretation.gist.lexeme)
    return interpretation
  }

getOperation=(interpretation)=>
{
  interpretation.gist=interpretation.gist.definition.operation
  return interpretation
}

calculate=(interpretation)=>
{
  var {gist} = interpretation
  var {expression, term, power, group, operand, operations}=gist
  var result=expression || term || power || group || operand || gist
  if (operations)
  {
    if (operations instanceof Array)
    {
      operations.forEach(function(operation)
      {
        result=operation.operator(result,operation.operand)
      })
    }
    else
    {
      result=operations.operator(result,operations.operand)
    }
  }
  interpretation.gist=result
  return interpretation
}

expression.operations.operator.semantics=getOperation
term.operations.operator.semantics=getOperation
power.operations.operator.semantics=getOperation
expression.semantics=calculate
term.semantics=calculate
group.semantics=calculate
power.semantics=calculate

var parser=ishml.Parser({lexicon:lexicon,grammar:expression})
console.log(parser.analyze("18/6/3"))      // 1
console.log(parser.analyze("(1+2)*3+5"))   // 14
console.log(parser.analyze("(2^3-3)*4"))   // 20
A classic example of recursion is the grammar for basic arithmetic expressions shown in Code Listing 2. It parses and evaluates expressions involving addition, subtraction, multiplication, division, and exponentiation. For the sake of brevity, the grammar does not support the unary negation operator.
To review basic arithmetic, recall that addition, subtraction, multiplication, and division operations are performed from left to right (left associative operations) with multiplication and division taking precedence over addition and subtraction. Exponentiation is performed right to left (right associative) and has precedence over multiplication and division. The order of these operations may also be overridden by the use of parentheses.
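These associativity and precedence rules can be checked directly in JavaScript, whose / operator is left associative and whose ** operator is right associative:

```javascript
// Division associates left: 18 / 6 / 3 means (18 / 6) / 3.
console.log(18 / 6 / 3)        // 1
console.log(18 / (6 / 3))      // 9

// Exponentiation associates right: 2 ** 3 ** 2 means 2 ** (3 ** 2).
console.log(2 ** 3 ** 2)       // 512
console.log((2 ** 3) ** 2)     // 64

// Multiplication takes precedence over addition unless grouped.
console.log(1 + 2 * 3)         // 7
console.log((1 + 2) * 3)       // 9
```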
The lexicon defines the operators. Notice that functions to perform the operations are stored as part of each operator's definition. This will simplify the semantic processing to evaluate the expression.
Due to the recursive nature of this grammar, the expression, term, power, and group rules reference each other. It is perfectly acceptable to use a rule in another rule before it is configured. In the case of this grammar, it is absolutely necessary.
In the group rule we also see .apt mode as opposed to .any mode. If the input can be parsed as an expression wrapped in parentheses, there is no point in also considering it as a number.
The expression rule is the top-level rule for the grammar and breaks down the input text into a series of terms that can be added or subtracted. The term rule further decomposes each term into a series of powers that can be multiplied or divided.
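The same layering of precedence levels can be sketched as a compact recursive-descent evaluator in plain JavaScript. This is independent of ISHML; the function names merely mirror the rules:

```javascript
// A compact recursive-descent evaluator with the same precedence
// layers as the grammar: expression -> term -> power -> group.
function evaluate(input) {
  const tokens = input.match(/\d+(\.\d+)?|\*\*|[()+\-*/^]/g)
  let pos = 0
  const peek = () => tokens[pos]
  const next = () => tokens[pos++]

  function expression() {                  // terms joined by + or -
    let result = term()
    while (peek() === "+" || peek() === "-") {
      result = next() === "+" ? result + term() : result - term()
    }
    return result
  }
  function term() {                        // powers joined by * or /
    let result = power()
    while (peek() === "*" || peek() === "/") {
      result = next() === "*" ? result * power() : result / power()
    }
    return result
  }
  function power() {                       // ^ or ** is right associative
    const base = group()
    if (peek() === "^" || peek() === "**") {
      next()
      return base ** power()               // recurse for right associativity
    }
    return base
  }
  function group() {                       // parenthesized expression or number
    if (peek() === "(") {
      next()
      const result = expression()
      next()                               // discard the ")"
      return result
    }
    return Number(next())
  }
  return expression()
}

console.log(evaluate("18/6/3"))      // 1
console.log(evaluate("(1+2)*3+5"))   // 14
console.log(evaluate("(2^3-3)*4"))   // 20
```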
The power rule decomposes each power into groups that can be exponentiated. The group rule decomposes a group into either an expression to be recursively evaluated by the expression rule or a number. The power rule also uses recursion to nest the power nodes in the syntax tree. The semantics function attached to these nodes works from the leaves up, carrying the result up the tree in right associative order. For the left associative operations, the .operations array is iterated over, applying the semantics function on the operands and accumulating the result.
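The left-to-right accumulation over the operations array amounts to a fold, which can be sketched in plain JavaScript. The {operator, operand} shape here is a simplified stand-in for the nodes the grammar builds, not the actual ISHML gist:

```javascript
// Fold a list of {operator, operand} pairs over a starting value,
// applying each operation left to right (left associative).
function accumulate(start, operations) {
  return operations.reduce(
    (result, op) => op.operator(result, op.operand),
    start
  )
}

// 18 / 6 / 3 expressed as a starting term plus two division operations:
const divide = (a, b) => a / b
const result = accumulate(18, [
  { operator: divide, operand: 6 },
  { operator: divide, operand: 3 },
])
console.log(result)   // 1, i.e. (18 / 6) / 3
```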
The number rule uses a regular expression instead of the lexicon to match numbers, by defining the regex property. Note the use of ^ to anchor the search to the head of the input text. Since the lexicon is not used for this rule, there is no token definition to be retrieved. Instead, a token is constructed. The token's lexeme is set to the string returned by the regular expression match. The token's definition is simply set to {fuzzy:true}.
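The effect of the ^ anchor can be seen with ordinary JavaScript regular expressions:

```javascript
// The number pattern from the grammar, with and without the ^ anchor.
const anchored   = /^-?([0-9]*[.])?[0-9]+/
const unanchored =  /-?([0-9]*[.])?[0-9]+/

console.log(anchored.exec("3.14 remains")[0])   // "3.14" -- at the head
console.log(anchored.exec("pi is 3.14"))        // null -- not at the head
console.log(unanchored.exec("pi is 3.14")[0])   // "3.14" -- found anywhere
```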
Let's look at a few of the options that were used in the calculator grammar to change the default behavior of the rules.
This grammar allows white space, but does not require it. 1 + 1 is acceptable, but so is 1+1. Setting the separator option to {separator:/^\s*/} makes whitespace between lexemes optional.
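A quick check of the separator pattern in plain JavaScript: /^\s*/ matches zero or more whitespace characters at the head of the input, so it succeeds with a zero-length match even when there is no space at all:

```javascript
// The separator pattern: zero or more whitespace characters,
// anchored to the head of the remaining input.
const separator = /^\s*/

// Matches the run of whitespace when it is present...
console.log(separator.exec("  + 1")[0].length)   // 2
// ...and still succeeds with a zero-length match when it is not.
console.log(separator.exec("+1")[0].length)      // 0
```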
By default, the parser expects ambiguity: all combinations of repeating items are considered from minimum to maximum, all potential choices are evaluated, and all potential matches against the lexicon from shortest to longest are considered. This is unnecessary for a calculator. The result of an arithmetic expression is unambiguous; there is one and only one answer. The greedy option causes the parser to only consider the longest series when maximum is set to more than one. The longest option causes only the longest possible match to be considered when retrieving tokens from the lexicon. For example, in this grammar, "**" may be used for exponentiation. The parser doesn't need to consider "**" as two multiplication symbols in a row. Setting longest to true keeps the parser from even considering the multiplication token when the symbol appears twice in a row; instead, it only considers the longer token for exponentiation. The greedy and longest options allow the parser to operate more efficiently, but should only be used where the grammar is unambiguous.
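The longest-match idea can be sketched with an ordinary tokenizer helper in plain JavaScript. This is illustrative only, not the ISHML lexicon implementation:

```javascript
// Pick the longest registered lexeme that matches at the current
// position. With longest-match, "**" is one exponentiation token,
// never two multiplication tokens in a row.
const lexemes = ["*", "**", "+", "^"]

function nextToken(input, pos) {
  let best = null
  for (const lexeme of lexemes) {
    if (input.startsWith(lexeme, pos) &&
        (best === null || lexeme.length > best.length)) {
      best = lexeme
    }
  }
  return best
}

console.log(nextToken("2**3", 1))   // "**"
console.log(nextToken("2*3", 1))    // "*"
```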
The keep option may be set to false to discard some branches and nodes. In this grammar the left and right parentheses are discarded. While they do identify the start and end of a group, the parentheses are not actually needed for the calculation.
This has been a brief tour of parsing using the ISHML API. We did not cover all the configuration options for rules and parsing; see the API Reference for complete documentation. The lexicon and grammar discussed in this tutorial are attached as a script for this web page. Open your browser's developer tools and experiment.