Parsing Tutorial Part Three

Prerequisites: You should have read Parsing Tutorial Part Two.

Part three of the parsing tutorial focuses on recursive rules. The code listings are also embedded as a script for this page. Open your web browser's developer tools to try the parsing examples.

//Code Listing 1
var lexicon = ishml.Lexicon().register("baa").as({part:"bleat" })
                            
var bleat=ishml.Rule().configure({filter:(definition)=>definition.part==="bleat"})	

var goat=ishml.Rule().configure({mode:ishml.enum.mode.apt})  //Correct
	.snip(1)
	.snip(2)
goat[1].snip("bleat",bleat.clone()).snip("goat",goat)
goat[2].snip("bleat",bleat.clone())

var kid=goat.clone().configure({mode:ishml.enum.mode.any}) //Inefficient

var sheep=ishml.Rule().configure({mode:ishml.enum.mode.apt}) //Wrong
	.snip(1)
	.snip(2)
sheep[1].snip("bleat",bleat.clone())	
sheep[2].snip("bleat",bleat.clone()).snip("sheep",sheep)

var wolf=ishml.Rule() //Wrong-- stack overflow
	.snip(1)
	.snip(2)
wolf[1].snip("sheep",wolf).snip("bleat",bleat.clone())
wolf[2].snip("bleat",bleat.clone())

var lambs=ishml.Rule().snip("bleat") //Correct
lambs.bleat.configure({maximum:Infinity})

Recursion

In the grammar presented in part two, we used .clone() to make deep copies of the nounPhrase rule so that we could modify the copies however we wanted. Had we not cloned the rule, the property settings for the subject, direct object, indirect object would have each overwritten each other and created bugs in the program because all of the rules point to the one and only nounPhrase rule.

Cloning rules is generally safer. However, there are grammars that rely on recursively defined rules, where cloning would not have the desired outcome.

Refer to Code Listing 1. Entering ishml.Parser({lexicon:lexicon,grammar:goat}).analyze("baa baa baa") in the console produces a syntax trees with goat properties nested inside each other and one bleat token at each level. This is achieved by recursively defning the goat rule. One of the sub-rules of goat[1] is the goat rule itself.

The goat rule is right recursive. Rules are right recursive if there is at least one token that is consumed before the rule recurses. In ISHML, sub-rule sequences are always processed from left to right. That is, the sub-rules are processed in the order in which they were created (with .snip. In goat[1] the bleat sub-rule consumes a token prior to the goat sub-rule, which recurses back to the original goat rule.

The wolf rule is left recursive. Entering ishml.Parser({lexicon:lexicon,grammar:wolf}).analyze("baa baa baa") in the console produces a stack overflow! The wolf[1] sub-rule recursively calls the wolf rule before consuming any tokens, which in turn runs wolf[1] again, creating an infinite loop. Left recursion must be eliminated from all ISHML grammars!

Recursion can be very powerful, but it can also make the rules harder to understand. Depending on your needs, consider switching to a simple array of repeating elements as shown in the lambs rule.

Apt Mode

The goat rula also introduces a new mode: ishml.enum.mode.apt. This is another option for choice rules, very similar to ishml.enum.mode.any. The main difference is that .apt rules stop evaluating choices after the first successful match. In the case of the goat rule, if the input text consists of a baa followed by more baa(s) the goat[1] rule matches the input and goat[2]isn't evaluated. Eventually, when all the other tokens have been consumed and there is only one baa remaining, goat[1] will fail and goat[2] will process the last baa.

The kid rule is similar to the goat rule except that the mode has been changes to .any. This rule will structure the ouput similar to the goat rule, but it is less efficient, because both alternatives, kid[1] and kid[2], are always evaluated, even if the kid[1] rule is successful. If your grammar and lexicon are unambigous, you should prefer .apt to .any

The sheep rule contains a mistake. It looks a lot like the goat rule, but sheep[1] and sheep[2] rules are in reverse order compared to goat[1] and goat[2]. Because mode is set to .apt, the rule consumes one token when sheep[1] executes and then stops, resulting in a partial interpretation.

//Code Listing 2			
var lexicon=ishml.Lexicon() --reusing lexicon from code listing 1 above.
lexicon
	.register("(").as({part:"begin"})
	.register(")").as({part:"end"})
	.register("+").as({part:"termOp", operation:(a,b)=>a+b})
	.register("-").as({part:"termOp",operation:(a,b)=>a-b})
	.register("*").as({part:"factorOp",operation:(a,b)=>a*b})
	.register("/").as({part:"factorOp",operation:(a,b)=>a/b})
	.register("^","**").as({part:"powerOp",operation:(a,b)=>a**b})

var expression=ishml.Rule().configure({separator:/^\s*/})
var term=ishml.Rule().configure({separator:/^\s*/})	
var power=ishml.Rule().configure({separator:/^\s*/})
var group=ishml.Rule().configure({separator:/^\s*/})
var number=ishml.Rule().configure({separator:/^\s*/})

expression.snip("term",term).snip("operations")
expression.operations.snip("operator").snip("operand",term)
	.configure({minimum:0, maximum:Infinity,greedy:true})
expression.operations.operator.configure({filter:(definition)=>definition.part==="termOp",longest:true})

term.snip("power",power).snip("operations")
term.operations.snip("operator").snip("operand",power)
	.configure({minimum:0, maximum:Infinity, greedy:true})
term.operations.operator.configure({filter:(definition)=>definition.part==="factorOp",longest:true})

power.snip("group",group).snip("operations")
power.operations.snip("operator").snip("operand",power)
	.configure({minimum:0, greedy:true})
power.operations.operator.configure({filter:(definition)=>definition.part==="powerOp",longest:true})

group.configure({mode:ishml.enum.mode.apt})
	.snip(1)
	.snip(2,number)
group[1].snip("begin").snip("expression",expression).snip("end")
group[1].begin.configure({keep:false,filter:(definition)=>definition.part==="begin"})         
group[1].end.configure({keep:false,filter:(definition)=>definition.part==="end"}) 

number.configure({regex:/^-?([0-9]*[.])?[0-9]+/})
	.semantics=(interpretation)=>
	{
		interpretation.gist=Number(interpretation.gist.lexeme)
		return interpretation
	}

getOperation=(interpretation)=>
{
	interpretation.gist=interpretation.gist.definition.operation
	return interpretation
}

calculate=(interpretation)=>
{
	var {gist} = interpretation
	var {expression, term, power, group, operand,operations}=gist
	var result=expression || term || power || group ||operand|| gist
	if (operations)
	{
		if (operations instanceof Array)
		{
			operations.forEach(function(operation)
			{
			result=operation.operator(result,operation.operand)
			})
		}
		else
		{
			result=operations.operator(result,operations.operand)
		}	
		
	}
	interpretation.gist=result
	return interpretation
}

expression.operations.operator.semantics=getOperation
term.operations.operator.semantics=getOperation	
power.operations.operator.semantics=getOperation

expression.semantics=calculate
term.semantics=calculate    
group.semantics=calculate
power.semantics=calculate

var parser=ishml.Parser({lexicon:lexicon,grammar:expression}) 

console.log(parser.analyze("18/6/3"))  // 1
console.log(parser.analyze("(1+2)*3+5")) // 14
console.log(parser.analyze("(2^3-3)*4")) // 20

More Recursion

A classic example of recursion is the grammar for basic arithmetic expressions shown in Code Listing 2. It parses and evaluates expressions involving addition, subtraction, multiplication, divsion, and exponentiation. For the sake of brevity the grammar does not support the unary negation operator.

To review basic arithmetic, recall that addition, subtraction, multiplication, and division operations are performed from left to right (left associative operations) with multiplication and division taking precedence over addition and subtraction. Exponentiation is performed right to left (right associative) and has precedence over multiplication and division. The order of these operations may also be overridden by the use of parentheses.

The lexicon defines the operators. Notice that functions to perform the operations are stored as part of each operator's definition. This will simplify the semantic processing to evaluate the expression.

Due to the recursive nature of this grammar, the expression, term, power, and group rules reference each other. It is perfectly acceptable to use a rule in another rule before it is configured. In the case of this grammar, it is absolutely necessary.

In the group rule we alos see .apt mode as opposed to .any mode. If we're able to evaluate as an expression wrapped in parentheses, there is no point in considering evaluating as a number.

The expression rule is the top level rule for the grammar and breaks down the input text into a series of terms that can be added or subtracted. The term rule further decomposes each term into a series of powers that can be multiplied or divided.

The power rule decomposes each power into groups that can be exponentiated. The group rule decomposes a group into either an expression to be recursively evaluated by the expression rule or a number. The power rule also uses recursion to nest the power nodes in the syntax tree. The semantics function attached to these nodes works from the leaves up, carrying the result up the tree in right associative order. For the left associative operations, the .operations array is iterated over, applying the semantics function on the operands and accumulating the result.

The number rule uses a regular expression instead of the lexicon to match numbers, by defining the regex property. Note the use of ^ to anchor the search to the head of the input text. Since the lexicon is not used for this rule, there is no token definition to be retrieved. Instead, a token is constructed. The token's lexeme is set to the string returned by the regular expression match. The token's definition is simply set to {fuzzy:true}.

Additional Options

Let's look at a few of the options that were used in the calculator grammar to change the default behavior of the rules.

This grammar allows white space, but does not require it. 1 + 1 is acceptable, but so is 1+1. Setting the separator option to {separator:/^\s*/} makes whitespace between lexemes optional.

By default, the parser expects ambiguity; all combinations of repeating items are consider from minimum to maximum and all potential choices are evaluated; all potential matches against the lexicon from shortest to longest are considered. This is unnecessary for a calculator. The result of an arithmetic expression is unambigous; there is one and only one answer. The greedy option causes the parser to only consider the longest series whem maximum is set to more than one. The longest option causes only longest possible match to be considered when retrieving tokens from the lexicon. For example, in this grammar, ""**" may be used for exponentiation. The parser doesn't need to consider "**" as two multiplication symbols in a row. Setting longest to true will avoid the parser even considering the muliplication token when it appears twice in row. Instead it will only consider the longer token for exponentiation. The greedy and longest options allow the parser to operate more efficiently, but should only be used where the grammar is unambiguous.

The keep option may be set to false to discard some branches and nodes. In this grammar the left and right parentheses are discarded. While they do identify the start and end of a group, the parentheses are not actually needed for the calculation.

Next Steps

This has been a brief tour of parsing using the ISHML API. We did not cover all the configure options for rules and parsing. See the API Reference for complete documentation. The lexicon and grammar discussed in this tutorial is attached as a script for this web page. Open your browser's developer tools and experiment.