6   Syntax analysis 2: the bottom-up expression parser
--------------------------------------------------

6.1  Map and structure
-----------------

It would be possible to continue the top-down parser down into the level
of expressions: for instance, to have a routine parse_plus() which would
parse_expression(), then match a plus sign, then parse_expression()
again, and so on.  This would however be highly inefficient in terms of
function calls, difficult to recover from when an error occurs, and
require either much greater lookahead or rather complicated mechanisms
for storing parse trees.

Expressions (including assignments, conditions and constant values) are
therefore parsed by a different syntax analyser, which is "bottom up" (it
begins by reading in the tokens, and only works out what they mean
afterward) rather than "top down" (starting from an assumption about the
meaning and then trying to find the evidence to justify it).  The
algorithm used is an operator precedence parser, a simple form of
shift-reduce parser.  There is alas no space to describe this beautiful
algorithm here: see ASU, pp. 195-215.  In this account we treat it as a
"black box".

The expression parser "expressp.c" has the structure:

          --------> "Level 4" of      --------> Operator
    tokens          lexical analysis   etokens  precedence
                                                parser
                                                   |
                                                   |  etokens re-ordered
                                                  \|/   into RPN
                                                Emitter
                                                   |
                                                   |  plain parse tree
                                                  \|/
                                          Lvalue and condition
                                                checking
                                                   |
                                                   |  more detailed
                                                  \|/   parse tree
                                                  out

(This is effectively a magnification of the part of the source map in
section 1.2 above which reads just "expressp.c".)  Note that there is a
single input and a single output channel.

RPN is "reverse Polish notation": for example

      (2 + 4)*45 + 34    becomes    2 4 + 45 * 34 +

the idea being that one reads it left to right: each number is added to a
stack of numbers, and each operation takes some numbers off the stack and
puts an answer back.  (E.g., + takes 2 and 4 off and replaces them with
6.  * takes 6 and 45 off, replacing them with 270.
+ takes 34 and 270 off, replacing them with 304, which is the answer.)
The advantage of this is that it is unambiguous which operator acts on
which operands, and the brackets have gone.

The emitter builds a (plain) parse tree out of this sequence, as follows.
It stacks 2 and then 4:

      Stack:  2  4

It reads +, which it knows has two operands, and takes the top two items
off the stack, joining them into a tree segment like so:

        +
       / \
      2   4

It then makes a special value, say E1, meaning "the sub-expression in
this tree", and stacks that:

      Stack:  E1

It then stacks up 45, and reaches *: it takes off the top two values,
which are E1 and 45, and makes

        *
       / \
      E1  45

which it calls E2, and so on.  When the process is finished, we have the
tree:

                          +     <--- this is E3
                         / \
       this is E2 --->  *   34
                       / \
     this is E1 --->  +   45
                     / \
                    2   4

and the stack contains just E3, which is the "answer".

6.2  The operator precedence grammar
-------------------------------

A list of productions for this grammar would be tiresomely long.  It can
be roughly summarised as:

   expression -> ( expression )
                 a single operator acting on some operands in some way
   operand    -> token representing a constant
                 ( expression )
                 expression ( arguments )
   arguments  -> expression , arguments

The tokens representing a constant are given in the next section.  The
operators have tokens as follows:

--------------------------------------------------------------------------
  Level  Lexeme    Usage               Associativity  Purpose
--------------------------------------------------------------------------
    0    ,         binary              left           separating values to
                                                        work out
    1    =         binary              right          set equal to
    2    &&        binary              left           logical AND
    2    ||        binary              left           logical OR
    2    ~~        unary (prefix)                     logical NOT
    3    ==        binary              none           equal to?
    3    ~=        binary              none           not equal to?
    3    >         binary              none           greater than?
    3    >=        binary              none           greater than or
                                                        equal to?
    3    <         binary              none           less than?
    3    <=        binary              none           less than or
                                                        equal to?
    3    has       binary              none           object has this
                                                        attribute?
    3    hasnt     binary              none           object hasn't this
                                                        attribute?
    3    in        binary              none           first obj a child of
                                                        second?
    3    notin     binary              none           first obj not a child
                                                        of second?
    3    ofclass   binary              none           obj inherits from
                                                        class?
    3    provides  binary              none           obj provides this
                                                        property?
    4    or        binary              left           separating
                                                        alternative values
    5    +         binary              left           16-bit signed addition
    5    -         binary              left           16-bit signed
                                                        subtraction
    6    *         binary              left           16-bit signed
                                                        multiplication
    6    /         binary              left           16-bit signed integer
                                                        division
    6    %         binary              left           16-bit signed
                                                        remainder
    6    &         binary              left           bitwise AND
    6    |         binary              left           bitwise OR
    6    ~         unary (prefix)                     bitwise NOT
    7    ->        binary              left           byte array entry
    7    -->       binary              left           word array entry
    8    -         unary (prefix)                     16-bit (signed!)
                                                        negation
    9    ++        unary (pre/postfix)                read/increment or
                                                        increment/read
    9    --        unary (pre/postfix)                read/decrement or
                                                        decrement/read
   10    .&        binary              left           property address
   10    .#        binary              left           property length
   10    ..& (**)  binary              left           individual property
                                                        address
   10    ..# (**)  binary              left           individual property
                                                        length
  11/14  ( )       binary              left           function call/message
                                                        send
   12    .         binary              left           property value
   12    ..  (**)  binary              left           individual property
                                                        value
   13    ::        binary              left           "superclass" operator
--------------------------------------------------------------------------
  (**): Illegal except internally, in Inform's own source for the
        run-time veneer
--------------------------------------------------------------------------

Note that, unlike C, Inform does not provide unary plus.

A binary infix operator is used in the form "X op Y"; a unary prefix
operator in the form "op X"; a unary postfix operator in the form
"X op".

The "level" is the precedence level: the higher this number, the more
tightly an operator is glued to the operands adjacent to it.  Cases
where two binary infix operators have equal precedence are resolved
according to whether that level is left or right associative.  For
instance,

      a / b / c    means    (a/b)/c

since /, on level 6, is left associative, but

      a = b = c    means    a = (b = c)

since =, on level 1, is right associative.
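The interaction of precedence level and associativity in the shift-reduce parser can be sketched with a toy fragment of the table above. This is a minimal Python sketch with illustrative names, not Inform's own code (the compiler itself is written in C):

```python
# Toy fragment of the precedence table, showing how level plus
# associativity drives the shift/reduce decision.

PREC = {",": (0, "left"), "=": (1, "right"), "+": (5, "left"),
        "-": (5, "left"), "*": (6, "left"), "/": (6, "left")}

def reduces_before(on_stack, incoming):
    """Should the operator on the stack be reduced before the incoming
    operator is shifted?"""
    stack_level, stack_assoc = PREC[on_stack]
    incoming_level, _ = PREC[incoming]
    if stack_level != incoming_level:
        return stack_level > incoming_level   # tighter binding reduces first
    return stack_assoc == "left"              # equal level: associativity decides

# a / b / c means (a/b)/c: the first "/" reduces before the second shifts
assert reduces_before("/", "/") is True
# a = b = c means a = (b = c): the first "=" waits for the second
assert reduces_before("=", "=") is False
```

Resolving equal-precedence clashes by associativity is exactly what lets a single table entry per operator stand in for a long list of grammar productions.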
Cases in the above table where the associativity is "none" mean that an
attempt to silently associate the operator will cause an error.  For
example,

      a == b == c

generates an error as being ambiguous, but (a == b) == c and
a == (b == c) are both permitted.

The function-call brackets have an asymmetric precedence level according
to whether they're being compared with something on the left or on the
right: the point is that

      object.message(...)
      function_returning_an_object(...).property

must be recognised as

      (object.message)(...)                            .  12  >  (...) 11
      (function_returning_an_object(...)).property   (...) 14  >  .  12

respectively.

6.3  Lexical level 4: tokens to etokens
----------------------------------

The shift-reduce parser expects tokens sorted out into operators, such
as "+" or "ofclass", operands such as "34" or "score" (assuming that's a
variable) and three tokens of special significance:

      (    )    end-of-expression

"Level 4 of the lexer" does just this.  It converts the stream output
from "lexer.c" (i.e., from level 3) into a stream of tokens from the
following table (and it uses a good deal of context information to do
so).
  Token type           Meaning
-------------------------------------------------------------------------
  OP_TT                Operator: value is a *_OP constant
  LARGE_NUMBER_TT      Operand (number): value is the number
  SMALL_NUMBER_TT      Operand (number guaranteed in the range 0 to
                         255): value is the number
  VARIABLE_TT          Operand: value is Z-machine variable number
  DQ_TT                Operand (text used as static string): text in text
  DICTWORD_TT          Operand (text used as dictionary word): text in
                         text
  ACTION_TT            Operand (##action number): name of action in text
  SYSFUN_TT            Operand (system fn name): value is a *_SYSF
                         constant
  SYSTEM_CONSTANT_TT   Operand (#system_constant): value is a *_SC
                         constant
  SUBOPEN_TT           Open bracket used to open a sub-expression
  SUBCLOSE_TT          Close bracket used to close a sub-expression
  ENDEXP_TT            End of expression
-------------------------------------------------------------------------

The tokens in this output stream are called "etokens", short for
"expression tokens": they are the raw material from which parse trees
are built up.  (Although formally different from ordinary tokens, they
have the same storage type, "token_data", in the Inform source code.)

At first sight this seems an unnecessarily large stock: why not, for
example, compile static strings and dictionary words and convert them to
their addresses here and now?  Why not evaluate system constants and
action numbers now?  Because: (i) these tokens may not be accepted by
the parser, so we can't compile text on the strength of them (and
indeed, the parse tree which is being parsed may not ever be
code-generated from); and (ii) we can't allow ourselves to lose sight of
where numbers are coming from so soon, or backpatching will be
impossible.

The presence of two different tokens for numbers - one for numbers
guaranteed to be unsigned 8 bit, one for numbers with no such guarantee
and whose value may be any signed 16 bit quantity - is due to the
architecture of the target Z-machine.
In Z-code considerable space savings are made by storing small numbers
in 1, rather than 2, bytes.  We are anxious to take advantage of this.
Unfortunately, what seems to be a small number now may not be a small
number later (after backpatching).  Dictionary word values (such as
'word') are internally stored as the accession number of that word into
the dictionary (say 23, if it is the 23rd different word to be entered)
during the compilation pass, and then backpatched later to the byte
address of this word's dictionary entry: but this will be a large
number.  Most operands with non-NULL backpatch markers are forced into
long form, to be on the safe side.  (This includes all operands which
are forward references to constants not yet defined.)  The exceptions
are that variable and action numbers (though not fake action numbers)
are guaranteed always to fit into short form.

There are two tasks involved in translating tokens to etokens:
evaluating tokens representing quantities into one of the "operand"
etokens, and sorting other tokens into the right operator and special
etokens.  The second process is not as easy as it looks because of
ambiguities, and is discussed in the next section.  This section
discusses the evaluation of quantity tokens.

Tokens in expressions are read in a context where the keywords are local
variables, textual operators, system function names and system constant
names.  Other identifiers are symbol names.  So the token types which
must represent quantities are: DQ_TT, SQ_TT, LOCAL_VARIABLE_TT,
NUMBER_TT, SYMBOL_TT, SYSFUN_TT and SYSTEM_CONSTANT_TT.  (Actually, this
omits a detail: the largely obsolete constant forms #a$..., #r$...,
#n$... are SEP_TT tokens converted into the above; and the very much
still with us constant form ##... is a SEP_TT token converted
straightforwardly into the ACTION_TT etoken.)  These all translate
directly into etokens except for SYMBOL_TT and SQ_TT.
SQ_TT requires a little work because of the Inform syntax for
single-quoted strings:

   'x'      means the character code (i.e. ASCII value) for "x"
   '@ss'    means the character code for the German letter "sz" (i.e.
              the code specified in the accented set in the Z-Machine
              standard document)
   'xyzzy'  means the byte address of the dictionary word "xyzzy"

The first two become SMALL_NUMBER_TT etokens.  The third becomes a
DICTWORD_TT etoken.

This leaves SYMBOL_TT tokens: names for constants.  There are two
possibilities: the symbol is so far unknown, or else it has been
assigned a value in earlier syntax analysis.  In most languages it would
be an error to use an unknown symbol name as a value: in C, for example,
a function name cannot be used in an expression unless it has been
declared beforehand.  Inform is more tolerant: as far as expressions are
concerned, any symbol can be used earlier than the point in the source
code where its value is assigned, excepting only a local or global
variable name.  (This exception is mainly for Z-machine architecture
reasons, but it also makes Inform code more legible.)

When an unknown symbol is reached, it is converted to a LARGE_NUMBER_TT
and marked as SYMBOL_MV, with the value being its index in the symbol
table.  (This allows the backpatcher to know where to find the true
value later.)  When a known symbol is reached which is flagged as
CHANGE_SFLAG, this is treated in the same way.  (CHANGE_SFLAG is given
to symbols defined by Constant definitions whose values themselves
require backpatching: for instance, symbols defined equal to dictionary
words or routine addresses.)  Otherwise, the symbol is not only known,
but its current value is known to be correct, and so: if it's a
variable, it's translated to etoken VARIABLE_TT; otherwise the
LARGE_NUMBER_TT/SMALL_NUMBER_TT decision is now made: if it is marked
with a marker other than VARIABLE_MV, it becomes "long"; otherwise the
value is compared with the range 0 to 255.
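The long/short decision just described can be summarised in a few lines. The token-type and marker names below are the manual's own; the function itself is an illustrative Python sketch, not Inform's C code:

```python
# Sketch of the SMALL_NUMBER_TT / LARGE_NUMBER_TT decision.

def number_etoken_type(value, marker=None):
    """A backpatch-marked value may change later, so it must stay in
    long form; only a value known now, and in 0..255, can be short."""
    if marker is not None:        # e.g. SYMBOL_MV: true value known only later
        return "LARGE_NUMBER_TT"
    if 0 <= value <= 255:
        return "SMALL_NUMBER_TT"
    return "LARGE_NUMBER_TT"

assert number_etoken_type(200) == "SMALL_NUMBER_TT"
assert number_etoken_type(-5) == "LARGE_NUMBER_TT"
# a marked value may look small now but be large after backpatching:
assert number_etoken_type(23, marker="SYMBOL_MV") == "LARGE_NUMBER_TT"
```

The conservatism here (forcing marked operands long) trades a little space for the guarantee that backpatching can never overflow a 1-byte operand.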
Any symbol name fed into the shift-reduce parser is flagged with
USED_SFLAG to indicate that its value has at some point been used in an
expression.  (This information enables "This ... was declared but not
used" warnings to be made at the end of compilation.)

6.4  Resolution of ambiguous tokens
------------------------------

It remains to translate to the etokens OP_TT, SUBOPEN_TT, SUBCLOSE_TT
and ENDEXP_TT.  The token types which represent operators are all from
SEP_TT and COND_TT; open and close bracket are also SEP_TT tokens.

There are two main difficulties here.  Firstly, shift-reduce parsers are
unsuitable for resolving ambiguities, and so it is not possible to map
tokens directly into operators.  The ambiguities in the expression
grammar for Inform are:

   (a)  ++ and -- each refer to either a prefix or a postfix operator

   (b)  -  refers to either a unary prefix operator or a binary infix
        one

   (c)  (  and  )  are used both to indicate function calls and
        subexpressions

   (d)  ,  is used both to indicate a sequence of values, each to be
        evaluated and thrown away with only the last one kept, and to
        separate function arguments

   (e)  -> does not represent an operator when parsing constants as
        operand values in a line of "@" assembly language (because it
        needs to mean "the store value goes in")

These are resolved as follows:

   (a), (b)  The "level 4 lexer" watches for context: for instance,
        MINUS_SEP translates to UNARY_MINUS_OP if there was no previous
        token, or it was an open bracket, or an infix or prefix
        operator, or a comma; otherwise it translates to MINUS_OP.

   (c)  Similarly.  If an open bracket token is read, and the previous
        token was an operand of some kind, then it must represent a
        function call (unless we are in constant context); an extra
        etoken is generated for an imaginary infix operator called
        FCALL_OP.  The open bracket token is then translated to
        SUBOPEN_TT as usual; a close bracket is always SUBCLOSE_TT.
Thus, the stream

      Eroica ( Beethoven , 3 )

is translated into the etokens

      Eroica  FCALL_OP  (  Beethoven  ,  3  )

so that a tree will come out of the emitter looking like

          FCALL_OP
           /    \
     Eroica      ,
                / \
        Beethoven  3

although in fact the emitter recognises what is going on and
automatically simplifies this to:

          FCALL_OP
         /    |    \
        /     |     \
   Eroica Beethoven  3

   (d)  A comma at the top level is either disallowed (in constant
        context) or treated as the level-0 comma operator in the table
        above (just as in C).  Otherwise, comma is a separator of
        arguments.  "Top level" means "top bracket level", which the
        "level 4 lexer" keeps track of by counting the bracket tokens
        going through it.

   (e)  There is a special expression context called "assembly
        language", in which "->" is translated to ENDEXP_TT.

The second main difficulty is entirely of the language designer's own
making.  How is the end of an expression to be detected?  In C, it is
always possible when parsing an expression to say "the end will have
been reached when such-and-such a token is reached": for example, ";" or
"}".  This is not possible in Inform.  Even if one knows in advance what
the terminating token ought to be, it might be impossible to recognise
for context reasons.  For example, the statement

      move <expression> to <expression>;

appears to be no problem: the first expression is terminated by the
keyword "to", the second by ";".  But suppose "to" has another meaning,
as the name of a constant or variable?  In this case,

      move 2 + to to to;

is a legal statement.  For all these reasons, Inform actually detects
the end of an expression at lexical level 4 by watching the context
carefully: for instance, in

      move 4 * banana to ...

when "4 * banana" has been passed, the next token is expected to be an
open bracket, an infix or postfix operator, or (perhaps) a comma.  Since
the next token is none of these, the expression has ended.  Note that
this decision can be taken without knowing whether "to" is a keyword or
a symbol.
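The context-watching rule for ending an expression can be sketched as follows. This is an illustrative Python fragment, not Inform's own code, and the token sets are a small subset of the real ones:

```python
# Sketch of detecting the end of an expression by context, as in
# "move 4 * banana to ...": after a complete operand, only an infix or
# postfix operator, or an open bracket (a function call), can continue
# the expression.

INFIX = {"+", "-", "*", "/", "%", "=", ",", "has", "ofclass"}
POSTFIX = {"++", "--"}

def expression_ends_before(token, previous_was_operand):
    if not previous_was_operand:
        return False              # still expecting an operand: not the end
    return not (token in INFIX or token in POSTFIX or token == "(")

# "to" cannot continue "4 * banana", whether it is a keyword or a symbol:
assert expression_ends_before("to", previous_was_operand=True) is True
# "*" can, so the expression carries on:
assert expression_ends_before("*", previous_was_operand=True) is False
```

Note that the decision never asks what "to" means, only what kind of token could legally follow an operand: this is precisely why the rule works whether or not "to" has been redefined as a constant.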
From a grammatical point of view the worst design flaw in the Inform
language is that array initialisations (either in Array directives or
object property values) consist of sequences of expressions given
without any separator in between.  For example,

      Array X --> 2 4 6 * MAX_FROGS 45 ;

This is arguably a convenient shorthand.  Unfortunately, it falls prey
to that predator of all computing grammars, unary minus.  The directive

      Array X --> 10 -5;

is ambiguous.  Is it an array of 10-5 = 5 entries, or an array of two
entries, 10 and -5?  In Inform 5 there was no problem since constants
were not allowed to contain operators anyway (and negative numbers were
represented by single tokens, which is not true in Inform 6): the
directive can only mean that there are two entries, 10 and -5.  In
Inform 6 the other answer is true, because the parser always makes the
longest match it can.  Since this may mean that some Inform 5 code needs
changing in a few places, a warning that "this is ambiguous" is issued
for unbracketed minus signs used in array initialisations.
(Illustrating the problem further, if the user tries to clarify things
with

      Array A --> 10 (-5);

then this is parsed under the normal rules as the function call 10(-5);
therefore another constant-context rule is used to interpret brackets as
always referring to subexpressions, not function calls.)

Similar difficulties at first seem to occur with < and >, and with <<
and >>.  Does

      <<Action 3 -4>>

refer to the pair (Action, -1) or the triple (Action, 3, -4)?  (It is
now obvious that the correct syntax design would be to insist on a comma
in between 3 and -4.)  Fortunately, though, both Informs 5 and 6 apply a
"make the longest expression possible" principle to this case, so they
agree that the answer is (Action, -1).  On the same grounds, the
statement

      <<Action x++ y>>

is understood as applying ++ to x, not to y, because the "longest
expression possible" when parsing the first expression is "x ++".
A similar slight exception is made to the rules for parsing
open-brackets in action statements, so that

      <<Action firstobj (4*counter)>>

is not parsed as having one argument, the result of the function call
firstobj(4*counter).

6.5  Constant folding
----------------

Recall that the emitter takes as input a stream of etokens from the
shift-reduce parser, consisting of the expression re-ordered into RPN.
When the whole stream has been received (i.e., when it has received
ENDEXP_TT), the emitter has to produce an operand as "answer".  For
instance, if the emitter receives just the operand 56, then the answer
is 56.  More often, the answer may be a special value (written above as
E1, E2, ...) meaning "the numerical value has to be worked out by
working through this tree".  (E1 is actually a new type of etoken, of
type TREE_NODE_TT.)

If this were done strictly, then the original text

      4 + 7

would result in the emitter producing the answer E1, where E1 refers to
the expression tree

        +
       / \
      4   7

and this is not a helpful answer, for two reasons: it means the value
couldn't be used as a constant (since it seems to need Z-code to be
compiled to work it out), and even if it's used in code, there's not
much point in compiling the Z-code instruction

      @add 4 7 -> sp;

just to put the value 11 on the stack.  Therefore, if the emitter finds
an operator that it knows how to calculate (such as +), whose operands
are all constants (and which are not unknown for linking reasons), it
eliminates the sub-expression in favour of the result.  This is called
"constant folding": the constant value 4+7 is folded up into 11.

Two issues arise: firstly, arithmetic must be done exactly as it would
be done in the Z-machine, that is, as signed 16-bit arithmetic.
(Otherwise constants may fold differently in one port of Inform or
another.)  Secondly, constant folding cannot safely be performed on a
marked value.  (For otherwise: consider what might happen to the
expression SODS - LAW, at a time when neither SODS nor LAW has been
defined.
Both are evaluated as constants with value their symbol numbers and
marker SYMBOL_MV: the result is folded to the difference between their
symbol numbers, and the marker is lost: i.e., the result is nonsense.)

The emitter is able to fold the following operators:

      +  -  (unary or binary)  *  /  %  &  |  ~
      ==  ~=  >  <  >=  <=  &&  ||  ~~

The inclusion of the conditionals here is quite important: it enables
directives like

      Iftrue #version_number == 3;

to work, since the expression "#version_number == 3" can be parsed in a
constant context.  The inclusion of unary minus is also important, as it
is only by folding this that the text "-45" can be parsed in constant
context.

6.6  Checking lvalues and simplifying the parse tree
-----------------------------------------------

If the emitter produces a parse tree, then it's now modified according
to several rules.  The aim is to remove any ambiguities in the tree (for
instance, "=" does several different things according to what its left
operand is) and to make a final correctness check (for instance, this is
where "45 = 12" results in an error message).

Firstly, the tree is traversed to find instances where a value is used
where a condition was expected.  (If the expression is in a condition
context, the top node is expected to be a condition; the children of a
logical operator &&, || or ~~ are expected to be conditions.)  The
configuration

      <value>

is replaced by

         ~=        (the "not equal to" condition operator)
        /  \
  <value>   0

If the <value> has as root node the "=" operator, a warning is printed
out, since it seems likely that

      if (x = 4) ...

(which would invariably be found true, since the value of "x=4" is 4,
which is non-zero) is a mistype for

      if (x == 4) ...

A second, depth-first, traversal is then made to remove usages of ~~,
using de Morgan's laws:

      ~~(x && y) = ~~x || ~~y        ~~(x || y) = ~~x && ~~y
      ~~(x == y) = x ~= y            ~~(x >= y) = x < y

and so on.  (Note that, internally, every operator etoken has a
corresponding negated operator etoken.)
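The ~~-removal traversal can be sketched as follows. Tree nodes here are Python tuples, and the negated-operator table mirrors the pairing of each operator etoken with its negation; this is an illustrative sketch, not Inform's implementation:

```python
# Sketch of the ~~-removal traversal.  A condition tree is a tuple
# (op, left, right); "~~" is stored as ("~~", child, None).

NEGATED = {"==": "~=", "~=": "==", ">": "<=", "<=": ">",
           "<": ">=", ">=": "<", "&&": "||", "||": "&&"}

def push_not(tree, negate=False):
    if not isinstance(tree, tuple):      # leaf operand
        return tree
    op, left, right = tree
    if op == "~~":                       # absorb the negation and recurse
        return push_not(left, not negate)
    if negate:
        if op in ("&&", "||"):           # de Morgan: negate both children too
            return (NEGATED[op], push_not(left, True), push_not(right, True))
        return (NEGATED[op], left, right)  # comparison: just flip the operator
    return (op, push_not(left), push_not(right))

# ~~(x == y)  becomes  x ~= y
assert push_not(("~~", ("==", "x", "y"), None)) == ("~=", "x", "y")
# ~~(a && b)  becomes  ~~a || ~~b, with the inner negations absorbed:
assert push_not(("~~", ("&&", ("==", "x", "y"), (">", "a", "b")), None)) == \
       ("||", ("~=", "x", "y"), ("<=", "a", "b"))
```

Because every operator has a negated counterpart, the negation flag always disappears into an operator flip, leaving a tree with no ~~ nodes at all.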
One ambiguity remaining in the tree is that the operators ".", ".#" and
".&" act on both common and individual properties (that is, properties
handled by the Z-machine's architecture directly, and properties handled
by the run-time veneer routines).  The decision about which is intended
is made here, by looking at the right operands of the operators.  (If
it's a common property number, or an individual property number, the
decision is obvious; if it's a variable or some compound expression,
then we decide in favour of individual properties (the run-time veneer
routines can patch things up if this is a wrong decision).)  So a
traversal is made to carry out the following transformation:

        .                <common prop value>     <individual prop value>
       / \    becomes           / \         or           / \
      a   b                    a   b                    a   b

(Since the veneer routines are themselves written in Inform code, a
slightly different translation rule applies to them: "." always means
<common prop value> and so on, while three "secret" operators are
provided, "..", "..&" and "..#", to mean <individual prop value> and so
on.  Outside of the veneer, use of these operators is illegal.)

In the same traversal, function calls are distinguished from sent
messages:

      <fn call>                 <message send>
       / \  \ ...                 / | | \ ...
      .   c  d     becomes       a  b c  d     or  <left as a fn call>
     / \
    a   b

The final traversal performs "lvalue checking".  Five operators act
directly on storage objects (such as variables) rather than values:
these are

      <lvalue> = <value>     ++ <lvalue>     -- <lvalue>
      <lvalue> ++            <lvalue> --

where <lvalue> has to be a reference to a storage object.  There are up
to five kinds of storage object in the tree:

      local/global variable          VARIABLE_TT
      common property value          of an object subexpression
      individual property value      of an object subexpression
      entry in byte -> array         -> subexpression
      entry in word --> array        --> subexpression

This makes 25 combinations.  As well as checking that what ought to be
an lvalue is indeed one of these five possibilities, the traversal also
rewrites the tree explicitly with one of 25 operators.  For instance:

        =                <write into byte array>
       / \                       / | \
     ->   v     becomes         a  b  v
     / \
    a   b

and so on.
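The lvalue rewrite can be sketched as a simple table lookup. The store-operator names below are illustrative placeholders, not Inform's internal ones, and only the "=" row of the 5-by-5 grid is shown:

```python
# Sketch of the lvalue-checking rewrite: "=" is replaced by a specialised
# store operator chosen by the kind of storage on its left.

STORE_OP = {"variable": "SET_VARIABLE",
            ".":        "WRITE_COMMON_PROP",
            "..":       "WRITE_INDIV_PROP",
            "->":       "WRITE_BYTE_ARRAY",
            "-->":      "WRITE_WORD_ARRAY"}

def rewrite_assignment(lhs_kind, lhs_children, value):
    if lhs_kind not in STORE_OP:
        raise SyntaxError("not an lvalue")     # e.g. "45 = 12"
    # the lhs node dissolves: its children become direct operands
    return (STORE_OP[lhs_kind], *lhs_children, value)

# a -> b = v  becomes a single three-operand byte-array write:
assert rewrite_assignment("->", ("a", "b"), "v") == \
       ("WRITE_BYTE_ARRAY", "a", "b", "v")
```

After this pass, code generation for each node no longer needs to inspect the node's children to know what to emit, which is exactly the property the next paragraph claims for the finished tree.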
Thus, the end of this process is a tree representing a syntactically
correct expression (if necessary having mended any errors in the source
code's description of the expression), which has a minimal number of
logical operators and no negation operators, and in which the action to
be taken by any node does not depend on what its children are.

6.7  Summary of parse tree output
----------------------------

It is time to formally specify the input and output of the main routine
in "expressp.c":

      assembly_operand parse_expression(int context)

takes, as argument, one of the seven *_CONTEXT constants:

   QUANTITY_CONTEXT    the default: the result may be any quantity

   VOID_CONTEXT        the expression is used as a statement, so that
                       its value will be thrown away and it only needs
                       to exist for any resulting side-effects (used in
                       function calls and assignments)

   CONDITION_CONTEXT   the result must be a condition

   CONSTANT_CONTEXT    the result must be known now (at compile time)

   ASSEMBLY_CONTEXT    like QUANTITY_CONTEXT, but with the '->' operator
                       forbidden (this is needed for assembly language
                       to indicate store destinations)

   FORINIT_CONTEXT     like VOID_CONTEXT, but with the '::' operator
                       forbidden at the top level (needed to resolve
                       ambiguity in the grammar)

   ARRAY_CONTEXT       like CONSTANT_CONTEXT, but with different rules
                       to resolve whether a minus sign represents unary
                       or binary minus (needed to resolve ambiguity in
                       the grammar)

The return value is the result of the expression.  This is an assembly
operand, the type of which indicates what has happened:

   LONG_CONSTANT_OT, SHORT_CONSTANT_OT
                       the result is this number

   VARIABLE_OT         the result is in this global/local variable

   OMITTED_OT          there is no resulting value (for instance, this
                       happens if the expression was successfully parsed
                       in void context)

   EXPRESSION_OT       the result can only be determined by running the
                       code generated from this parse tree

If an error has occurred in the expression, from which recovery was not
possible, then the return is (short constant) 0.
This should minimise the chance of a cascade of further error messages.

The type EXPRESSION_OT is not a valid Z-machine assembly operand type,
and it's only used here and in the code generator "expressc.c".  The
value of such an operand is an index N, indicating that the root node of
the tree is in Nth position in the expression-tree-array ET.

The expression-tree-array ET[0], ET[1], ... is an array of "nodes", each
of which is a structure as follows:

   {  /* Data used in tree construction */

      int up, down, right;
      int operator_number;        /* Only meaningful for non-leaves */
      assembly_operand value;     /* Only meaningful for leaves    */

      /* Attributes synthesised during code generation */
      ...
   }

The numbers ET[n].up, ET[n].down and ET[n].right are node numbers for
the node above, the leftmost node below, and the next child to the right
of the same parent: they hold -1 if there is no node in that position.
A "leaf" is a node for which ET[n].down == -1.  Leaves hold operands,
and other nodes hold operators.  (Note that leaf operands must be of
operand type SHORT_CONSTANT_OT, LONG_CONSTANT_OT or VARIABLE_OT: they
must be honest Z-machine operands.)  Nothing distinguishes the root node
except that its number happens to be held in an EXPRESSION_OT operand
value somewhere.

A brief word on deallocation: clearly these parse trees need to survive
intact until code generation occurs for them.  No elaborate system is
provided here, since the syntax analyser always generates code for any
expressions it reads while it's still parsing the same syntax line as
the expression was parsed on.  Each time a new syntax line is reached,
"expressp.c" is notified and it then wipes the ET[n] array empty again.
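The up/down/right representation can be exercised with a small sketch. This is a Python model of the node links (the real ET[] is the C array above), holding the tree built in section 6.1 for (2 + 4) * 45 + 34:

```python
# Sketch of walking the ET[] array using up/down/right links, with -1
# meaning "no node".  Node fields: (up, down, right, label); a leaf is a
# node whose down field is -1.

ET = [
    (-1,  1, -1, "+"),     # 0: root (E3)
    ( 0,  3,  2, "*"),     # 1: E2
    ( 0, -1, -1, "34"),    # 2
    ( 1,  5,  4, "+"),     # 3: E1
    ( 1, -1, -1, "45"),    # 4
    ( 3, -1,  6, "2"),     # 5
    ( 3, -1, -1, "4"),     # 6
]

def leaves(n):
    """Yield the operand labels below node n, left to right."""
    up, down, right, label = ET[n]
    if down == -1:
        yield label
    else:
        child = down
        while child != -1:           # walk the sibling chain via .right
            yield from leaves(child)
            child = ET[child][2]

assert list(leaves(0)) == ["2", "4", "45", "34"]
```

The first-child/next-sibling scheme lets a node have any number of children (useful once FCALL_OP nodes have been flattened) while keeping each node a fixed size, which is why a single flat array suffices.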