4   Lexical analysis
--------------------

4.1  Aim and structure of the lexer
-----------------------------------

The task of the lexical analyser, or "lexer" for short, is to translate the
text it reads into a sequence of "tokens".  The aim is that the rest of the
compiler should work with the tokens and need never look at the original
text again; Inform mainly achieves this aim.  Each token represents a
sequence of characters, called a "lexeme", which is indivisible from a
syntactic point of view.  For instance, the text

    $12+duck.&feathers

contains five atomic pieces of syntax:

    Token       Lexeme
      1         $12
      2         +
      3         duck
      4         .&
      5         feathers

The Inform lexer has three, or perhaps four, levels:

    Level 1:  reading in text, filtering out control characters, etc.
    Level 2:  a context-free tokeniser
    Level 3:  a context-dependent process of deciding whether any
              identifier names found are keywords or names made up by
              the programmer

To make a fairly revolting analogy: when parsing dinner, lexical analysis
is done with the teeth, and syntax analysis with the stomach.  But with
programming languages it is difficult to give a very precise point at which
one ends and the other begins: and in Inform, the "perhaps level 4" of the
lexer occurs when tokens are sorted out more finely in the expression
parser (is this an operator? is it a constant value? is it a variable
name?), which will be documented as part of the syntax analyser in this
manual.  At any rate, "lexer.c" and "symbols.c" contain levels 1 to 3 only.

For details of some standard algorithms in compiler theory, the account
below and in section 5 gives page references to "ASU": this is Aho, Sethi
and Ullman (1986), "Compilers: Principles, Techniques and Tools", the
so-called "Dragon book".  It's called this because the front cover features
a knight armoured in various algorithms confronting a dragon labelled
"Complexity of compiler design": in the case of Inform, though, the dragon
might reasonably be twice as big and labelled "Ignorance of designer who
originally chose the syntax for Inform".  Compiler-making tools like "lex",
"yacc" and "bison" are difficult to apply to this language, since Inform
doesn't have a convenient LALR(1) grammar (more like LALR(4), in fact).  In
any case the lexical and syntax analysers are hand-coded for speed, which
is generally considered to produce code twice as fast as that generated by
mechanical tools like "yacc".


4.2  Level 1: lexical blocks and buffered input
-----------------------------------------------

The lexer can take input from two sources: from the source code files which
are being compiled, or from a null-terminated string somewhere in the
compiler.  (The latter is used when compiling the "veneer" of run-time
routines.)

A "lexical block" is a piece of text which is individually named and
line-counted (so that error messages can indicate where an error has
occurred with reference to the original source).  Each single file of
source code (the original file and each Include file) is a lexical block in
its own right.  Likewise, if the lexer is reading from an internal string,
then that string is a lexical block.  The LexicalBlock structure holds data
about the position inside such a block.  Note that several may be needed at
once: if one source file includes another, which includes another, then
three different LexicalBlocks will contain useful information.  However,
information about the files currently being worked from is stored in a
different way, and not in LexicalBlock structures.
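To make the nesting concrete, here is a minimal sketch in C of what such a
structure and its stack might look like.  The field and variable names are
assumptions for illustration only, not Inform's actual declarations:

    /* A minimal sketch, not Inform's real code: one LexicalBlock        */
    /* records where we are in one named, line-counted piece of source.  */
    typedef struct LexicalBlock_s
    {   char *name;          /* e.g. the filename, quoted in error messages */
        int line_number;     /* how far through this block we have read     */
    } LexicalBlock;

    /* When Include files nest, one LexicalBlock per open block is       */
    /* needed, so they are naturally kept on a stack:                    */
    #define MAX_INCLUSION_DEPTH 16    /* illustrative limit only          */
    static LexicalBlock lb_stack[MAX_INCLUSION_DEPTH];
    static int lb_sp = 0;             /* number of blocks currently open  */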
For efficiency reasons, we can't read characters out of the files one at a
time: instead we read them in 4096-byte chunks into a buffer.  Note that
this means that the read-position of a file will be well ahead of the
read-position of the lexer within it.  It may have been closed before the
lexer is halfway through its buffer.  A further complication is that it's
natural to use a stack structure to hold the files being read from:

    include file 2           <-- reading from this
    include file 1           <-- still open but not being read from
    main source code file    <-- still open but not being read from

and it is also natural to use a stack structure to hold the LexicalBlocks
being worked through:

    LB for include 2
    LB for include 1
    LB for main source code file

Despite the apparent similarity here, we can't combine this into one stack,
since "include file 2" will have been pushed off the files stack before its
LB is pushed off the LB stack.  Since further Include directives may occur
at any point in the current LB, the stacks can be very different.

Otherwise level 1 is simple and fast.  Note that characters are fed up into
level 2 through a "pipeline" (see the next section for what this means and
why), and are also filtered.  Inform can read source files in plain ASCII
or in any of the ISO 8859 standard ASCII extensions (five variants with
accented Latin characters, plus Cyrillic, Arabic, Hebrew, Greek).  The
filtering process removes any character codes undefined in the current ISO
setting, and normalises new-line and spacing characters:

    00                             remains 0 (meaning "end of file")
    TAB                            becomes SPACE
    0d                             becomes 0a, i.e., '\n'
    other control characters       become '?'
    7f                             becomes '?'
    80 to 9f                       become '?'
    a0 (ISO "non-breaking space")  becomes SPACE
    ad (ISO "soft hyphen")         becomes '-'
    any character undefined in ISO between a0 and ff is mapped to '?'

(Here "ISO" means "the current ISO setting", which can be 0, meaning plain
ASCII only: in this mode any character value of 80 to ff will be filtered
to '?'.)  The filtering ensures that (i) no error message can contain an
unprintable character and (ii) the user cannot type a character which lies
outside a standard ISO set and which might work on one platform but not on
another.

Note that the 0 character is only fed upwards into level 2 when the entire
lexical source has run out: that is, all the source files are exhausted, or
else the string being analysed has run out.


4.3  Level 2: breaking text into lexemes
----------------------------------------

The input to level 2, then, is a stream of characters: and its output is a
stream of tokens, each of which represents a sequence of characters called
a "lexeme".  Some definitions of terms used in the Inform source code:

There is a fixed set of "separators", all of them sequences of 1 to 3
mainly non-alphabetic characters:

    ->   -->   --   -   ++   +   *   /   %   ||   |   &&   &   ~~
    ~=   ~   ==   =   >=   >   <=   <   (   )   ,
    .&   .#   ..&   ..#   ..   .   ::   :   @   ;
    [   ]   {   }   $   ?~   ?
    #a$   #n$   #r$   #w$   ##   #

An "identifier" is a sequence of one or more characters from the set

    A B C D ... Z    a b c ... z    0 1 2 3 4 5 6 7 8 9    _

which does not begin with 0 to 9.  The lexer makes the longest lexeme it
possibly can for any particular token, so the next character after any
identifier will not be one of the set above.
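As an illustration of this longest-match rule, here is a minimal sketch in
C; the function names are illustrative, not Inform's own routines:

    #include <ctype.h>

    /* Sketch only: can character c occur inside an identifier?          */
    /* (A letter, a digit or an underscore.)                             */
    static int is_identifier_char(int c)
    {   return (isalnum(c) || (c == '_'));
    }

    /* Read the longest possible identifier starting at text[pos],       */
    /* copying it into lexeme (assumed large enough) and returning the   */
    /* position just after it.                                           */
    static int read_identifier(const char *text, int pos, char *lexeme)
    {   int n = 0;
        while (is_identifier_char((unsigned char) text[pos]))
            lexeme[n++] = text[pos++];
        lexeme[n] = 0;
        return pos;
    }

Applied to the example text "$12+duck.&feathers" at the position of the
'd', this would read the whole of "duck" and stop at the '.', leaving ".&"
to be matched as a separator.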
The lexemes representing numbers are:

    (a)  one or more characters from the set 0 1 2 3 4 5 6 7 8 9
    (b)  $ followed by one or more characters from
         0 1 2 3 4 5 6 7 8 9 A B C D E F         (hexadecimal numbers)
    (c)  $$ followed by one or more characters from 0 1
                                                  (binary numbers)

Note that "-67" is not a lexeme for a single token: it is broken into two
lexemes, "-" and "67".

A ", followed by any number of characters and then another ", is a single
lexeme.  Similarly for '.

Finally, as a special case, the six separators

    #a$   #n$   #r$   #w$   ##   #

are expected to be followed by an identifier.  For example, "##Take" is a
single lexeme, and is not divided into "##" and "Take".  (Except that #n$
can be followed by any character, so that e.g. #n$1 is a single token,
representing "the new dictionary word '1'" in the language.)

To sum up, the lexical analyser expects legal source code to contain:

    "comments": sequences beginning with a ! character and ending with
        a new-line or the end of the file or string containing it
    any of the lexemes above
    "white space", that is, characters in the set
        space   tab   new-line

When breaking down text into lexemes (and throwing away the white space and
comments), the lexer needs 3 characters of look-ahead.  That is, it can
decide definitely what lexeme character N belongs to provided that it knows
what characters N+1, N+2 and N+3 are.  (The figure 3 occurs because that's
the worst case arising: deciding whether character N is the start of a
separator in the form "#a$" followed by an identifier character.)  Because
of this, level 1 maintains a "pipeline" of four variables to hold the
current character and the next three on the way.  By looking ahead at the
pipeline as needed, level 2 decides what to do with the current character
and then advances one character: the three waiting in the pipeline shuffle
up and a new character is read into the last place in the pipeline.  The
advantage of maintaining a pipeline is that the lexer never needs to undo
any decisions and go backward through the text.

One token is output for each lexeme found, and when the lexer runs out of
input altogether (when all the source files are finished, or when the
string being analysed has run out) a special token, called "end of file",
is output.  Thus the lexer's output is a sequence of tokens from the list:

    numbers
    strings in " quotes
    strings in ' quotes
    separators
    identifiers
    end-of-file

As will be seen in the next section, identifiers are analysed further
before being passed out of the lexer.

Except for the problem of deciding what an identifier means, the lexer is
"context-free": it needs to know nothing about the syntax of the Inform
language, what kind of code it is currently analysing, etc.  There is a
standard state-machine algorithm for writing such a lexer: see ASU, p. 98.
The tricky part is to store a transition table for the states involved
efficiently: one which is neither so small that slow code is required to
read it, nor so large that it takes up an unacceptable amount of memory.
Here the "tokeniser_grid" stores most of a transition table: this is an
algorithm originally suggested to me by Dilip Sequeira, similar to that
used by S. C. Johnson's "yacc" lexical analyser.


4.4  Representation of tokens
-----------------------------

Tokens are internally stored as quadruples:

    typedef struct token_data_s {
        char *text;
        int value;
        int type;
        int marker;
    } token_data;

though the lexer is not responsible for writing "marker" values, which are
the concern of the syntax analyser.
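To make this concrete, the five tokens of the example "$12+duck.&feathers"
from section 4.1 might be stored roughly as follows.  This is a sketch
only: the *_TT type names are real, but the separator and symbol index
values (and the zero markers) are invented for illustration, and the
meanings of the fields are explained below.

    token_data example_tokens[] =
    {   { "$12",      18, NUMBER_TT, 0 },   /* hexadecimal $12 = 18        */
        { "+",         5, SEP_TT,    0 },   /* index in the separator table */
        { "duck",      7, SYMBOL_TT, 0 },   /* index in the symbols table   */
        { ".&",       25, SEP_TT,    0 },
        { "feathers",  8, SYMBOL_TT, 0 },
    };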
The "type" identifies which kind of token it is (this value must be one of the *_TT set #define'd in "header.h"): for example, DQ_TT means "a double-quoted string". The "value" is a number whose meaning depends on the type. For example, in the case of type DQ_TT, the number has no meaning and is not used; in the case of type SEP_TT (for "separator"), the number gives the index in the above table of possible separators, thus identifying which separator it is. "text" is a pointer to a null-terminated string containing the lexeme. There are two reasons this is needed: firstly, so that comprehensible error messages can be printed out from higher up in the compiler, and secondly because in the case of DQ_TT and SQ_TT the text may need to be compiled into Z-text format and added to the static strings area, to the Z-code area or to the dictionary. The decision on whether the string should be compiled and if so, what to, cannot be taken until code generation time. Unfortunately code generation time may not be for a while, and this raises a problem: we clearly need to keep lexeme text in memory for a while, but it would be very costly on memory to keep the text of every token in memory permanently. When is it safe for the lexer to throw a lexeme away? One solution would be for a formal system by which the code generator explicitly deallocated the space, but this is too bureaucratic and too risky (suppose we miss 0.1% of these? Memory will slowly clog up and begin to cause odd errors on large compilation runs). Instead, the lexer simply writes its text cyclically into a large buffer. Code generation almost always takes place before the lexer has got any more than 100 or so bytes of text ahead; since the buffer is a good 10K long, there is a generous safety margin. 4.5 Level 3: identifying identifiers -------------------------------- A "keyword" is an identifier which has a meaning understood by the Inform compiler: for example the lexeme while is understood (in some contexts) to refer to the "while" statement, rather than being a name coined by the programmer for some variable or object. In most small languages such a lexeme would always be a keyword and never a name. (This is the concept of "reserved word": a word reserved for the use of the compiler, forbidden as a name.) For example, the fairly-large language C++ has 48 such keywords. Inform is unable to reserve keywords (that is, forbid their use as names) for two reasons: firstly, because there are 281 of the damn things (and 116 of them are only significant in lines of assembly language: why forbid the use of these to non-assembly-language programmers?), and secondly because it is important to the language that some keywords definitely should be the same as some object names. "Class" is both a keyword meaning "the class-making directive" and a name meaning "the object representing the class of all classes". So the decision "is this identifier intended to be a keyword or intended to be the name of something?" relies on the context. What the lexer does is to divide the 281 keywords into twelve groups. The syntax analyser switches these groups on or off as it finds its way through the program. For example, the group "opcode_names" is only enabled (i.e., switched on) after an @ character has been read at the beginning of a statement. A complication is that case is sensitive for some of these groups and not others. For example, "IFDEF and "Ifdef" both match the directive name "ifdef", but "WHILE" and "While" do not match the statement name "while". 
The rules in fact are:

    When matching...                       Example        Sensitive?

    Language keywords:
        directive names                    Class          No
        keywords used within directives    with           Yes
        statement names                    while          Yes
        keywords used within statements    else           Yes
        condition names                    hasnt          Yes
        built-in function names            random         Yes
        built-in constant names            #code_offset   Yes

    Assembly language keywords:
        opcode names                       put_prop       Yes
        customised opcode names            @"1OP:4S"      Yes
        the stack pointer name             sp             Yes

    Names for variables, objects, etc.     lantern        No
    Names for actions                      Take           No

Moreover, in contexts where local variables are allowed, they take
precedence over the rest.  (For example, you could define a local variable
called "random", and for the duration of that routine the word "random"
would never be recognised as the keyword for the system function random().)

Each keyword group has its own token type.  The token value for a keyword
token is the index within the group: see "header.h" for #define's for all
of these indices.

Finally, there are times when the syntax analyser just wants the raw text
of an identifier, and doesn't want it identified (for efficiency reasons:
this prevents it from being entered into the symbols table).  If the
dont_enter_into_symbols_table flag is set, the lexer simply returns a token
of type DQ_TT, as though the name had been found in double-quotes.


4.6  Hash-coding and comparing names
------------------------------------

It would be computationally very expensive indeed to decide whether string
X is a keyword or not by carrying out direct string comparisons with all
281 keywords.  Instead, hash-coding is used.  The idea is to perform a fast
operation on X whose result is some number from, say, 1 to 100: this is the
hash code of X.  If X is the same string as K, then K must have the same
hash code.  So we organise the keywords into 100 different groups: the
group with hash code 1, the group with hash code 2, etc.: then if h is the
hash code of X, we only need to compare X with the keywords in group h.
For this to be any use, the operation performed has to be chosen so that
the keywords are sorted out into groups with about equal numbers in each.
Many sensible "hash functions" are known (see ASU, p. 437 for experimental
performance data).  Inform's choice is simple but effective: it uses
multiplication rather than bit-shifting, but then on modern CPUs
multiplication is not very expensive.  The algorithm is what ASU would call
"X 30011":

    extern int hash_code_from_string(char *p)
    {   unsigned int32 hashcode = 0;
        for (; *p; p++) hashcode = hashcode*30011 + case_conversion_grid[*p];
        return (int) (hashcode % HASH_TAB_SIZE);
    }

where "hashcode" is meant to overflow repeatedly.  Note that 30011 is a
prime number.  (case_conversion_grid[*p] is just a fast way to convert all
lower case letters to upper case ones, since we want case to be
insignificant when calculating hash codes.)  It doesn't matter if this
algorithm behaves differently on different machines (depending on how the
CPU copes with overflows): all that matters is that it delivers a
reasonable spread of hash values.

The hash code range is 0 to HASH_TAB_SIZE-1: this is a memory setting which
by default is 512.  Reducing this by any substantial amount causes a
significant slowing down of Inform.


4.7  The symbols table
----------------------

If an identifier, in context, isn't a keyword then it is called a "symbol":
this basically means "a name for something created by the programmer"
(except that Inform creates a few symbols automatically as well).  The
section "symbols.c" keeps a table of all the symbols which have turned up
so far.
(It often picks up a few extraneous names of keywords, too, when the lexer
has been looking ahead in the wrong context: but this doesn't matter, and
it's faster to tolerate the slight waste of memory.)

The symbols table is organised into HASH_TAB_SIZE linked lists, each of
which is kept in alphabetical order.  A single operation is provided for
searching it: symbol_index(), which works out the hash code of the string S
supplied (unless it's already known, in which case it need not work it out
again) and looks through the linked list for that hash code until the first
letter matches.  (Thus, the only full string comparisons made are between S
and those strings in the table which have the same hash code and the same
initial letter.)  If S is not present, it is inserted into the table.

In addition to its text, each symbol in the table has four pieces of data
attached: its type, its value, the line on which it was assigned and its
flags.  The symbol with index i has:

    symbs[i]      text
    stypes[i]     type            char
    svals[i]      value           int32
    sflags[i]     flags word      int
    slines[i]     line number     int32

(It would be more elegant to store this as an array of structures, but the
symbols table is a major consumer of memory, so we can't afford to let lazy
compilers waste it by aligning the fields of such a structure at 4-byte
intervals.  Hence, five separate arrays.)

Types and values are assigned by the syntax analyser: the type indicates
what kind of thing the symbol is a name of (a routine, a label, an object,
etc.), and must always be one of the #define names *_T (see "header.h").
The value holds some number indicating which particular thing the symbol is
a name of (a routine address, an object number, etc.).

The line number is set when the symbol is first used, and then reset when
it is first assigned a value (if ever).  The information is stored as

    file-number*0x10000 + line-number

where file-number is an index into the InputFiles array maintained by
"files.c".

The flags word holds (presently) 15 different flags, for different
properties that the symbol name can have.  These flags have #define names
in the form *_SFLAG.

The token for a symbol has type SYMBOL_TT and value equal to the index i of
the symbol within the symbols table.


4.8  Token lookahead: the circle
--------------------------------

The output of the lexer is sent, one token at a time, when the syntax
analyser calls get_next_token().  The syntax analyser is a predictive
parser which makes mistakes and has to backtrack (that is, it reads in
token X, guesses that it might be looking at XY and reads the next token to
see... but reads a Z instead, and has to go back to X and try a different
theory).  So the syntax analyser needs to be able to go backwards, as well
as forwards, in the stream of tokens.

This means that the lexer needs to remember its last N tokens, where N is
the maximum number of backward steps the syntax analyser ever takes in one
go.  The number N is the amount of "lookahead", since one could also think
of it as the maximum number of tokens one needs to look ahead of the
current position to be sure of what is happening.  These tokens are
therefore stored in a "circle" of N+1 tokens: when get_next_token() is
called, the lexer works out the next token, writes it into the circle and
moves on one place clockwise; when put_token_back() is called, the lexer
moves back one place anticlockwise, so that the next get_next_token() will
send back the token in that position rather than work out a new one.
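A minimal sketch of such a circle follows.  The function, variable and
constant names here are illustrative assumptions rather than Inform's own,
but the get/put-back behaviour is the one described above; lex_one_token()
stands in for whatever actually works out a fresh token.

    /* token_data is the quadruple from section 4.4.                     */
    extern token_data lex_one_token(void);   /* hypothetical             */

    #define CIRCLE_SIZE 4             /* stands in for N+1               */

    static token_data circle[CIRCLE_SIZE];
    static int circle_position = 0;   /* place about to be read/written  */
    static int tokens_put_back = 0;   /* how many have been "un-read"    */

    static token_data next_token(void)
    {   token_data t;
        if (tokens_put_back > 0)
            tokens_put_back--;        /* re-issue a token already made   */
        else
            circle[circle_position] = lex_one_token();
        t = circle[circle_position];
        circle_position = (circle_position + 1) % CIRCLE_SIZE;
        return t;
    }

    static void put_back_token(void)
    {   circle_position = (circle_position - 1 + CIRCLE_SIZE) % CIRCLE_SIZE;
        tokens_put_back++;
    }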
However, the interpretation of identifiers as keyword or symbol depends on
the current context, and this may have changed since the token was first
calculated.  Therefore a numerical representation of the context is stored
in the circle too, so that if this has changed and if the token is an
identifier then it is re-identified under the context which now prevails.


4.9  Summary of the lexer's output
----------------------------------

To summarise, the lexer converts source text to a stream of tokens of types
as follows:

--------------------------------------------------------------------------
Type                 Lexeme                        Value
--------------------------------------------------------------------------
NUMBER_TT            literal number in decimal,    the number
                     hex or binary
DQ_TT                string extracted from         (none)
                     double-quotes (or identifier
                     name: see above for when)
SQ_TT                string extracted from         (none)
                     single-quotes
SEP_TT               separator                     a #define'd *_SEP value
EOF_TT               (end of source)               (none)
SYMBOL_TT            identifier assumed to be      index in the symbols
                     a name                        table
LOCAL_VARIABLE_TT    local variable name           1 to 15, its Z-machine
                                                   variable number
STATEMENT_TT         statement name                a #define'd *_CODE value
SEGMENT_MARKER_TT    with/has/class etc.           a #define'd *_SEGMENT
                                                   value
DIRECTIVE_TT         directive name                a #define'd *_CODE value
CND_TT               named conditional             a #define'd *_COND value
SYSFUN_TT            built-in function             a #define'd *_SYSFUN
                                                   value
OPCODE_NAME_TT       opcode name                   a #define'd *_zc value
                                                   (i.e., an internal
                                                   opcode number)
MISC_KEYWORD_TT      keyword like "char" used      a #define'd *_MK value
                     in statement syntax
DIR_KEYWORD_TT       keyword like "meta" used      a #define'd *_DK value
                     in directive syntax
TRACE_KEYWORD_TT     keyword used in syntax        a #define'd *_TK value
                     of "trace" directive
SYSTEM_CONSTANT_TT   system #constant name         a #define'd *_SC value
--------------------------------------------------------------------------

Note that other token types do exist, but are set in "level 4": i.e.,
within the expression parser.
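As a hedged illustration of how a consumer of this stream might branch on
the types above, here is a small sketch in C.  The *_TT names are those
listed in the table, but the function itself, its output and the assumption
that "header.h" supplies token_data are purely illustrative, not part of
Inform:

    #include <stdio.h>
    #include "header.h"   /* assumed to supply the *_TT values and the   */
                          /* token_data type from section 4.4            */

    /* Illustrative only: print a one-line description of a token.       */
    static void describe_token(token_data *t)
    {   switch (t->type)
        {   case NUMBER_TT:  printf("number %d\n", t->value); break;
            case DQ_TT:      printf("string \"%s\"\n", t->text); break;
            case SQ_TT:      printf("string '%s'\n", t->text); break;
            case SEP_TT:     printf("separator no. %d: %s\n",
                                    t->value, t->text); break;
            case SYMBOL_TT:  printf("symbol no. %d: %s\n",
                                    t->value, t->text); break;
            case EOF_TT:     printf("end of source\n"); break;
            default:         printf("keyword: %s\n", t->text); break;
        }
    }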