DM4 §28: How nouns are parsed

The Naming of Cats is a difficult matter,
It isn't just one of your holiday games;
You may think at first I'm as mad as a hatter
When I tell you, a cat must have THREE DIFFERENT NAMES.
— T. S. Eliot (1888–1965), The Naming of Cats

Suppose we have a tomato defined with

name 'fried' 'green' 'tomato',

but which is going to redden later and need to be referred to as “red tomato”. The name property holds an array of dictionary words, so that

(tomato.#name)/2 == 3
tomato.&name-->0 == 'fried'
tomato.&name-->1 == 'green'
tomato.&name-->2 == 'tomato'

(Recall that X.#Y tells you the number of -> entries in such a property array, in this case six, so that X.#Y/2 tells you the number of --> entries, in this case three.) You are quite free to alter this array during play:

tomato.&name-->1 = 'red';
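As a minimal sketch of altering the array in place, the routine below searches an object's name for one word and overwrites it with another. SwapNameWord is a hypothetical helper, not a library routine, and it assumes both words appear somewhere in the source so that they are in the dictionary.

```inform6
! Hypothetical helper: replace the first occurrence of old_word in
! obj's name array with new_word. Returns true if a swap was made.
[ SwapNameWord obj old_word new_word  i count;
    count = obj.#name / 2;            ! number of --> (word) entries
    for (i = 0: i < count: i++) {
        if (obj.&name-->i == old_word) {
            obj.&name-->i = new_word; ! overwrite this slot in place
            rtrue;
        }
    }
    rfalse;                           ! old_word was not among the names
];
```

With this in place, SwapNameWord(tomato, 'green', 'red') would have the same effect as the assignment above, without the caller needing to know which slot holds 'green'.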

The down side of this technique is that it's clumsy, when all's said and done, and not so very flexible, because you can't change the length of the tomato.&name array during play. Of course you could define the tomato

with name 'fried' 'green' 'tomato' 'blank.' 'blank.' 'blank.'
          'blank.' 'blank.' 'blank.' 'blank.' 'blank.' 'blank.'
          'blank.' 'blank.' 'blank.' 'blank.' 'blank.' 'blank.',

or something similar, giving yourself another (say) fifteen “slots” to put new names into, but this is inelegant even by Inform standards. Instead, an object like the tomato can be given a parse_name routine, allowing complete flexibility for the designer to specify just what names it does and doesn't match. It is time to begin looking into the parser and how it works.

· · · · ·

The Inform parser has two cardinal principles: firstly, it is designed to be as “open-access” as possible, because a parser cannot ever be general enough for every game without being highly modifiable. This means that there are many levels on which you can augment or override what it does. Secondly, it tries to be generous in what it accepts from the player, understanding the broadest possible range of commands and making no effort to be strict in rejecting ungrammatical requests. For instance, given a shallow pool nearby, “examine shallow” has an adjective without a noun: but it's clear what the player means. In general, all sensible commands should be accepted but it is not important whether or not nonsensical ones are rejected.

The first thing the parser does is to read in text from the keyboard and break it up into a stream of words: so the text “wizened man, eat the grey bread” becomes

wizened / man / , / eat / the / grey / bread

and these words are numbered from 1. At all times the parser keeps a “word number” marker to keep its place along this line, and this is held in the variable wn. The routine NextWord() returns the word at the current position of the marker, and moves it forward, i.e., adds 1 to wn. For instance, the parser may find itself at word 6 and trying to match “grey bread” as the name of an object. Calling NextWord() returns the value 'grey' and calling it again gives 'bread'.
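Because NextWord() always advances wn, a routine that only wants to look ahead has to put the marker back afterwards. The following is a small sketch of that idea; PeekWord is a hypothetical name, not something the library provides.

```inform6
! Hypothetical helper: report the word at the marker without using it up.
[ PeekWord  saved_wn w;
    saved_wn = wn;       ! remember where the marker was
    w = NextWord();      ! read the word at wn; this adds 1 to wn
    wn = saved_wn;       ! move the marker back again
    return w;
];
```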

Note that if the player had mistyped “grye bread”, “grye” being a word which isn't mentioned anywhere in the program or created by the library, then NextWord() returns 0 for ‘not in the dictionary’. Inform creates the dictionary of a story file by taking all the name words of objects, all the verbs and prepositions from grammar lines, and all the words used in constants like 'frog' written in the source code, and then sorting these into alphabetical order.

▲ However, the story file's dictionary only has 9-character resolution. (And only 6 if Inform has been told to compile an early-model story file: see §45.) Thus the values of 'polyunsaturate' and 'polyunsaturated' are equal. Also, upper case and lower case letters are considered the same. Although dictionary words are permitted to contain numerals or typewriter symbols like -, : or /, these cost as much as two ordinary letters, so 'catch-22' looks the same as 'catch-2' or 'catch-207'.

▲▲ A dictionary word can even contain spaces, full stops or commas, but if so it is ‘untypeable’. For instance, 'in,out' is an untypeable word because if the player were to type something like “go in,out”, the text would be broken up into four words, go / in / , / out. Thus 'in,out' may be in the story file's dictionary but it will never match against any word of what the player typed. Surprisingly, this can be useful, as it was at the end of §18.

· · · · ·

Since the story file's dictionary isn't always perfect, there is sometimes no alternative but to actually look at the player's text one character at a time: for instance, to check that a 12-digit phone number has been typed correctly and in full.

The routine WordAddress(wordnum) returns a byte array of the characters in the word, and WordLength(wordnum) tells you how many characters there are in it. Given the above example text of “wizened man, eat the grey bread”:

WordLength(4) == 3
WordAddress(4)->0 == 'e'
WordAddress(4)->1 == 'a'
WordAddress(4)->2 == 't'

because word number 4 is “eat”. (Recall that the comma is considered as a word in its own right.)
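The character-by-character check mentioned above might be sketched as follows. IsDigitString is a hypothetical helper which tests that a given word is exactly len characters long and made up only of the digits 0 to 9; it assumes the whole number was typed as a single word, without spaces.

```inform6
! Hypothetical helper: true if word number wordnum consists of exactly
! len characters, every one of them a decimal digit.
[ IsDigitString wordnum len  addr i c;
    if (WordLength(wordnum) ~= len) rfalse;
    addr = WordAddress(wordnum);      ! byte array of the typed characters
    for (i = 0: i < len: i++) {
        c = addr->i;
        if (c < '0' || c > '9') rfalse;
    }
    rtrue;
];
```

A 12-digit phone number at the current position could then be tested with IsDigitString(wn, 12).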

▲ The parser provides a basic routine for comparing a word against the texts '0', '1', '2', …, '9999', '10000' or, in other words, against small numbers. This is the library routine TryNumber(wordnum), which tries to parse the word at wordnum as a number and returns that number, if it finds a match. Besides numbers written out in digits, it also recognises the texts 'one', 'two', 'three', …, 'twenty'. If it fails to recognise the text as a number, it returns −1000; if it finds a number greater than 10000, it rounds down and returns 10000.
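A hedged sketch of how TryNumber might be used in practice: the hypothetical routine ParseQuantity accepts a number between 1 and max at the current word and steps past it. It assumes, as in the standard library, that TryNumber itself leaves wn unchanged.

```inform6
! Hypothetical helper: parse the word at wn as a quantity from 1 to max.
! Returns the quantity, or 0 if the word is unsuitable (wn is then unmoved).
[ ParseQuantity max  n;
    n = TryNumber(wn);
    if (n == -1000) return 0;         ! not recognised as a number at all
    if (n < 1 || n > max) return 0;   ! a number, but outside our range
    wn++;                             ! step past the word we have used
    return n;
];
```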

· · · · ·

To return to the naming of objects, the parser normally recognises any arrangement of some or all of the name words of an object as a noun which refers to it: and the more words, the better the match is considered to be. Thus “fried green tomato” is a better match than “fried tomato” or “green tomato” but all three are considered to match. On the other hand, so is “fried green”, and “green green tomato green fried green” is considered a very good match indeed. The method is quick and good at understanding a wide variety of sensible texts, though poor at throwing out foolish ones. (An example of the parser's strategy of being generous rather than strict.) To be more precise, here is what happens when the parser wants to match some text against an object:

1. If the object provides a parse_name routine, ask this routine to determine how good a match there is.
2. If there was no parse_name routine, or if there was but it returned −1, ask the entry point routine ParseNoun, if the game has one, to make the decision.
3. If there was no ParseNoun entry point, or if there was but it returned −1, look at the name of the object and match the longest possible sequence of words given in the name.
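To illustrate the middle step, here is about the smallest ParseNoun entry point imaginable: one that always declines, so that the parser falls through to ordinary matching against name. This is only a sketch. It assumes the entry point is called with the object under consideration as its single argument, and it deliberately leaves wn alone.

```inform6
! A ParseNoun entry point which never decides anything itself.
[ ParseNoun obj;
    obj = obj;     ! mention obj, only to avoid an "unused variable" warning
    return -1;     ! "don't want to decide": fall back on the name property
];
```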

So: a parse_name routine, if provided, is expected to try to match as many words as possible starting from the current position of wn and reading them in one at a time using the NextWord() routine. Thus it must not stop just because the first word makes sense, but must keep reading and find out how many words in a row make sense. It should return:

0	if the text didn't make any sense at all,
k	if k words in a row of the text seem to refer to the object, or
−1	to tell the parser it doesn't want to decide after all.

The word marker wn can be left anywhere afterwards. For example, here is the fried tomato with which this section started:

parse_name [ n colour;
    if (self.ripe) colour = 'red'; else colour = 'green';
    while (NextWord() == 'tomato' or 'fried' or colour) n++;
    return n;
],

The effect of this is that if tomato.ripe is true then the tomato responds to the names “tomato”, “fried” and “red”, and otherwise to “tomato”, “fried” and “green”.
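For context, a complete object built around this routine might look like the sketch below. The ripe property and the short_name routine are assumptions made for illustration; the text above only requires that self.ripe be true once the tomato has reddened.

```inform6
Object tomato "fried tomato"
  with ripe false,                  ! assumed flag: set to true on reddening
       parse_name [ n colour;
           if (self.ripe) colour = 'red'; else colour = 'green';
           while (NextWord() == 'tomato' or 'fried' or colour) n++;
           return n;
       ],
       short_name [;                ! keep the printed name in step
           if (self.ripe) print "fried red tomato";
           else print "fried green tomato";
           rtrue;
       ];
```

Setting tomato.ripe = true; at the appropriate moment then changes both what the tomato is called and what the player may call it, with no need to alter any array.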

As a second example of how parse_name can be useful, suppose you define:

Object -> "fly in amber"
  with name 'fly' 'in' 'amber';

If the player then types “put fly in amber in hole”, the parser will be thrown, because it will think “fly in amber in” is all just naming the object and then it won't know what the word “hole” is doing at the end. However:

Object -> "fly in amber"
  with parse_name [;
           if (NextWord() ~= 'fly' or 'amber') return 0;
           if (NextWord() == 'in' && NextWord() == 'amber')
               return 3;
           return 1;
       ];

Now the word “in” is only recognised as part of the fly's name if it is followed by the word “amber”, and the ambiguity goes away. (“amber in amber” is also recognised, but then it's not worth the bother of excluding.)

▲ parse_name is also used to spot plurals: see §29.

• EXERCISE 71
Rewrite the tomato's parse_name to insist that the adjectives must come before the noun, which must be present.

• EXERCISE 72
Create a musician called Princess who, when kissed, is transformed into “/?%?/ (the artiste formerly known as Princess)”.

• EXERCISE 73
Construct a drinks machine capable of serving cola, coffee or tea, using only one object for the buttons and one for the possible drinks.

• EXERCISE 74
Write a parse_name routine which looks through name in just the way that the parser would have done anyway if there hadn't been a parse_name in the first place.

•▲ EXERCISE 75
Some adventure game parsers split object names into ‘adjectives’ and ‘nouns’, so that only the pattern ‹0 or more adjectives› ‹1 or more nouns› is recognised. Implement this.

• EXERCISE 76
During debugging it sometimes helps to be able to refer to objects by their internal numbers, so that “put object 31 on object 5” would work. Implement this.

•▲ EXERCISE 77
How could the word “#” be made a wild-card, meaning “match any single object”?

•▲▲ EXERCISE 78
And how could “*” be a wild-card for “match any collection of objects”? (Note: you need to have read §29 to answer this.)

• REFERENCES
Straightforward parse_name examples are the chess pieces object and the kittens class of ‘Alice Through the Looking-Glass’. Lengthier ones are found in ‘Balances’, especially in the white cubes class.
• Miron Schmidt's library extension "calyx_adjectives.h", based on earlier work by Andrew Clover, provides for objects to have “adnames” as well as “names”: “adnames” are usually adjectives, and are regarded as being less good matches for an object than “names”. In this system “get string” would take either a string bag or a ball of string, but if both were present would take the ball of string, because “string” is in that case a noun rather than an adjective.