~vdupras/tumbleforth

a4569be53e48237520cbbe7cc7f505d65f8976f5 — Virgil Dupras 10 months ago db92f56
01-duskcc/10-beast: fix tokenizer for newer Dusk

I hadn't revisited this article in a while, but the tokenizer is broken!

[c]? has been changed to cidx and :putback has been replaced with "stepback",
which can't be directly used in this example.
2 files changed, 22 insertions(+), 17 deletions(-)

M 01-duskcc/10-beast.md
M 01-duskcc/10-beast/tok.fs
M 01-duskcc/10-beast.md => 01-duskcc/10-beast.md +19 -15
@@ 89,7 89,7 @@ this:
    f" foo.c"
    console :self file :spit \ Spits the contents of the file to the console

-`console` is a global [structbind][struct][^1] to the [Pipe][io] struct and
+`console` is a global [structbind][struct][^1] to the [IO][io] struct and
`file` is a global structbind to the [File][file] struct.

`console` is not of a particular interest to us at the moment, so we'll set it


@@ 215,22 215,26 @@ challenges to us:
3. When the first character we meet is a symbol, stop accumulating and return
it as a token.

-For the first challenge, let me show you the word `[c]? ( c a u -- idx )`:
+For the first challenge, let me show you the word `cidx ( c a u -- i? f )`:

    create symbols ," +(){};,"
-    : isSym? ( c -- f ) symbols 7 [c]? 0>= ;
+    : isSym? ( c -- f ) symbols 7 cidx dup if nip then ;

-The word `[c]?` finds the index of character `c` in memory range starting at
-address `a` with a length `u`. We've created such a memory range in `symbols`
-above, which allows us to easily determine whether a given character is in that
-range.
+The word `cidx` tries to find the index of character `c` in memory range
+starting at address `a` with a length `u`. We've created such a memory range in
+`symbols` above, which allows us to easily determine whether a given character
+is in that range.

+This `cidx` signature is a bit strange because of the `i?`. This is a word with
+a dynamic stack signature. If the character is found, `f=1` and `i` is the
+found index. Otherwise, `f=0` and `i` *is not present*. Because all we care
+about in `isSym?` is `f`, we conditionally drop the `i`.

For the second challenge, we could save the last character into a variable
instead of dropping it and, on the following `nextt` call, pick it up instead
-of calling `:getc`, but the I/O subsystem *already* has such a system! The
-method is called `:putback ( c hdl -- )`[^6]. When that's called, the next read
-operation will transparently include it. It sounds like nothing, but it saves a
-lot of complicated logic in some situations.
+of calling `:getc`, but since we're dealing with a file, we can also rewind the
+file's position by 1 character, which has the same effect. This is what we do
+with `file pos 1- file :seek`.

For the third challenge, we need to make a symbol check before the loop to
handle this case and not enter the loop when we encounter it.
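
To illustrate the dynamic stack signature described in this hunk, here is a small sketch that wraps `cidx` back into a word with a fixed signature. `symidx` is a hypothetical name, not a word from Dusk or from this commit. It uses only words that already appear in this diff (`symbols`, `cidx`, `not`, `if`/`then`) and assumes `not` turns the "not found" flag into one that `if` treats as true.

    \ Hypothetical helper: index of c in `symbols`, or -1 when c is absent.
    \ Hit:  cidx leaves ( i 1 ), `not` gives ( i 0 ), the branch is skipped, i stays.
    \ Miss: cidx leaves ( 0 ), `not` makes it true, -1 stands in for the missing i.
    : symidx ( c -- i ) symbols 7 cidx not if -1 then ;

With a fixed-width result like this, a caller no longer has to branch on whether `i` is present, at the cost of reserving -1 as a sentinel.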


@@ 243,7 247,8 @@ This could give use a `nextt` that looks like this:
      dup isSym? if swap c! 1 else begin ( a c )
          dup boundary? not while ( a c )
          swap c!+ file :getc repeat ( a c )
-        file :putback ( a ) buf - 1- then ( length )
+        drop file pos 1- file :seek ( a )
+        buf - 1- then ( length )
      dup buf c! ( length ) if buf else 0 then ;

With this new code, `tokstype` will spit the right tokens.
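
The rewind used in `nextt` in this hunk could also be isolated into its own word. This is a minimal sketch: `unread1` is a hypothetical name, and its body is exactly the sequence from the hunk. What `file pos` reports after `:getc` returns -1 at end of file is not stated in this diff, so that case is an open assumption here.

    \ Hypothetical helper: step the file position back by one character,
    \ so that the next `file :getc` reads that character again.
    : unread1 ( -- ) file pos 1- file :seek ;

In `nextt`, the boundary character would still have to be dropped from the stack separately, as the `drop` in the hunk does.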


@@ 272,7 277,7 @@ be used as temporary values. If you need to hold onto them, you need to copy
them into a more permanent area of memory.

You can create your own scratchpad buffer, but for small and very localized
-usage, there's a global scratchpad called `syspad`[^7] which is more convenient
+usage, there's a global scratchpad called `syspad`[^6] which is more convenient
to use. In our case, it will do fine.

All we need to do is to change the final `buf` reference in `nextt` to `buf


@@ 308,8 313,7 @@ more than one, just not through the `file` global structbind.
`-1 SPC <=` is false.
[^5]: I know, some symbols in C contain more than one character, but not in
foo.c. Because our tokenizer is minimal, we can afford to stay simple.
-[^6]: it only works for a single character.
-[^7]: a structbind to the `Scratchpad` struct.
+[^6]: a structbind to the `Scratchpad` struct.

[prev]: 09-dusktillc.html
[upnext]: 11-eye.html

M 01-duskcc/10-beast/tok.fs => 01-duskcc/10-beast/tok.fs +3 -2
@@ 1,6 1,6 @@
: isWS? ( c -- f ) SPC <= ;
create symbols ," +(){};,"
-: isSym? ( c -- f ) symbols 7 [c]? 0>= ;
+: isSym? ( c -- f ) symbols 7 cidx dup if nip then ;
: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;

$40 const MAXTOKSZ \ maximum size for tokens


@@ 12,6 12,7 @@ create buf MAXTOKSZ 1+ allot
  dup isSym? if swap c! 1 else begin ( a c )
      dup boundary? not while ( a c )
      swap c!+ file :getc repeat ( a c )
-    file :putback ( a ) buf - 1- then ( length )
+    drop file pos 1- file :seek ( a )
+    buf - 1- then ( length )
  dup buf c! ( length ) if buf syspad :s, else 0 then ;
: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;