~vdupras/tumbleforth

f79d48bf0a9cc659e55580bca9d108bc5bd31e34 — Virgil Dupras 1 year, 4 months ago 0559f09
Feeding the beast
M 01-duskcc/09-dusktillc.md => 01-duskcc/09-dusktillc.md +3 -0
@@ 303,6 303,8 @@ encoding][x86enc].
In the next article, we’ll set aside our shiny new assembler for a while as we
tackle the first part of a C compiler by building a tokenizer.

*[Next: Feeding the beast][nextup]*

[^1]: PC build takes a little while. This is because I insist on using Dusk’s
tools to build the destination FAT12 filesystem rather than POSIX ones, and
those tools have to run through Dusk’s POSIX VM which is pretty slow. Every


@@ 327,6 329,7 @@ the operands is always “direct”. `reg` is the one.

[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/09-dusktillc.tar.gz
[prev]: 08-immediate.html
[nextup]: 10-beast.html
[dusk]: http://duskos.org/
[iter]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/iter.txt
[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt

A 01-duskcc/10-beast.md => 01-duskcc/10-beast.md +322 -0
@@ 0,0 1,322 @@
# [Tumble Forth](/): Feeding the beast

[Now that we have our assembler][prev], we can begin tackling C directly. You'll
remember that this is the code we want to compile:

    int foo(int a, int b) {
        return a + b;
    }

The first step in the compilation process is to separate this stream into a
series of tokens, which will then be fed to the parser.

Tokenization in C is slightly more complex than tokenization in Forth:
whitespaces **and** symbols are token boundaries. For example, `"return a+b;"`
yields tokens `"return"`, `"a"`, `"+"`, `"b"` and `";"`.

You'll see in this article that the actual logic for doing so it almost trivial,
hardly enough to be worth a whole article. However, to build this API, we'll
need to leverage mechanisms in Dusk that are not trivial and this article will
walk you through them.

The logic we build in this article is CPU independent. Unlike the previous
article, it's not required to run Dusk with `make pcrun`, `make run` will also
work. It's a bit more convenient to use under POSIX because it's faster to build
and you can paste text in the terminal. It will pick up files in the `fs`
subfolder in the same way as `make pcrun`.

## Strings

Our tokenizer will yield tokens as [strings][usage]. Strings are a series of
contiguous bytes in memory preceded by a "length" field, which is one byte in
size. When we refer to a string, we refer to the address of the "length" field.

The `S"` word allows you to create a string literal. It's limited by Forth
tokenization rules, which means that it must always be followed by a whitespace,
which will not be included in the string. For example, `S" hello"` yields an
address in memory that will look like this:

     A+0  A+1  A+2  A+3  A+4  A+5
    +-----------------------------+
    | 05 | 68 | 65 | 6C | 6C | 6F |
    +-----------------------------+

Dusk doesn't bother with implementing `strlen`, as it's exactly the same as
`c@`. It's also frequent to use strings with `c@+ ( addr -- addr+1 c )` to
process strings. Example:

    : foo ( str -- ) c@+ for ( a ) c@+ emit next ( a ) drop ;
    S" hello" foo \ prints "hello", exactly like the "stype" word

## Stack comments

The snippet above also includes a frequent pattern in Dusk code. First, words
are documented with a signature comment. This one indicates that it expects a
string to live on PS top and that the word will consume it.

We also see "stack status" comments, to help following along the code. For
example, after `c@+` is called, PS becomes `( a length )`, with `a` being the
address of the first character of the string. Because the `for` consumes
`length`, the loop body ends up with a stack status of `( a )`.

To avoid making words too heavy, we don't bother commenting the stack status at
each step. However, Dusk has the habit of documenting stack status at particular
points:

1. Right after the beginning of the loop
2. Sometimes at the end of a loop. Especially in cases where there is more than
   one exit point for the loop, for example in `begin..while..repeat`, in which
   exit points can have different stack status!
3. At tricky points in big complicated words (of which there should be as few as
   possible).

## Crossing the streams

Our tokenizer needs to feed itself from some kind of stream. You *could* feed
yourself directly from the main stream using the `in< ( -- c )` word. Here's an
example usage:

    : foo in< emit ;
    foo A \ the A character is not interpreted by Forth, but by "foo"

This could work, but you'll spend too much time copy/pasting C code around. A
more convenient way would be to use Dusk's [File API][file], which itself rests
upon the [IO API][io].

To try it, write your C file in Dusk's `fs/` directory under `foo.c` and then do
this:

    f" foo.c"
    console :self file :spit \ Spits the contents of the file to the console

`console` is a global [structbind][struct][^1] to the [Pipe][io] struct and
`file` is a global structbind to the [File][file] struct.

`console` is not of a particular interest to us at the moment, so we'll set it
aside, but `file` is. `file` represents Dusk's "current work file". There's only
one at once and we can make it open a particular path with the `f"` word, as
shown above. Whenever we open a new file, the old one is closed[^2].

`file` obeys Dusk's I/O semantics which allows to seamlessly carry streams of
bytes left and right. Any method from the `IO` struct can be called on `file`.
Our tokenizer will be consuming the stream character by character, so the I/O
method we'll want to use is `:getc ( hdl -- c )`. `hdl`, a reference to the I/O
struct, is provided automatically by structbinds, so for us, the effective word
signature is `( -- c )`.

The `:getc` word yields the character at current stream position and then
advance this position by 1. If the end of the stream has been reached, -1 is
returned. Therefore, we can reimplement (badly), the `:spit` method above thus:

    : myspit ( -- )
      begin ( ) file :getc dup -1 <> while ( c ) emit repeat ( c ) drop ;
    f" foo.c"
    myspit

You'll notice that if you call `myspit` twice, the second time doesn't print
anything. That's because the file's position is at the end of it. To rewind it,
you can do `0 file :seek`.

## Tokenizer API

What we'll want to build is a word that, when called, yields either the next
token, as a string, from `file`. If we've reached the end of the file, we yield
zero. Let's call this word `nextt ( -- str-or-0 )` (for "next token").

Let's start roughly and yield our stream line by line, leveraging
`IO :readline`:

    : nextt ( -- str-or-0 ) file :readline ;
    f" foo.c"
    nextt stype \ prints "int foo(int a, int b) {"
    nextt stype \ prints "    return a + b;"
    nextt stype \ prints "}"
    nextt .     \ prints 0

Now, all we have to do is to refine the process until our goal is reached.

## Splitting whitespace

`:readline` is a nice little trick to start out, but we'll need to do like we
did in our baby Forth and accumulate characters into a buffer. We *could*
imagine some kind of algorithm that yields substrings from the line yielded by
`:readline`, but we'd need to hold onto that line in between `nextt` calls
*and* find a way to insert token lengths in there. Much more complicated than
simply accumulating from characters, so let's do that. First, what's a
whitespace? Let's start a new `tok.fs` unit with this utility:

    : isWS? ( c -- f ) SPC <= ;

`SPC` is a system constant for $20. In signatures, `f` means "flag" (0 or 1).

Then, all we need is a buffer and a loop:

    $40 const MAXTOKSZ \ maximum size for tokens
    create buf MAXTOKSZ 1+ allot

    : tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
    : nextt ( -- str-or-0 )
      buf 1+ tonws begin ( a c )
        dup isWS? not over -1 <> and while ( a c ) \ while not WS or EOF
        swap c!+ file :getc repeat ( a c )
      drop ( a ) buf - 1- ( length )
      dup buf c! ( length ) if buf else 0 then ;
    : tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;

You now can print all tokens from `foo.fs` with `f<< tok.fs f" foo.c"
tokstype`.

As with our baby Forth word accumulator, we have two loops in this code.

The first one, `tonws`[^3] simply consumes the file until a non-whitespace is
encountered. Because this loop requires us to read a non-whitespace character,
we need the word to yield it so it can be used later.

This loop is not a `begin..until`, as covered in Starting Forth, but rather a
`begin..while..repeat` one, which isn't covered (but is frequent among Forth
implementations). It works in a similar manner, but allows the loop condition to
be in the middle of the body, often simplifying it. For example, the same word
with `begin..until` would look like this:

    : tonws ( -- c ) begin file :getc dup isWS? if drop 0 else 1 then until ;

The second loop, within `nextt` itself, is the main accumulating loop. We see
that the loop begins with `buf+1` as a starting address, the `+1` begin for
leaving space for the string count byte, which we'll set later.

The `while` condition is more complex than in `tonws` because we explicitly
have to check for EOF[^4]. To be able to `"and"` two conditions together, we
have to juggle with `PS` a little bit.

Then comes the accumulation part, which is straighforward when we use `c!+ ( c
a -- a+1 )`.

When the loop exits, `c` will be either a whitespace or -1, for which we have
no use and drop. At that point, `a` points to the empty space following our
last character, allowing us to easily compute the length of our accumulated
string, which is the last thing we need to do.

Finally, we need to check if we've accumulated something at all. If our result
length is zero, this means that we've read zero or more whitespaces followed
directly by EOF, so we have nothing to return. If it's not, we write the length
to the first byte of `buf` and return it as the result.

Our last word, `tokstype` repeatedly calls `nextt` and print the result until
exhausted, so that we can test our tokenizer.

## Tokenizing symbols

To have a proper tokenizer, we need to not only split by whitespace, but also
by symbols, which unlike whitespaces are also tokens. This presents new
challenges to us:

1. Stop accumulating a token when we encounter a symbol character[^5].
2. Don't drop that character like we do with the whitespace.
3. When the first character we meet is a symbol, stop accumulating and return
it as a token.

For the first challenge, let me show you the word `[c]? ( c a u -- idx )`:

    create symbols ," +(){};,"
    : isSym? ( c -- f ) symbols 7 [c]? 0>= ;

The word `[c]?` finds the index of character `c` in memory range starting at
address `a` with a length `u`. We've created such a memory range in `symbols`
above, which allows us to easily determine whether a given character is in that
range. 

For the second challenge, we could save the last character into a variable
instead of dropping it and, on the following `nextt` call, pick it up instead
of calling `:getc`, but the I/O subsystem *already* has such a system! The
method is called `:putback ( c hdl -- )`[^6]. When that's called, the next read
operation will transparently include it. It sounds like nothing, but it saves a
lot of complicated logic in some situations.

For the third challenge, we need to make a symbol check before the loop to
handle this case and not enter the loop when we encounter it.

This could give use a `nextt` that looks like this:

    : boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
    : nextt ( -- str-or-0 )
      buf 1+ tonws ( a c )
      dup isSym? if swap c! 1 else begin ( a c )
          dup boundary? not while ( a c )
          swap c!+ file :getc repeat ( a c )
        file :putback ( a ) buf - 1- then ( length )
      dup buf c! ( length ) if buf else 0 then ;

With this new code, `tokstype` will spit the right tokens.

## Scratchpads

Does this mean we're finished? Not yet because our tokenizer has one big
problem:

    f<< tok.fs
    f" foo.c"
    nextt nextt stype stype

*What?* twice the second token? Yup. That's because we use a static buffer.
Calling `nextt` always overwrites the previous result.

The parser will often have to keep a handful of tokens in memory at once and
juggle with them a little bit, so this won't do.

To solve this problem, we're going to use one of Dusk's [dynamic memory
allocator][alloc], the [scratchpad][scratch]. A scratchpad is a rolling buffer.
We allocate stuff to it and when it reaches the end of its buffer, it goes back
to the beginning. It has the advantage of being simple to use because we never
have to free what we allocate. The drawback is that these memory areas can only
be used as temporary values. If you need to hold onto them, you need to copy
them into a more permanent area of memory.

You can create your own scratchpad buffer, but for small and very localized
usage, there's a global scratchpad called `syspad`[^7] which is more convenient
to use. In our case, it will do fine.

All we need to do is to change the final `buf` reference in `nextt` to `buf
syspad :s,`, that is, a method that allocates enough space for the supplied
string, copies it into the scratchpad, and return the address of the newly
allocated area.

That's it! we have a functional tokenizer that will be good enough for `foo.c`!

## Exercises

1. Did you notice that we don't check for `buf` overflow? How would you go about
implementing such a check?
2. We're lucky because none of the symbols we use in `foo.c` are more than a
single character. But if we had `>>=` in there, the tokenization logic would
change significantly. Wanna try?
3. Try handling comments.

## Up next

In the next and last chapter of this story arc, we build a parser and, with the
help of the assembler and tokenizer, will be able to compile the `foo` function
to an executable i386 word!

[^1]: I won't be explaining the struct system in details in this article. You're
welcome to read the docs about it, of course, but it's not critical. Suffice it
to say that when we call a structbind word, we "activate" its struct namespace,
so the next word is a kind of "method" call.
[^2]: this doesn't mean that Dusk can only open one file at once. It can open
more than one, just not through the `file` global structbind.
[^3]: To non-whitespace
[^4]: which is *not* a whitespace because all numbers in Dusk are unsigned. So
`-1 SPC <=` is false.
[^5]: I know, some symbols in C contain more than one character, but not in
foo.c. Because our tokenizer is minimal, we can afford to stay simple.
[^6]: it only works for a single character.
[^7]: a structbind to the `Scratchpad` struct.

[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/10-beast.tar.gz
[prev]: 09-dusktillc.html
[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt
[struct]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/struct.txt
[io]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/io.txt
[file]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/file.txt
[alloc]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/alloc.txt
[scratch]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/scratch.txt


A 01-duskcc/10-beast/foo.c => 01-duskcc/10-beast/foo.c +3 -0
@@ 0,0 1,3 @@
int foo(int a, int b) {
    return a + b;
}

A 01-duskcc/10-beast/myasm.fs => 01-duskcc/10-beast/myasm.fs +12 -0
@@ 0,0 1,12 @@
0 const eax
1 const ecx
2 const edx
3 const ebx
4 const esp
5 const ebp
6 const esi
7 const edi

: ret, $c3 c, ;
: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ;
: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ;

A 01-duskcc/10-beast/tok.fs => 01-duskcc/10-beast/tok.fs +17 -0
@@ 0,0 1,17 @@
: isWS? ( c -- f ) SPC <= ;
create symbols ," +(){};,"
: isSym? ( c -- f ) symbols 7 [c]? 0>= ;
: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;

$40 const MAXTOKSZ \ maximum size for tokens
create buf MAXTOKSZ 1+ allot

: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
: nextt ( -- str-or-0 )
  buf 1+ tonws ( a c )
  dup isSym? if swap c! 1 else begin ( a c )
      dup boundary? not while ( a c )
      swap c!+ file :getc repeat ( a c )
    file :putback ( a ) buf - 1- then ( length )
  dup buf c! ( length ) if buf syspad :s, else 0 then ;
: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;

M 01-duskcc/main.css => 01-duskcc/main.css +8 -0
@@ 18,3 18,11 @@ table td, table th {
blockquote {
  font-style: italic;
}

code {
  background-color: #eeeeee
}

pre code {
  display: block;
}

M Makefile => Makefile +2 -1
@@ 8,7 8,8 @@ ARTICLES_WITH_TGZ = \
	01-duskcc/06-taletwostacks \
	01-duskcc/07-babywalk \
	01-duskcc/08-immediate \
	01-duskcc/09-dusktillc
	01-duskcc/09-dusktillc \
	01-duskcc/10-beast
ARTICLES = \
	$(ARTICLES_WITH_TGZ)


M index.html => index.html +15 -2
@@ 54,7 54,19 @@ before you receive the bundle.</p>
<h3>Buckle up, Dorothy (in progress)</h3>

<p>
My <a href="01-duskcc/01-buckleup.html">“pilot” story arc</a> is on the subject of Dusk OS’ C compiler.
In my <a href="01-duskcc/01-buckleup.html">“pilot” story arc</a>, we peek in
disgust in the abyss of modern software complexity and escape this dystopia by
tumbling down the rabbit hole of low level development.
</p>
<p>
Starting from bare metal on the PC platform, we build a Forth from scratch, then
switch to <a href="http://duskos.org">Dusk OS</a> and then build a partial C
compiler (just enough to compile our example code), again from scratch.
</p>
<p>
The "teaser" part of this story arc is a rather large part of the whole, but
that's because it's the "build a Forth" part and I believe that everyone should
do that, so I want to keep it openly accessible.
</p>

<p>Table of Contents</p>


@@ 68,7 80,8 @@ My <a href="01-duskcc/01-buckleup.html">“pilot” story arc</a> is on the subj
    <li><a href="01-duskcc/07-babywalk.html">Baby's first steps</a></li>
    <li><a href="01-duskcc/08-immediate.html">The Unbearable Immediateness of Compiling</a></li>
    <li>From Dusk Till C</li>
    <li>... to write ...</li>
    <li>Feeding the beast</li>
    <li>In the eye of the compiler (not written yet)</li>
</ol>

<script async

M main.css => main.css +8 -0
@@ 18,3 18,11 @@ table td, table th {
blockquote {
  font-style: italic;
}

code {
  background-color: #eeeeee
}

pre code {
  display: block;
}