From f79d48bf0a9cc659e55580bca9d108bc5bd31e34 Mon Sep 17 00:00:00 2001
From: Virgil Dupras
Date: Thu, 4 May 2023 21:15:04 -0400
Subject: [PATCH] Feeding the beast
---
01-duskcc/09-dusktillc.md | 3 +
01-duskcc/10-beast.md | 322 ++++++++++++++++++++++++++++++++++++
01-duskcc/10-beast/foo.c | 3 +
01-duskcc/10-beast/myasm.fs | 12 ++
01-duskcc/10-beast/tok.fs | 17 ++
01-duskcc/main.css | 8 +
Makefile | 3 +-
index.html | 17 +-
main.css | 8 +
9 files changed, 390 insertions(+), 3 deletions(-)
create mode 100644 01-duskcc/10-beast.md
create mode 100644 01-duskcc/10-beast/foo.c
create mode 100644 01-duskcc/10-beast/myasm.fs
create mode 100644 01-duskcc/10-beast/tok.fs
diff --git a/01-duskcc/09-dusktillc.md b/01-duskcc/09-dusktillc.md
index 1240431..9613287 100644
--- a/01-duskcc/09-dusktillc.md
+++ b/01-duskcc/09-dusktillc.md
@@ -303,6 +303,8 @@ encoding][x86enc].
In the next article, we’ll set aside our shiny new assembler for a while as we
tackle the first part of a C compiler by building a tokenizer.
+*[Next: Feeding the beast][nextup]*
+
[^1]: PC build takes a little while. This is because I insist on using Dusk’s
tools to build the destination FAT12 filesystem rather than POSIX ones, and
those tools have to run through Dusk’s POSIX VM which is pretty slow. Every
@@ -327,6 +329,7 @@ the operands is always “direct”. `reg` is the one.
[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/09-dusktillc.tar.gz
[prev]: 08-immediate.html
+[nextup]: 10-beast.html
[dusk]: http://duskos.org/
[iter]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/iter.txt
[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt
diff --git a/01-duskcc/10-beast.md b/01-duskcc/10-beast.md
new file mode 100644
index 0000000..2196705
--- /dev/null
+++ b/01-duskcc/10-beast.md
@@ -0,0 +1,322 @@
+# [Tumble Forth](/): Feeding the beast
+
+[Now that we have our assembler][prev], we can begin tackling C directly. You'll
+remember that this is the code we want to compile:
+
+ int foo(int a, int b) {
+ return a + b;
+ }
+
+The first step in the compilation process is to separate this stream into a
+series of tokens, which will then be fed to the parser.
+
+Tokenization in C is slightly more complex than tokenization in Forth:
+whitespaces **and** symbols are token boundaries. For example, `"return a+b;"`
+yields tokens `"return"`, `"a"`, `"+"`, `"b"` and `";"`.
+
+You'll see in this article that the actual logic for doing so is almost trivial,
+hardly enough to be worth a whole article. However, to build this API, we'll
+need to leverage mechanisms in Dusk that are not trivial and this article will
+walk you through them.
+
+The logic we build in this article is CPU independent. Unlike in the previous
+article, you're not required to run Dusk with `make pcrun`: `make run` will also
+work. It's a bit more convenient to use under POSIX because it's faster to build
+and you can paste text in the terminal. It will pick up files in the `fs`
+subfolder in the same way as `make pcrun`.
+
+## Strings
+
+Our tokenizer will yield tokens as [strings][usage]. Strings are a series of
+contiguous bytes in memory preceded by a "length" field, which is one byte in
+size. When we refer to a string, we refer to the address of the "length" field.
+
+The `S"` word allows you to create a string literal. It's limited by Forth
+tokenization rules, which means that it must always be followed by a whitespace,
+which will not be included in the string. For example, `S" hello"` yields an
+address in memory that will look like this:
+
+ A+0 A+1 A+2 A+3 A+4 A+5
+ +-----------------------------+
+ | 05 | 68 | 65 | 6C | 6C | 6F |
+ +-----------------------------+
+
+Dusk doesn't bother with implementing `strlen`, as it's exactly the same as
+`c@`. It's also common to process strings with `c@+ ( addr -- addr+1 c )`.
+Example:
+
+ : foo ( str -- ) c@+ for ( a ) c@+ emit next ( a ) drop ;
+ S" hello" foo \ prints "hello", exactly like the "stype" word
+
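+For readers more at home in C, here is a rough model of the same layout. It is
+only a sketch: `hello`, `counted_len` and `counted_to_c` are names made up for
+this illustration and are not part of Dusk.
+
```c
#include <string.h>

/* A counted string: byte 0 holds the length, the characters follow.
   This mirrors the memory layout of S" hello" shown above. */
static const unsigned char hello[] = { 5, 'h', 'e', 'l', 'l', 'o' };

/* Model of c@ applied to a string address: the length is the first byte. */
static unsigned counted_len(const unsigned char *str) {
    return str[0];
}

/* Copy a counted string into a NUL-terminated C buffer.
   dst must have room for counted_len(str) + 1 bytes. */
static void counted_to_c(const unsigned char *str, char *dst) {
    unsigned len = counted_len(str);
    memcpy(dst, str + 1, len);
    dst[len] = '\0';
}
```
+
+Note how there is no scan for a terminator: the length is known up front,
+which is why `c@` can stand in for `strlen`.
+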
+## Stack comments
+
+The snippet above also includes a frequent pattern in Dusk code. First, words
+are documented with a signature comment. This one indicates that it expects a
+string to live on PS top and that the word will consume it.
+
+We also see "stack status" comments, to help follow along with the code. For
+example, after `c@+` is called, PS becomes `( a length )`, with `a` being the
+address of the first character of the string. Because the `for` consumes
+`length`, the loop body ends up with a stack status of `( a )`.
+
+To avoid making words too heavy, we don't bother commenting the stack status at
+each step. However, Dusk has the habit of documenting stack status at particular
+points:
+
+1. Right after the beginning of a loop
+2. Sometimes at the end of a loop. Especially in cases where there is more than
+ one exit point for the loop, for example in `begin..while..repeat`, in which
+ exit points can have different stack status!
+3. At tricky points in big complicated words (of which there should be as few as
+ possible).
+
+## Crossing the streams
+
+Our tokenizer needs to feed itself from some kind of stream. You *could* feed
+yourself directly from the main stream using the `in< ( -- c )` word. Here's an
+example usage:
+
+ : foo in< emit ;
+ foo A \ the A character is not interpreted by Forth, but by "foo"
+
+This could work, but you'll spend too much time copy/pasting C code around. A
+more convenient way would be to use Dusk's [File API][file], which itself rests
+upon the [IO API][io].
+
+To try it, write your C file in Dusk's `fs/` directory under `foo.c` and then do
+this:
+
+ f" foo.c"
+ console :self file :spit \ Spits the contents of the file to the console
+
+`console` is a global [structbind][struct][^1] to the [Pipe][io] struct and
+`file` is a global structbind to the [File][file] struct.
+
+`console` is not of particular interest to us at the moment, so we'll set it
+aside, but `file` is. `file` represents Dusk's "current work file". There's only
+one at a time and we can make it open a particular path with the `f"` word, as
+shown above. Whenever we open a new file, the old one is closed[^2].
+
+`file` obeys Dusk's I/O semantics, which allow us to seamlessly carry streams of
+bytes left and right. Any method from the `IO` struct can be called on `file`.
+Our tokenizer will be consuming the stream character by character, so the I/O
+method we'll want to use is `:getc ( hdl -- c )`. `hdl`, a reference to the I/O
+struct, is provided automatically by structbinds, so for us, the effective word
+signature is `( -- c )`.
+
+The `:getc` word yields the character at the current stream position and then
+advances this position by 1. If the end of the stream has been reached, -1 is
+returned. Therefore, we can reimplement (badly) the `:spit` method above thus:
+
+ : myspit ( -- )
+ begin ( ) file :getc dup -1 <> while ( c ) emit repeat ( c ) drop ;
+ f" foo.c"
+ myspit
+
+You'll notice that if you call `myspit` twice, the second time doesn't print
+anything. That's because the file's position is at the end of it. To rewind it,
+you can do `0 file :seek`.
+
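+The read-until-EOF-then-rewind behavior can be modeled in C as follows. This
+is a sketch under assumptions: `Stream`, `stream_getc`, `stream_seek` and
+`drain` are hypothetical stand-ins for `file`, `:getc`, `:seek` and `myspit`,
+working on an in-memory buffer rather than a real file.
+
```c
#include <stddef.h>

/* Hypothetical in-memory stand-in for Dusk's `file`. */
typedef struct { const char *data; size_t len, pos; } Stream;

/* Yield the character at the current position and advance by 1,
   or yield -1 once the end of the stream has been reached. */
static int stream_getc(Stream *s) {
    return s->pos < s->len ? (unsigned char)s->data[s->pos++] : -1;
}

/* Move the read position, like `0 file :seek` does for a rewind. */
static void stream_seek(Stream *s, size_t pos) { s->pos = pos; }

/* Same loop shape as myspit: read until -1, copying each character.
   out must be large enough. Returns the number of characters copied. */
static size_t drain(Stream *s, char *out) {
    size_t n = 0;
    int c;
    while ((c = stream_getc(s)) != -1) out[n++] = (char)c;
    return n;
}
```
+
+Calling `drain` a second time copies nothing until `stream_seek(&s, 0)` moves
+the position back to the start, which is the same effect seen with `myspit`.
+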
+## Tokenizer API
+
+What we want to build is a word that, when called, yields the next token, as a
+string, from `file`. If we've reached the end of the file, it yields zero
+instead. Let's call this word `nextt ( -- str-or-0 )` (for "next token").
+
+Let's start roughly and yield our stream line by line, leveraging
+`IO :readline`:
+
+ : nextt ( -- str-or-0 ) file :readline ;
+ f" foo.c"
+ nextt stype \ prints "int foo(int a, int b) {"
+ nextt stype \ prints " return a + b;"
+ nextt stype \ prints "}"
+ nextt . \ prints 0
+
+Now, all we have to do is to refine the process until our goal is reached.
+
+## Splitting whitespace
+
+`:readline` is a nice little trick to start out, but we'll need to do like we
+did in our baby Forth and accumulate characters into a buffer. We *could*
+imagine some kind of algorithm that yields substrings from the line yielded by
+`:readline`, but we'd need to hold onto that line in between `nextt` calls
+*and* find a way to insert token lengths in there. Much more complicated than
+simply accumulating from characters, so let's do that. First, what's a
+whitespace? Let's start a new `tok.fs` unit with this utility:
+
+ : isWS? ( c -- f ) SPC <= ;
+
+`SPC` is a system constant for $20. In signatures, `f` means "flag" (0 or 1).
+
+Then, all we need is a buffer and a loop:
+
+ $40 const MAXTOKSZ \ maximum size for tokens
+ create buf MAXTOKSZ 1+ allot
+
+ : tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
+ : nextt ( -- str-or-0 )
+ buf 1+ tonws begin ( a c )
+ dup isWS? not over -1 <> and while ( a c ) \ while not WS or EOF
+ swap c!+ file :getc repeat ( a c )
+ drop ( a ) buf - 1- ( length )
+ dup buf c! ( length ) if buf else 0 then ;
+ : tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;
+
+You can now print all tokens from `foo.c` with `f<< tok.fs f" foo.c"
+tokstype`.
+
+As with our baby Forth word accumulator, we have two loops in this code.
+
+The first one, `tonws`[^3] simply consumes the file until a non-whitespace is
+encountered. Because this loop requires us to read a non-whitespace character,
+we need the word to yield it so it can be used later.
+
+This loop is not a `begin..until`, as covered in Starting Forth, but rather a
+`begin..while..repeat` one, which isn't covered (but is frequent among Forth
+implementations). It works in a similar manner, but allows the loop condition to
+be in the middle of the body, often simplifying it. For example, the same word
+with `begin..until` would look like this:
+
+ : tonws ( -- c ) begin file :getc dup isWS? if drop 0 else 1 then until ;
+
+The second loop, within `nextt` itself, is the main accumulating loop. We see
+that the loop begins with `buf+1` as a starting address, the `+1` being for
+leaving space for the string count byte, which we'll set later.
+
+The `while` condition is more complex than in `tonws` because we explicitly
+have to check for EOF[^4]. To be able to `"and"` two conditions together, we
+have to juggle with `PS` a little bit.
+
+Then comes the accumulation part, which is straightforward when we use `c!+ ( c
+a -- a+1 )`.
+
+When the loop exits, `c` will be either a whitespace or -1, for which we have
+no use and drop. At that point, `a` points to the empty space following our
+last character, allowing us to easily compute the length of our accumulated
+string, which is the last thing we need to do.
+
+Finally, we write the length to the first byte of `buf` and check whether we've
+accumulated something at all. If the length is zero, this means that we've read
+zero or more whitespaces followed directly by EOF, so we have nothing to return
+and yield 0. Otherwise, we return `buf` as the result.
+
+Our last word, `tokstype`, repeatedly calls `nextt` and prints the result until
+exhausted, so that we can test our tokenizer.
+
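+To make the two loops easier to follow, here is the same whitespace-only
+algorithm transcribed into C. It is a model for illustration, not Dusk code:
+the `Stream` type and `stream_getc` are hypothetical stand-ins for `file` and
+`:getc`, and, like the Forth version, it does not check for `buf` overflow.
+
```c
#include <stddef.h>

#define MAXTOKSZ 0x40  /* maximum size for tokens */

/* Hypothetical in-memory stand-in for Dusk's `file`. */
typedef struct { const char *data; size_t len, pos; } Stream;

static int stream_getc(Stream *s) {
    return s->pos < s->len ? (unsigned char)s->data[s->pos++] : -1;
}

/* EOF (-1) is deliberately not whitespace, matching footnote 4. */
static int is_ws(int c) { return c >= 0 && c <= ' '; }

static unsigned char buf[MAXTOKSZ + 1];  /* length byte + characters */

/* Yield the next token as a counted string, or NULL at end of stream. */
static unsigned char *nextt(Stream *s) {
    unsigned char *a = buf + 1;  /* leave room for the length byte */
    int c;
    do { c = stream_getc(s); } while (is_ws(c));  /* tonws */
    while (!is_ws(c) && c != -1) {                /* accumulate */
        *a++ = (unsigned char)c;
        c = stream_getc(s);
    }
    buf[0] = (unsigned char)(a - buf - 1);        /* length */
    return buf[0] ? buf : NULL;
}
```
+
+The structure is the same as in the Forth words: one loop to skip leading
+whitespace, one loop to accumulate, then the length computed from the final
+address.
+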
+## Tokenizing symbols
+
+To have a proper tokenizer, we need to not only split by whitespace, but also
+by symbols, which unlike whitespaces are also tokens. This presents new
+challenges to us:
+
+1. Stop accumulating a token when we encounter a symbol character[^5].
+2. Don't drop that character like we do with the whitespace.
+3. When the first character we meet is a symbol, stop accumulating and return
+it as a token.
+
+For the first challenge, let me show you the word `[c]? ( c a u -- idx )`:
+
+ create symbols ," +(){};,"
+ : isSym? ( c -- f ) symbols 7 [c]? 0>= ;
+
+The word `[c]?` finds the index of character `c` in memory range starting at
+address `a` with a length `u`. We've created such a memory range in `symbols`
+above, which allows us to easily determine whether a given character is in that
+range.
+
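+In C terms, this lookup behaves roughly like the sketch below. The names
+`index_of` and `is_sym` are invented for this illustration, and returning -1
+for "not found" is an assumption inferred from the `0>=` test in `isSym?`.
+
```c
#include <string.h>

/* Model of [c]?: index of c within the u bytes starting at a,
   or -1 when c is absent (assumed from the 0>= test in isSym?). */
static int index_of(int c, const char *a, size_t u) {
    const char *p = memchr(a, c, u);
    return p ? (int)(p - a) : -1;
}

static const char symbols[] = "+(){};,";

/* Model of isSym?: true when c is one of the 7 symbol characters. */
static int is_sym(int c) {
    return c != -1 && index_of(c, symbols, 7) >= 0;
}
```
+
+For instance, `index_of('+', symbols, 7)` yields 0 and `index_of('a', symbols,
+7)` yields -1.
+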
+For the second challenge, we could save the last character into a variable
+instead of dropping it and, on the following `nextt` call, pick it up instead
+of calling `:getc`, but the I/O subsystem *already* has such a system! The
+method is called `:putback ( c hdl -- )`[^6]. When that's called, the next read
+operation will transparently include it. It sounds like nothing, but it saves a
+lot of complicated logic in some situations.
+
+For the third challenge, we need to make a symbol check before the loop to
+handle this case and not enter the loop when we encounter it.
+
+This could give us a `nextt` that looks like this:
+
+ : boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
+ : nextt ( -- str-or-0 )
+ buf 1+ tonws ( a c )
+ dup isSym? if swap c! 1 else begin ( a c )
+ dup boundary? not while ( a c )
+ swap c!+ file :getc repeat ( a c )
+ file :putback ( a ) buf - 1- then ( length )
+ dup buf c! ( length ) if buf else 0 then ;
+
+With this new code, `tokstype` will spit the right tokens.
+
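+Here is a C transcription of that symbol-aware logic, as a sketch only. The
+`Stream` type with its one-character `held` slot is a hypothetical model of
+`:getc` and `:putback`, not how Dusk's I/O subsystem is actually implemented.
+
```c
#include <stddef.h>
#include <string.h>

#define MAXTOKSZ 0x40
#define NOTHING_HELD (-2)  /* marker for an empty putback slot */

/* Hypothetical in-memory stream with a one-character putback slot. */
typedef struct { const char *data; size_t len, pos; int held; } Stream;

static int stream_getc(Stream *s) {
    if (s->held != NOTHING_HELD) {  /* serve the put-back character first */
        int c = s->held;
        s->held = NOTHING_HELD;
        return c;
    }
    return s->pos < s->len ? (unsigned char)s->data[s->pos++] : -1;
}

static void stream_putback(Stream *s, int c) { s->held = c; }

static int is_ws(int c) { return c >= 0 && c <= ' '; }
static int is_sym(int c) { return c > 0 && strchr("+(){};,", c) != NULL; }
static int boundary(int c) { return is_ws(c) || c == -1 || is_sym(c); }

static unsigned char buf[MAXTOKSZ + 1];

/* Yield the next token as a counted string, or NULL at end of stream. */
static unsigned char *nextt(Stream *s) {
    unsigned char *a = buf + 1;
    int c;
    do { c = stream_getc(s); } while (is_ws(c));  /* tonws */
    if (is_sym(c)) {
        *a++ = (unsigned char)c;  /* a symbol is a token by itself */
    } else {
        while (!boundary(c)) {
            *a++ = (unsigned char)c;
            c = stream_getc(s);
        }
        stream_putback(s, c);     /* keep the boundary for the next call */
    }
    buf[0] = (unsigned char)(a - buf - 1);
    return buf[0] ? buf : NULL;
}
```
+
+Fed with `"return a+b;"`, successive calls yield `return`, `a`, `+`, `b` and
+`;`, matching the example from the introduction. A stream must be initialized
+with `held` set to `NOTHING_HELD`.
+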
+## Scratchpads
+
+Does this mean we're finished? Not yet because our tokenizer has one big
+problem:
+
+ f<< tok.fs
+ f" foo.c"
+ nextt nextt stype stype
+
+*What?* twice the second token? Yup. That's because we use a static buffer.
+Calling `nextt` always overwrites the previous result.
+
+The parser will often have to keep a handful of tokens in memory at once and
+juggle with them a little bit, so this won't do.
+
+To solve this problem, we're going to use one of Dusk's [dynamic memory
+allocators][alloc], the [scratchpad][scratch]. A scratchpad is a rolling buffer.
+We allocate stuff to it and when it reaches the end of its buffer, it goes back
+to the beginning. It has the advantage of being simple to use because we never
+have to free what we allocate. The drawback is that these memory areas can only
+be used as temporary values. If you need to hold onto them, you need to copy
+them into a more permanent area of memory.
+
+You can create your own scratchpad buffer, but for small and very localized
+usage, there's a global scratchpad called `syspad`[^7] which is more convenient
+to use. In our case, it will do fine.
+
+All we need to do is to change the final `buf` reference in `nextt` to `buf
+syspad :s,`, that is, a method that allocates enough space for the supplied
+string, copies it into the scratchpad, and returns the address of the newly
+allocated area.
+
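+The rolling behavior can be pictured with a small C model. This is a sketch of
+the general idea only: `Scratchpad`, `pad_s` and the tiny `PADSZ` are made up
+for the illustration, and Dusk's actual scratchpad may differ in its details.
+
```c
#include <stddef.h>
#include <string.h>

#define PADSZ 16  /* deliberately tiny so the wrap-around is easy to see */

/* Hypothetical rolling buffer: allocations advance pos and are never freed. */
typedef struct { unsigned char mem[PADSZ]; size_t pos; } Scratchpad;

/* Copy a counted string into the pad and return the address of the copy.
   When the string doesn't fit at the end, roll back to the beginning,
   overwriting whatever was allocated there earlier. */
static unsigned char *pad_s(Scratchpad *p, const unsigned char *str) {
    size_t n = (size_t)str[0] + 1;        /* length byte + characters */
    unsigned char *dst;
    if (n > PADSZ) return NULL;           /* can never fit */
    if (p->pos + n > PADSZ) p->pos = 0;   /* roll over */
    dst = p->mem + p->pos;
    memcpy(dst, str, n);
    p->pos += n;
    return dst;
}
```
+
+Two consecutive allocations live at different addresses, so two tokens can be
+held at once. After enough allocations, though, the pad rolls over and an old
+address ends up holding a newer string, which is why these values are only
+good as temporaries.
+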
+That's it! We have a functional tokenizer that will be good enough for `foo.c`!
+
+## Exercises
+
+1. Did you notice that we don't check for `buf` overflow? How would you go about
+implementing such a check?
+2. We're lucky because none of the symbols we use in `foo.c` are more than a
+single character. But if we had `>>=` in there, the tokenization logic would
+change significantly. Wanna try?
+3. Try handling comments.
+
+## Up next
+
+In the next and last chapter of this story arc, we build a parser and, with the
+help of the assembler and tokenizer, will be able to compile the `foo` function
+to an executable i386 word!
+
+[^1]: I won't be explaining the struct system in detail in this article. You're
+welcome to read the docs about it, of course, but it's not critical. Suffice it
+to say that when we call a structbind word, we "activate" its struct namespace,
+so the next word is a kind of "method" call.
+[^2]: this doesn't mean that Dusk can only open one file at a time. It can open
+more than one, just not through the `file` global structbind.
+[^3]: To non-whitespace
+[^4]: which is *not* a whitespace because all numbers in Dusk are unsigned. So
+`-1 SPC <=` is false.
+[^5]: I know, some symbols in C contain more than one character, but not in
+foo.c. Because our tokenizer is minimal, we can afford to stay simple.
+[^6]: it only works for a single character.
+[^7]: a structbind to the `Scratchpad` struct.
+
+[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/10-beast.tar.gz
+[prev]: 09-dusktillc.html
+[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt
+[struct]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/struct.txt
+[io]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/io.txt
+[file]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/file.txt
+[alloc]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/alloc.txt
+[scratch]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/scratch.txt
+
diff --git a/01-duskcc/10-beast/foo.c b/01-duskcc/10-beast/foo.c
new file mode 100644
index 0000000..295c1f6
--- /dev/null
+++ b/01-duskcc/10-beast/foo.c
@@ -0,0 +1,3 @@
+int foo(int a, int b) {
+ return a + b;
+}
diff --git a/01-duskcc/10-beast/myasm.fs b/01-duskcc/10-beast/myasm.fs
new file mode 100644
index 0000000..1094528
--- /dev/null
+++ b/01-duskcc/10-beast/myasm.fs
@@ -0,0 +1,12 @@
+0 const eax
+1 const ecx
+2 const edx
+3 const ebx
+4 const esp
+5 const ebp
+6 const esi
+7 const edi
+
+: ret, $c3 c, ;
+: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ;
+: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ;
diff --git a/01-duskcc/10-beast/tok.fs b/01-duskcc/10-beast/tok.fs
new file mode 100644
index 0000000..6da7a77
--- /dev/null
+++ b/01-duskcc/10-beast/tok.fs
@@ -0,0 +1,17 @@
+: isWS? ( c -- f ) SPC <= ;
+create symbols ," +(){};,"
+: isSym? ( c -- f ) symbols 7 [c]? 0>= ;
+: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
+
+$40 const MAXTOKSZ \ maximum size for tokens
+create buf MAXTOKSZ 1+ allot
+
+: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
+: nextt ( -- str-or-0 )
+ buf 1+ tonws ( a c )
+ dup isSym? if swap c! 1 else begin ( a c )
+ dup boundary? not while ( a c )
+ swap c!+ file :getc repeat ( a c )
+ file :putback ( a ) buf - 1- then ( length )
+ dup buf c! ( length ) if buf syspad :s, else 0 then ;
+: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;
diff --git a/01-duskcc/main.css b/01-duskcc/main.css
index 3edcec4..0a302f6 100644
--- a/01-duskcc/main.css
+++ b/01-duskcc/main.css
@@ -18,3 +18,11 @@ table td, table th {
blockquote {
font-style: italic;
}
+
+code {
+ background-color: #eeeeee
+}
+
+pre code {
+ display: block;
+}
diff --git a/Makefile b/Makefile
index ca553ef..2ad76be 100644
--- a/Makefile
+++ b/Makefile
@@ -8,7 +8,8 @@ ARTICLES_WITH_TGZ = \
01-duskcc/06-taletwostacks \
01-duskcc/07-babywalk \
01-duskcc/08-immediate \
- 01-duskcc/09-dusktillc
+ 01-duskcc/09-dusktillc \
+ 01-duskcc/10-beast
ARTICLES = \
$(ARTICLES_WITH_TGZ)
diff --git a/index.html b/index.html
index a4cde16..7109767 100644
--- a/index.html
+++ b/index.html
@@ -54,7 +54,19 @@ before you receive the bundle.
Buckle up, Dorothy (in progress)
-My “pilot” story arc is on the subject of Dusk OS’ C compiler.
+In my “pilot” story arc, we peek in
+disgust into the abyss of modern software complexity and escape this dystopia by
+tumbling down the rabbit hole of low level development.
+
+
+Starting from bare metal on the PC platform, we build a Forth from scratch, then
+switch to Dusk OS and then build a partial C
+compiler (just enough to compile our example code), again from scratch.
+
+
+The "teaser" part of this story arc is a rather large part of the whole, but
+that's because it's the "build a Forth" part and I believe that everyone should
+do that, so I want to keep it openly accessible.
Table of Contents
@@ -68,7 +80,8 @@ My “pilot” story arc is on the subj
Baby's first steps
The Unbearable Immediateness of Compiling
From Dusk Till C
- ... to write ...
+ Feeding the beast
+ In the eye of the compiler (not written yet)