From f79d48bf0a9cc659e55580bca9d108bc5bd31e34 Mon Sep 17 00:00:00 2001 From: Virgil Dupras Date: Thu, 4 May 2023 21:15:04 -0400 Subject: [PATCH] Feeding the beast --- 01-duskcc/09-dusktillc.md | 3 + 01-duskcc/10-beast.md | 322 ++++++++++++++++++++++++++++++++++++ 01-duskcc/10-beast/foo.c | 3 + 01-duskcc/10-beast/myasm.fs | 12 ++ 01-duskcc/10-beast/tok.fs | 17 ++ 01-duskcc/main.css | 8 + Makefile | 3 +- index.html | 17 +- main.css | 8 + 9 files changed, 390 insertions(+), 3 deletions(-) create mode 100644 01-duskcc/10-beast.md create mode 100644 01-duskcc/10-beast/foo.c create mode 100644 01-duskcc/10-beast/myasm.fs create mode 100644 01-duskcc/10-beast/tok.fs diff --git a/01-duskcc/09-dusktillc.md b/01-duskcc/09-dusktillc.md index 1240431..9613287 100644 --- a/01-duskcc/09-dusktillc.md +++ b/01-duskcc/09-dusktillc.md @@ -303,6 +303,8 @@ encoding][x86enc]. In the next article, we’ll set aside our shiny new assembler for a while as we tackle the first part of a C compiler by building a tokenizer. +*[Next: Feeding the beast][nextup]* + [^1]: PC build takes a little while. This is because I insist on using Dusk’s tools to build the destination FAT12 filesystem rather than POSIX ones, and those tools have to run through Dusk’s POSIX VM which is pretty slow. Every @@ -327,6 +329,7 @@ the operands is always “direct”. `reg` is the one. [srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/09-dusktillc.tar.gz [prev]: 08-immediate.html +[nextup]: 10-beast.html [dusk]: http://duskos.org/ [iter]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/iter.txt [usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt diff --git a/01-duskcc/10-beast.md b/01-duskcc/10-beast.md new file mode 100644 index 0000000..2196705 --- /dev/null +++ b/01-duskcc/10-beast.md @@ -0,0 +1,322 @@ +# [Tumble Forth](/): Feeding the beast + +[Now that we have our assembler][prev], we can begin tackling C directly. 
You'll
+remember that this is the code we want to compile:
+
+    int foo(int a, int b) {
+        return a + b;
+    }
+
+The first step in the compilation process is to separate this stream into a
+series of tokens, which will then be fed to the parser.
+
+Tokenization in C is slightly more complex than tokenization in Forth:
+whitespaces **and** symbols are token boundaries. For example, `"return a+b;"`
+yields tokens `"return"`, `"a"`, `"+"`, `"b"` and `";"`.
+
+You'll see in this article that the actual logic for doing so is almost trivial,
+hardly enough to be worth a whole article. However, to build this API, we'll
+need to leverage mechanisms in Dusk that are not trivial and this article will
+walk you through them.
+
+The logic we build in this article is CPU independent. Unlike the previous
+article, it's not required to run Dusk with `make pcrun`; `make run` will also
+work. It's a bit more convenient to use under POSIX because it's faster to build
+and you can paste text in the terminal. It will pick up files in the `fs`
+subfolder in the same way as `make pcrun`.
+
+## Strings
+
+Our tokenizer will yield tokens as [strings][usage]. Strings are a series of
+contiguous bytes in memory preceded by a "length" field, which is one byte in
+size. When we refer to a string, we refer to the address of the "length" field.
+
+The `S"` word allows you to create a string literal. It's limited by Forth
+tokenization rules, which means that it must always be followed by a whitespace,
+which will not be included in the string. For example, `S" hello"` yields an
+address in memory that will look like this:
+
+     A+0  A+1  A+2  A+3  A+4  A+5
+    +-----------------------------+
+    | 05 | 68 | 65 | 6C | 6C | 6F |
+    +-----------------------------+
+
+Dusk doesn't bother with implementing `strlen`, as it's exactly the same as
+`c@`. It's also frequent to use strings with `c@+ ( addr -- addr+1 c )` to
+process strings.
Example:
+
+    : foo ( str -- ) c@+ for ( a ) c@+ emit next ( a ) drop ;
+    S" hello" foo \ prints "hello", exactly like the "stype" word
+
+## Stack comments
+
+The snippet above also includes a frequent pattern in Dusk code. First, words
+are documented with a signature comment. This one indicates that it expects a
+string to live on PS top and that the word will consume it.
+
+We also see "stack status" comments, to help follow along with the code. For
+example, after `c@+` is called, PS becomes `( a length )`, with `a` being the
+address of the first character of the string. Because the `for` consumes
+`length`, the loop body ends up with a stack status of `( a )`.
+
+To avoid making words too heavy, we don't bother commenting the stack status at
+each step. However, Dusk has the habit of documenting stack status at particular
+points:
+
+1. Right after the beginning of the loop
+2. Sometimes at the end of a loop. Especially in cases where there is more than
+   one exit point for the loop, for example in `begin..while..repeat`, in which
+   exit points can have different stack status!
+3. At tricky points in big complicated words (of which there should be as few as
+   possible).
+
+## Crossing the streams
+
+Our tokenizer needs to feed itself from some kind of stream. You *could* feed
+yourself directly from the main stream using the `in< ( -- c )` word. Here's an
+example usage:
+
+    : foo in< emit ;
+    foo A \ the A character is not interpreted by Forth, but by "foo"
+
+This could work, but you'll spend too much time copy/pasting C code around. A
+more convenient way would be to use Dusk's [File API][file], which itself rests
+upon the [IO API][io].
+
+To try it, write your C file in Dusk's `fs/` directory under `foo.c` and then do
+this:
+
+    f" foo.c"
+    console :self file :spit \ Spits the contents of the file to the console
+
+`console` is a global [structbind][struct][^1] to the [Pipe][io] struct and
+`file` is a global structbind to the [File][file] struct.
+
+`console` is not of particular interest to us at the moment, so we'll set it
+aside, but `file` is. `file` represents Dusk's "current work file". There's only
+one at a time and we can make it open a particular path with the `f"` word, as
+shown above. Whenever we open a new file, the old one is closed[^2].
+
+`file` obeys Dusk's I/O semantics, which allow us to seamlessly carry streams of
+bytes left and right. Any method from the `IO` struct can be called on `file`.
+Our tokenizer will be consuming the stream character by character, so the I/O
+method we'll want to use is `:getc ( hdl -- c )`. `hdl`, a reference to the I/O
+struct, is provided automatically by structbinds, so for us, the effective word
+signature is `( -- c )`.
+
+The `:getc` word yields the character at the current stream position and then
+advances this position by 1. If the end of the stream has been reached, -1 is
+returned. Therefore, we can reimplement (badly) the `:spit` method above thus:
+
+    : myspit ( -- )
+      begin ( ) file :getc dup -1 <> while ( c ) emit repeat ( c ) drop ;
+    f" foo.c"
+    myspit
+
+You'll notice that if you call `myspit` twice, the second time doesn't print
+anything. That's because the file's position is at the end of it. To rewind it,
+you can do `0 file :seek`.
+
+## Tokenizer API
+
+What we'll want to build is a word that, when called, yields the next
+token, as a string, from `file`. If we've reached the end of the file, we yield
+zero. Let's call this word `nextt ( -- str-or-0 )` (for "next token").
+
+Let's start roughly and yield our stream line by line, leveraging
+`IO :readline`:
+
+    : nextt ( -- str-or-0 ) file :readline ;
+    f" foo.c"
+    nextt stype \ prints "int foo(int a, int b) {"
+    nextt stype \ prints " return a + b;"
+    nextt stype \ prints "}"
+    nextt . \ prints 0
+
+Now, all we have to do is to refine the process until our goal is reached.
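For readers coming from C, the stream contract described above has a close stdio analogue. The sketch below is not Dusk code and says nothing about how Dusk implements its I/O; `spit` and `next_line` are hypothetical names chosen for illustration. `fgetc` returning `EOF` plays the role of `:getc` returning -1, `rewind` stands in for `0 file :seek`, and a `NULL` return mirrors the 0 that `nextt` yields at end of file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical analogue of `myspit`: copy every remaining byte of `in`
 * to `out`, one character at a time, and return how many were copied. */
long spit(FILE *in, FILE *out) {
    assert(in != NULL && out != NULL);
    long n = 0;
    int c;
    while ((c = fgetc(in)) != EOF) {  /* EOF here, -1 in Dusk */
        fputc(c, out);
        n++;
    }
    return n;
}

/* Hypothetical analogue of the line-by-line draft of `nextt`: return the
 * next line of `f` without its trailing newline, or NULL at end of stream.
 * Assumes lines shorter than the buffer; longer lines come back in pieces. */
const char *next_line(FILE *f) {
    static char buf[256];             /* single buffer, reused by each call */
    assert(f != NULL);
    if (fgets(buf, sizeof buf, f) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline, if any */
    return buf;
}
```

Calling `spit` twice in a row copies nothing the second time, just like `myspit`, until the stream is rewound. `next_line` hands back a pointer into a single reused buffer, so each call overwrites the previous result.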
+
+## Splitting whitespace
+
+`:readline` is a nice little trick to start out, but we'll need to do like we
+did in our baby Forth and accumulate characters into a buffer. We *could*
+imagine some kind of algorithm that yields substrings from the line yielded by
+`:readline`, but we'd need to hold onto that line in between `nextt` calls
+*and* find a way to insert token lengths in there. Much more complicated than
+simply accumulating from characters, so let's do that. First, what's a
+whitespace? Let's start a new `tok.fs` unit with this utility:
+
+    : isWS? ( c -- f ) SPC <= ;
+
+`SPC` is a system constant for $20. In signatures, `f` means "flag" (0 or 1).
+
+Then, all we need is a buffer and a loop:
+
+    $40 const MAXTOKSZ \ maximum size for tokens
+    create buf MAXTOKSZ 1+ allot
+
+    : tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
+    : nextt ( -- str-or-0 )
+      buf 1+ tonws begin ( a c )
+        dup isWS? not over -1 <> and while ( a c ) \ while neither WS nor EOF
+        swap c!+ file :getc repeat ( a c )
+      drop ( a ) buf - 1- ( length )
+      dup buf c! ( length ) if buf else 0 then ;
+    : tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;
+
+You can now print all tokens from `foo.c` with `f<< tok.fs f" foo.c"
+tokstype`.
+
+As with our baby Forth word accumulator, we have two loops in this code.
+
+The first one, `tonws`[^3], simply consumes the file until a non-whitespace is
+encountered. Because this loop requires us to read a non-whitespace character,
+we need the word to yield it so it can be used later.
+
+This loop is not a `begin..until`, as covered in Starting Forth, but rather a
+`begin..while..repeat` one, which isn't covered (but is frequent among Forth
+implementations). It works in a similar manner, but allows the loop condition to
+be in the middle of the body, often simplifying it. For example, the same word
+with `begin..until` would look like this:
+
+    : tonws ( -- c ) begin file :getc dup isWS?
if drop 0 else 1 then until ;
+
+The second loop, within `nextt` itself, is the main accumulating loop. We see
+that the loop begins with `buf+1` as a starting address, the `+1` being for
+leaving space for the string count byte, which we'll set later.
+
+The `while` condition is more complex than in `tonws` because we explicitly
+have to check for EOF[^4]. To be able to `"and"` two conditions together, we
+have to juggle with `PS` a little bit.
+
+Then comes the accumulation part, which is straightforward when we use `c!+ ( c
+a -- a+1 )`.
+
+When the loop exits, `c` will be either a whitespace or -1, for which we have
+no use and drop. At that point, `a` points to the empty space following our
+last character, allowing us to easily compute the length of our accumulated
+string, which is the last thing we need to do.
+
+Finally, we need to check if we've accumulated something at all. If our result
+length is zero, this means that we've read zero or more whitespaces followed
+directly by EOF, so we have nothing to return. If it's not, we write the length
+to the first byte of `buf` and return it as the result.
+
+Our last word, `tokstype`, repeatedly calls `nextt` and prints the result until
+exhausted, so that we can test our tokenizer.
+
+## Tokenizing symbols
+
+To have a proper tokenizer, we need to not only split by whitespace, but also
+by symbols, which unlike whitespaces are also tokens. This presents new
+challenges to us:
+
+1. Stop accumulating a token when we encounter a symbol character[^5].
+2. Don't drop that character like we do with the whitespace.
+3. When the first character we meet is a symbol, stop accumulating and return
+it as a token.
+
+For the first challenge, let me show you the word `[c]? ( c a u -- idx )`:
+
+    create symbols ," +(){};,"
+    : isSym? ( c -- f ) symbols 7 [c]? 0>= ;
+
+The word `[c]?` finds the index of character `c` in the memory range starting at
+address `a` with a length `u`.
We've created such a memory range in `symbols`
+above, which allows us to easily determine whether a given character is in that
+range.
+
+For the second challenge, we could save the last character into a variable
+instead of dropping it and, on the following `nextt` call, pick it up instead
+of calling `:getc`, but the I/O subsystem *already* has such a system! The
+method is called `:putback ( c hdl -- )`[^6]. When that's called, the next read
+operation will transparently include it. It sounds like nothing, but it saves a
+lot of complicated logic in some situations.
+
+For the third challenge, we need to make a symbol check before the loop to
+handle this case and not enter the loop when we encounter it.
+
+This could give us a `nextt` that looks like this:
+
+    : boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
+    : nextt ( -- str-or-0 )
+      buf 1+ tonws ( a c )
+      dup isSym? if swap c! 1 else begin ( a c )
+        dup boundary? not while ( a c )
+        swap c!+ file :getc repeat ( a c )
+        file :putback ( a ) buf - 1- then ( length )
+      dup buf c! ( length ) if buf else 0 then ;
+
+With this new code, `tokstype` will spit the right tokens.
+
+## Scratchpads
+
+Does this mean we're finished? Not yet because our tokenizer has one big
+problem:
+
+    f<< tok.fs
+    f" foo.c"
+    nextt nextt stype stype
+
+*What?* Twice the second token? Yup. That's because we use a static buffer.
+Calling `nextt` always overwrites the previous result.
+
+The parser will often have to keep a handful of tokens in memory at once and
+juggle with them a little bit, so this won't do.
+
+To solve this problem, we're going to use one of Dusk's [dynamic memory
+allocators][alloc], the [scratchpad][scratch]. A scratchpad is a rolling buffer.
+We allocate stuff to it and when it reaches the end of its buffer, it goes back
+to the beginning. It has the advantage of being simple to use because we never
+have to free what we allocate.
The drawback is that these memory areas can only
+be used as temporary values. If you need to hold onto them, you need to copy
+them into a more permanent area of memory.
+
+You can create your own scratchpad buffer, but for small and very localized
+usage, there's a global scratchpad called `syspad`[^7] which is more convenient
+to use. In our case, it will do fine.
+
+All we need to do is to change the final `buf` reference in `nextt` to `buf
+syspad :s,`, that is, a method that allocates enough space for the supplied
+string, copies it into the scratchpad, and returns the address of the newly
+allocated area.
+
+That's it! We have a functional tokenizer that will be good enough for `foo.c`!
+
+## Exercises
+
+1. Did you notice that we don't check for `buf` overflow? How would you go about
+implementing such a check?
+2. We're lucky because none of the symbols we use in `foo.c` are more than a
+single character. But if we had `>>=` in there, the tokenization logic would
+change significantly. Wanna try?
+3. Try handling comments.
+
+## Up next
+
+In the next and last chapter of this story arc, we build a parser and, with the
+help of the assembler and tokenizer, will be able to compile the `foo` function
+to an executable i386 word!
+
+[^1]: I won't be explaining the struct system in detail in this article. You're
+welcome to read the docs about it, of course, but it's not critical. Suffice it
+to say that when we call a structbind word, we "activate" its struct namespace,
+so the next word is a kind of "method" call.
+[^2]: this doesn't mean that Dusk can only open one file at once. It can open
+more than one, just not through the `file` global structbind.
+[^3]: To non-whitespace
+[^4]: which is *not* a whitespace because all numbers in Dusk are unsigned. So
+`-1 SPC <=` is false.
+[^5]: I know, some symbols in C contain more than one character, but not in
+foo.c. Because our tokenizer is minimal, we can afford to stay simple.
+[^6]: it only works for a single character. +[^7]: a structbind to the `Scratchpad` struct. + +[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/10-beast.tar.gz +[prev]: 09-dusktillc.html +[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt +[struct]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/struct.txt +[io]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/io.txt +[file]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/file.txt +[alloc]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/alloc.txt +[scratch]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/scratch.txt + diff --git a/01-duskcc/10-beast/foo.c b/01-duskcc/10-beast/foo.c new file mode 100644 index 0000000..295c1f6 --- /dev/null +++ b/01-duskcc/10-beast/foo.c @@ -0,0 +1,3 @@ +int foo(int a, int b) { + return a + b; +} diff --git a/01-duskcc/10-beast/myasm.fs b/01-duskcc/10-beast/myasm.fs new file mode 100644 index 0000000..1094528 --- /dev/null +++ b/01-duskcc/10-beast/myasm.fs @@ -0,0 +1,12 @@ +0 const eax +1 const ecx +2 const edx +3 const ebx +4 const esp +5 const ebp +6 const esi +7 const edi + +: ret, $c3 c, ; +: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ; +: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ; diff --git a/01-duskcc/10-beast/tok.fs b/01-duskcc/10-beast/tok.fs new file mode 100644 index 0000000..6da7a77 --- /dev/null +++ b/01-duskcc/10-beast/tok.fs @@ -0,0 +1,17 @@ +: isWS? ( c -- f ) SPC <= ; +create symbols ," +(){};," +: isSym? ( c -- f ) symbols 7 [c]? 0>= ; +: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ; + +$40 const MAXTOKSZ \ maximum size for tokens +create buf MAXTOKSZ 1+ allot + +: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ; +: nextt ( -- str-or-0 ) + buf 1+ tonws ( a c ) + dup isSym? if swap c! 1 else begin ( a c ) + dup boundary? 
not while ( a c ) + swap c!+ file :getc repeat ( a c ) + file :putback ( a ) buf - 1- then ( length ) + dup buf c! ( length ) if buf syspad :s, else 0 then ; +: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ; diff --git a/01-duskcc/main.css b/01-duskcc/main.css index 3edcec4..0a302f6 100644 --- a/01-duskcc/main.css +++ b/01-duskcc/main.css @@ -18,3 +18,11 @@ table td, table th { blockquote { font-style: italic; } + +code { + background-color: #eeeeee +} + +pre code { + display: block; +} diff --git a/Makefile b/Makefile index ca553ef..2ad76be 100644 --- a/Makefile +++ b/Makefile @@ -8,7 +8,8 @@ ARTICLES_WITH_TGZ = \ 01-duskcc/06-taletwostacks \ 01-duskcc/07-babywalk \ 01-duskcc/08-immediate \ - 01-duskcc/09-dusktillc + 01-duskcc/09-dusktillc \ + 01-duskcc/10-beast ARTICLES = \ $(ARTICLES_WITH_TGZ) diff --git a/index.html b/index.html index a4cde16..7109767 100644 --- a/index.html +++ b/index.html @@ -54,7 +54,19 @@ before you receive the bundle.

Buckle up, Dorothy (in progress)

-My “pilot” story arc is on the subject of Dusk OS’ C compiler. +In my “pilot” story arc, we peek in +disgust in the abyss of modern software complexity and escape this dystopia by +tumbling down the rabbit hole of low level development. +

+

+Starting from bare metal on the PC platform, we build a Forth from scratch, then +switch to Dusk OS and then build a partial C +compiler (just enough to compile our example code), again from scratch. +

+

+The "teaser" part of this story arc is a rather large part of the whole, but +that's because it's the "build a Forth" part and I believe that everyone should +do that, so I want to keep it openly accessible.

Table of Contents

@@ -68,7 +80,8 @@ My “pilot” story arc is on the subj
  • Baby's first steps
  • The Unbearable Immediateness of Compiling
  • From Dusk Till C
  • -
  • ... to write ...
  • +
  • Feeding the beast
  • +
  • In the eye of the compiler (not written yet)