M 01-duskcc/10-beast.md => 01-duskcc/10-beast.md +4 -3
@@ 293,9 293,9 @@ change significantly. Wanna try?
## Up next
-In the next and last chapter of this story arc, we build a parser and, with the
-help of the assembler and tokenizer, will be able to compile the `foo` function
-to an executable i386 word!
+In the [next and last chapter][upnext] of this story arc, we build a parser
+and, with the help of the assembler and tokenizer, will be able to compile the
+`foo` function to an executable i386 word!
[^1]: I won't be explaining the struct system in details in this article. You're
welcome to read the docs about it, of course, but it's not critical. Suffice it
@@ 313,6 313,7 @@ foo.c. Because our tokenizer is minimal, we can afford to stay simple.
[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/10-beast.tar.gz
[prev]: 09-dusktillc.html
+[upnext]: 11-eye.html
[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt
[struct]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/struct.txt
[io]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/io.txt
A 01-duskcc/11-eye.md => 01-duskcc/11-eye.md +234 -0
@@ 0,0 1,234 @@
+# [Tumble Forth](/): In the Eye of the Compiler
+
+Now that we have an assembler and [a tokenizer][prev], we can proceed with the
+final step of our compiler: the parse+generate[^1] one!
+
+Because our compiler aims to minimally compile the `foo()` function, it's
+actually going to be quite simple. The parsing of the function can be divided into
+four main parts:
+
+1. return type
+2. name
+3. arguments
+4. body
+
+## Return type
+
+Because our function involves only one type, this part of the parsing is
+actually a no-op. It has no purpose because our generation code is going to
+hardcode for the `int` type. Therefore, all we need to do is to consume the
+token and assert that it's `int`.
+
+This task is of course trivial, but what is less trivial is to design the API
+for our C compiler. [My proposition][srctgz] is this, which you can put in a
+new unit called `mycc.fs`:
+
+ ?f<< myasm.fs
+ ?f<< tok.fs
+ : err abort" CC error" ;
+ : expect ( tok str -- ) s= not if err then ;
+ : parseType ( -- ) nextt S" int" expect ;
+ : cc parseType tokstype ;
+
+This allows you to do:
+
+ f<< mycc.fs
+ f" foo.c" cc
+
+As you can see, the main entry point of our C compiler would be the word `cc`,
+which expects the `file` global to contain the C source to compile.
+
+At this stage, what `cc` would do is to check that the function return type is
+correct, then spit out the rest of the file as unparsed tokens for debugging.
+
+If you modify `foo.c` to give `foo` another return type, you'll see that `cc`
+will correctly raise an error.
+
+Code-wise, there are two words you haven't seen yet: `?f<<` and `s=`.
+
+`?f<<` loads the specified Forth source file only if it hasn't been loaded yet.
+In Dusk, this is systematically used in units to document dependencies.
+
+`s= ( str1 str2 -- f )` compares two strings and returns whether they have the
+same contents and length.
+
+## Name
+
+The next token coming up is the function name. Handling this one is also simple
+as all we need to do is to create a new dictionary entry with the same name. It
+can be done thus:
+
+ : parseName ( -- ) sysdict nextt entry ;
+ : cc parseType parseName tokstype ;
+
+After you've amended `mycc.fs`, you can run it again and see that it consumed
+the `foo` token and created a `foo` word (which you can verify with `' foo`).
+This word is of course broken because it's empty, so don't try to call it.
+
+And that's all there is to it. If we wanted our CC to support function calls,
+we'd want to bind this name to a signature structure somehow, and if we wanted
+to support the `static` keyword, we'd want to conditionally create this entry.
+But since we don't want all this, we can afford to stay simple.
+
+Code-wise, there are two words you haven't seen yet, `entry` and `sysdict`.
+
+`entry ( 'dict str -- )` is the low level word to create a dictionary entry.
+Instead of reading its name from the input stream like `code` and `:`, it gets
+it from a string supplied from `PS`. It also has the ability to create the entry
+in *any* dictionary, not just the system one. In this case, what we're after is
+the former ability, not the latter, because we want to create our word in the
+system dictionary.
+
+Which brings us to `sysdict`, which is a constant that yields the address of a
+pointer to the latest entry of the system dictionary. It's the exact same thing
+as the `dictionary` variable in our baby Forth. The `entry` word updates that
+pointer when it's finished creating the entry.
+
+## Arguments
+
+Our parser will handle an arbitrary number of arguments, all `int`s, and keep
+that list somewhere in memory so that we can map an identifier name to an `ESI`
+offset or `EAX`. The code looks like this:
+
+ ?f<< /lib/str.fs
+ $100 const ARGSLEN
+ create args ARGSLEN allot
+ : parseArgs ( -- )
+ args nextt S" (" expect begin ( a )
+ parseType nextt over strmove s) nextt dup S" ," s= while ( a tok )
+ drop repeat ( a tok )
+ S" )" expect 0 swap c! ;
+ : cc parseType parseName parseArgs tokstype ;
+
+With these additions, our `cc` word will consume the arguments and create a
+structure defined by Dusk's ["str" library][str], the String List. That list is
+very simple: it's a list of strings that follow each other in memory, ended by a
+null string (a string with a 0 length field). You can visualize that list by
+dumping the contents of `args` after having run `cc`:
+
+ args dump
+ :0000d299 0161 0162 0000 0000 0000 0000 0000 0000 .a.b............
+ ...
+
+Such a string list can easily be iterated upon and thus fulfills our
+requirement to map names to indexes. The "str" library has a word for this,
+`sfind ( str list -- idx )` which returns the index of the specified string in
+the list, or -1 if not found:
+
+ S" a" args sfind . \ prints 0
+ S" b" args sfind . \ prints 1
+ S" hello" args sfind . \ prints -1
+
+Memory-wise, you can see that I chose a static buffer for simplicity, with the
+tradeoff that it can only handle `ARGSLEN` bytes in its list. Parsing a
+function that busts this limit will result in memory corruption.
+
+Code-wise, we have once again two new words, `s)` and `strmove`.
+
+`s) ( str -- a )` simply jumps over the specified string and returns the
+address directly following its last character. This allows easy iteration of a
+string list.
+
+`strmove ( str dst -- )` is a utility wrapper around `move` and copies the
+contents of `str`, including its length byte, to address `dst`.
+
+## Body
+
+We've now reached the crucial point of our process where we can finally answer
+the question at the root of this tumbling down adventure, that is, how is the
+C source of our `foo()` function compiled and then run?
+
+ : assert ( f -- ) not if err then ;
+ : parseExpression ( -- )
+ nextt args sfind nextt S" +" expect nextt args sfind ( idx1 idx2 )
+ ?swap 1 = assert 0 = assert ( )
+ eax esi addr[], ;
+ : argscnt ( -- cnt ) 0 args begin dup c@ while swap 1+ swap s) repeat drop ;
+ : parseStatement ( -- )
+ nextt S" return" expect parseExpression nextt S" ;" expect
+ esi argscnt 1- 4 * addri, ret, ;
+ : parseBody nextt S" {" expect parseStatement nextt S" }" expect ;
+ : cc parseType parseName parseArgs parseBody ;
+
+If you've been using the POSIX VM, now is the time to switch back to `make
+pcrun` because the POSIX VM can't run i386 code:
+
+ f<< mycc.fs
+ f" foo.c"
+ cc
+ 42 54 foo . \ prints 96
+ .S \ shows that there is no PS leak
+
+Code-wise, there's only one word you haven't seen yet, `?swap ( a b -- lo hi )`
+which sorts the top two items of `PS`. This allows us to easily check if our
+expression contains both argument indexes 0 and 1 regardless of the order.
+
+The rest of the code is usage of the i386 assembler you've written yourself.
+
+So this is it! It's done! *Mission Accomplished!* as they say in America with
+much fanfare.
+
+Of course, this parsing code only works for a tiny
+subset of the C language. It can't:
+
+1. Have more than one statement in its body.
+2. Have a statement other than `return`.
+3. Have an empty `return` statement.
+4. Have a return type or argument types other than `int`.
+5. Parse an expression that isn't a `+` binary operation with argument
+references as its two terms.
+
+However, it's not a complete sham and does have a bit of leeway:
+
+1. It checks that the identifiers in the expression are correct. When they're
+not, the code is going to raise an error instead of generating the wrong code.
+2. The function can have more than two arguments. You can't use them in the
+expression, but the `return` handler will still properly consume those
+arguments from `PS`.
+3. The function can have any name.
+4. `b + a` will work too.
+
+But what is more important is that we can already see how we'd go about
+improving the compiler.
+
+We'd want to use a third argument in the expression? Add displacement support
+to the assembler (to allow references like `[esi+4]`) and conditionally use this
+capability in `parseExpression`.
+
+We'd want support for constant numbers? Try `parse ( str -- n f )` on the token
+and if it succeeds, conditionally use the `addri,` variant using that number
+instead of `addr[],`.
+
+We want to support `-`? Instead of expecting a "+" token, add a condition that
+maps `+` to `add` and `-` to `sub`. This of course requires an improved
+assembler.
+
+... and so on. But these improvements and the ones that follow are deep subjects
+and need separate story arcs.
+
+## Conclusion
+
+We've reached the end of this story arc by answering the initial question, that
+is, how can this simple piece of C code be compiled and run? We've also managed
+to generate code that's faster than GCC's or clang's because we're freed from
+UNIX's wasteful calling conventions. We also managed to answer the "running"
+part of the question by using a system that is so much simpler than UNIX that
+it becomes "just call the address!".
+
+Of course, the challenge of keeping up with modern C compilers in terms of
+performance quickly becomes much harder as the compiled code becomes larger,
+but isn't it refreshing to see that, at the heart of it, the task of compiling
+C code in a sound manner isn't impossibly complex?
+
+I hope that this pilot story arc helped you to demystify the subjects of low
+level computing, assembly and compilation and motivated you to dig into these
+subjects further, which we'll do in further story arcs.
+
+[^1]: Some compilers, most even, have separate steps for parsing and code
+generation, using an Abstract Syntax Tree representation of the code in between.
+We don't (DuskCC doesn't either). We generate code directly as we parse. It
+comes with a few drawbacks, but results in much simpler code.
+
+[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/11-eye.tar.gz
+[prev]: 10-beast.html
+[str]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/str.txt
A 01-duskcc/11-eye/foo.c => 01-duskcc/11-eye/foo.c +3 -0
@@ 0,0 1,3 @@
+int foo(int a, int b) {
+ return a + b;
+}
A 01-duskcc/11-eye/myasm.fs => 01-duskcc/11-eye/myasm.fs +12 -0
@@ 0,0 1,12 @@
+0 const eax
+1 const ecx
+2 const edx
+3 const ebx
+4 const esp
+5 const ebp
+6 const esi
+7 const edi
+
+: ret, $c3 c, ;
+: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ;
+: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ;
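Reviewer note on the three assembler words above: to cross-check the bytes they emit, here is a minimal Python sketch that mirrors their encodings. It is not part of the commit. The function names (`ret`, `add_reg_mem`, `add_reg_imm`, `compile_foo_body`) are hypothetical, and the sketch assumes that `,` writes a 4-byte little-endian cell, as it would on i386.

```python
import struct

# Register numbers, same values as the constants in myasm.fs.
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI = range(8)

def ret():
    """RET: single opcode byte C3 (mirrors `ret,`)."""
    return bytes([0xC3])

def add_reg_mem(dst, src):
    """ADD dst, [src]: opcode 03, ModRM with mod=00, reg=dst, rm=src.

    Mirrors `addr[],`. Like the original, it does not handle the special
    rm values (ESP and EBP), which need a SIB byte or a displacement.
    """
    return bytes([0x03, (dst << 3) | src])

def add_reg_imm(reg, imm):
    """ADD reg, imm32: opcode 81, ModRM C0|reg, then a 32-bit immediate.

    Mirrors `addri,`, assuming a 4-byte little-endian cell.
    """
    return bytes([0x81, 0xC0 | reg]) + struct.pack("<I", imm & 0xFFFFFFFF)

def compile_foo_body(arg_count):
    """Bytes that parseExpression and parseStatement would emit together."""
    return (add_reg_mem(EAX, ESI)
            + add_reg_imm(ESI, (arg_count - 1) * 4)
            + ret())

print(compile_foo_body(2).hex(" "))
```

For the two-argument `foo`, this gives `add eax, [esi]`, then `add esi, 4`, then `ret`, which is nine bytes in total.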
A 01-duskcc/11-eye/mycc.fs => 01-duskcc/11-eye/mycc.fs +25 -0
@@ 0,0 1,25 @@
+?f<< /lib/str.fs
+?f<< myasm.fs
+?f<< tok.fs
+$100 const ARGSLEN
+create args ARGSLEN allot
+: err abort" CC error" ;
+: assert ( f -- ) not if err then ;
+: expect ( tok str -- ) s= not if err then ;
+: parseType ( -- ) nextt S" int" expect ;
+: parseName ( -- ) sysdict nextt entry ;
+: argscnt ( -- cnt ) 0 args begin dup c@ while swap 1+ swap s) repeat drop ;
+: parseArgs ( -- )
+ args nextt S" (" expect begin ( a )
+ parseType nextt over strmove s) nextt dup S" ," s= while ( a tok )
+ drop repeat ( a tok )
+ S" )" expect 0 swap c! ;
+: parseExpression ( -- )
+ nextt args sfind nextt S" +" expect nextt args sfind ( idx1 idx2 )
+ ?swap 1 = assert 0 = assert ( )
+ eax esi addr[], ;
+: parseStatement ( -- )
+ nextt S" return" expect parseExpression nextt S" ;" expect
+ esi argscnt 1- 4 * addri, ret, ;
+: parseBody nextt S" {" expect parseStatement nextt S" }" expect ;
+: cc parseType parseName parseArgs parseBody ;
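Reviewer note on the string list that `parseArgs` builds and `argscnt` walks: the layout described in the article (length-prefixed strings back to back, closed by a zero length byte) can be modeled in a few lines of Python. This is an illustrative sketch only, not Dusk's implementation. `build_list`, `sfind` and `count` are hypothetical Python stand-ins for the behavior the article describes.

```python
def build_list(names):
    """Pack names as length-prefixed strings, ended by a 0 length byte."""
    out = bytearray()
    for name in names:
        data = name.encode("ascii")
        out.append(len(data))   # one length byte...
        out += data             # ...followed by the characters
    out.append(0)               # null string terminates the list
    return bytes(out)

def sfind(name, lst):
    """Return the index of name in the packed list, or -1 if absent."""
    target = name.encode("ascii")
    pos = idx = 0
    while lst[pos] != 0:
        length = lst[pos]
        if lst[pos + 1:pos + 1 + length] == target:
            return idx
        pos += 1 + length       # jump over this string, like `s)`
        idx += 1
    return -1

def count(lst):
    """Number of strings in the packed list (what argscnt computes)."""
    pos = n = 0
    while lst[pos] != 0:
        pos += 1 + lst[pos]
        n += 1
    return n

args = build_list(["a", "b"])
print(args.hex(" "))            # same leading bytes as the `args dump` output
print(sfind("a", args), sfind("b", args), sfind("hello", args), count(args))
```

The packed bytes for `a` and `b` start with `01 61 01 62 00`, matching the dump shown in the article.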
A 01-duskcc/11-eye/tok.fs => 01-duskcc/11-eye/tok.fs +17 -0
@@ 0,0 1,17 @@
+: isWS? ( c -- f ) SPC <= ;
+create symbols ," +(){};,"
+: isSym? ( c -- f ) symbols 7 [c]? 0>= ;
+: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
+
+$40 const MAXTOKSZ \ maximum size for tokens
+create buf MAXTOKSZ 1+ allot
+
+: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
+: nextt ( -- str-or-0 )
+ buf 1+ tonws ( a c )
+ dup isSym? if swap c! 1 else begin ( a c )
+ dup boundary? not while ( a c )
+ swap c!+ file :getc repeat ( a c )
+ file :putback ( a ) buf - 1- then ( length )
+ dup buf c! ( length ) if buf syspad :s, else 0 then ;
+: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;
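Reviewer note on `tok.fs`: the token stream that `cc` consumes can be previewed with a small Python sketch that follows the same rules as the tokenizer above (whitespace is any character at or below space, each character of `+(){};,` is a token on its own, everything else runs until the next boundary). This is a hedged illustration, not a port. The `tokens` generator is hypothetical and ignores the `MAXTOKSZ` limit.

```python
SYMBOLS = "+(){};,"

def is_ws(c):
    """Same rule as isWS?: anything at or below the space character."""
    return c <= " "

def tokens(src):
    """Yield tokens from src, one string per token."""
    i, n = 0, len(src)
    while True:
        # Skip whitespace, like tonws.
        while i < n and is_ws(src[i]):
            i += 1
        if i >= n:
            return              # end of input: the Forth version returns 0
        if src[i] in SYMBOLS:
            yield src[i]        # a symbol is always a one-character token
            i += 1
            continue
        start = i
        # Accumulate until a boundary (whitespace, symbol or end of input).
        while i < n and not is_ws(src[i]) and src[i] not in SYMBOLS:
            i += 1
        yield src[start:i]

FOO_C = "int foo(int a, int b) {\n  return a + b;\n}\n"
print(list(tokens(FOO_C)))
```

Read left to right, the output is the order in which `parseType`, `parseName`, `parseArgs` and `parseBody` each take their share of tokens.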
M Makefile => Makefile +2 -1
@@ 9,7 9,8 @@ ARTICLES_WITH_TGZ = \
01-duskcc/07-babywalk \
01-duskcc/08-immediate \
01-duskcc/09-dusktillc \
- 01-duskcc/10-beast
+ 01-duskcc/10-beast \
+ 01-duskcc/11-eye
ARTICLES = \
$(ARTICLES_WITH_TGZ)
M index.html => index.html +2 -2
@@ 51,7 51,7 @@ before you receive the bundle.</p>
<h2>Story arcs</h2>
-<h3>Buckle up, Dorothy (in progress)</h3>
+<h3>Buckle up, Dorothy</h3>
<p>
In my <a href="01-duskcc/01-buckleup.html">“pilot” story arc</a>, we peek in
@@ 81,7 81,7 @@ do that, so I want to keep it openly accessible.
<li><a href="01-duskcc/08-immediate.html">The Unbearable Immediateness of Compiling</a></li>
<li>From Dusk Till C</li>
<li>Feeding the beast</li>
- <li>In the eye of the compiler (not written yet)</li>
+ <li>In the Eye of the Compiler</li>
</ol>
<script async