~vdupras/tumbleforth

70acecdae3a8741d3050ee657372d3201baa5242 — Virgil Dupras 1 year, 3 months ago be4aa20
01-duskcc/11-eye
M 01-duskcc/10-beast.md => 01-duskcc/10-beast.md +4 -3
@@ 293,9 293,9 @@ change significantly. Wanna try?

## Up next

In the next and last chapter of this story arc, we build a parser and, with the
help of the assembler and tokenizer, will be able to compile the `foo` function
to an executable i386 word!
In the [next and last chapter][upnext] of this story arc, we build a parser
and, with the help of the assembler and tokenizer, will be able to compile the
`foo` function to an executable i386 word!

[^1]: I won't be explaining the struct system in details in this article. You're
welcome to read the docs about it, of course, but it's not critical. Suffice it


@@ 313,6 313,7 @@ foo.c. Because our tokenizer is minimal, we can afford to stay simple.

[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/10-beast.tar.gz
[prev]: 09-dusktillc.html
[upnext]: 11-eye.html
[usage]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/usage.txt
[struct]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/struct.txt
[io]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/sys/io.txt

A 01-duskcc/11-eye.md => 01-duskcc/11-eye.md +234 -0
@@ 0,0 1,234 @@
# [Tumble Forth](/): In the Eye of the Compiler

We have an assembler and [a tokenizer][prev], we can proceed with the final step
of our compiler, the parse+generate[^1] one!

Because our compiler aims to minimally compile the `foo()` function, it's
actually going to be quite simple. The parsing of the function can be divided in
four main parts:

1. return type
2. name
3. arguments
4. body

## Return type

Because our function involves only one type, this part of the parsing is
actually a no-op. It has no purpose because our generation code is going to
hardcode for the `int` type. Therefore, all we need to do is to consume the
token and assert that it's `int`.

This task is of course trivial, but what is less trivial is to design the API
for our C compiler. [My proposition][srctgz] is this, which you can put in a
new unit called `mycc.fs`:

    ?f<< myasm.fs
    ?f<< tok.fs
    : err abort" CC error" ;
    : expect ( tok str -- ) s= not if err then ;
    : parseType ( -- ) nextt S" int" expect ;
    : cc parseType tokstype ;

This allows you to do:

    f<< mycc.fs
    f" foo.c" cc

As you can see, the main entry point of our C compiler would be the word `cc`,
which expects the `file` global to contain the C source to compile.

At this stage, what `cc` would do is to check that the function return type is
correct, then spit the rest of the file as unparsed tokens for debugging.

If you modify `foo.c` to give `foo` another return type, you'll see that `cc`
will correctly raise an error.

Code wise, there are two words you haven't seen yet: `?f<<` and `s=`.

`?f<<` loads the specified Forth source file only if it hasn't been loaded yet. 
In Dusk, this is systematically used in units to document dependencies.

`s= ( str1 str2 -- f )` compares two strings and return whether they have the
same contents and length.

## Name

The next token coming up is the function name. Handling this one is also simple
as all we need to do is to create a new dictionary entry with the same name. It
can be done thus:

    : parseName ( -- ) sysdict nextt entry ;
    : cc parseType parseName tokstype ;

After you've amended `mycc.fs`, you can run it again and see that it consumed
the `foo` token and created a `foo` word (which you can verify with `' foo`).
This word is of course broken because it's empty, don't try to call it.

And that's all there is to it. If we wanted our CC to support function call,
we'd want to bind this name to a signature structure somehow, and if we wanted
to support the `static` keyword, we'd want to conditionally create this entry.
But since we don't want all this, we can afford to stay simple.

Code wise, there are two words you haven't seen yet, `entry` and `sysdict`.

`entry ( 'dict str -- )` is the low level word to create a dictionary entry.
Instead of reading its name from the input stream like `code` and `:`, it gets
it from a string supplied from `PS`. It also has the ability to create the entry
in *any* dictionary, not just the system one. In this case, what we're after is
the former ability, not the latter, because we want to create our word in the
system dictionary.

Which brings us to `sysdict`, which is a constant that yields the address of a
pointer to the lastest entry to the system dictionary. It's the exact same thing
as the `dictionary` variable in our baby Forth. The `entry` word updates that
pointer when it's finished creating the entry.

## Arguments

Our parser will handle an arbitrary number of arguments, all `int`s, and keep
that list somewhere in memory so that we can map an identifier name to a `ESI`
offset or `EAX`. The code looks like this:

    ?f<< /lib/str.fs
    $100 const ARGSLEN
    create args ARGSLEN allot
    : parseArgs ( -- )
      args nextt S" (" expect begin ( a )
        parseType nextt over strmove s) nextt dup S" ," s= while ( a tok )
        drop repeat ( a tok )
      S" )" expect 0 swap c! ;
    : cc parseType parseName parseArgs tokstype ;

With these additions, our `cc` word will consume the arguments and create a
structure defined by Dusk's ["str" library][str], the String List. That list is
very simple, it's a list of strings that follow each other in memory, ended by a
null string (a string with a 0 length field). You can vizualize that list by
dumping the contents of `args` after having run `cc`:

    args dump
    :0000d299 0161 0162 0000 0000 0000 0000 0000 0000 .a.b............
    ...

Such a string list can easily be iterated upon and thus fulfills our
requirement to map names to indexes. The "str" library has a word for this,
`sfind ( str list -- idx )` which returns the index of the specified string in
the list, or -1 if not found:

    S" a" args sfind . \ prints 0
    S" b" args sfind . \ prints 1
    S" hello" args sfind . \ prints -1

Memory-wise, you can see that I chose a static buffer for simplicity, with the
tradeoff that it can only handle `ARGSLEN` bytes in its list. Parsing a
function that busts this limit will result in memory corruption.

Code-wise, we have once again two new words, `s)` and `strmove`.

`s) ( str -- a )` simply jumps over the specified string and returns the
address directly following its last character. This allows easy iteration of a
string list.

`strmove ( str dst -- )` is a utility wrapper around `move` and copies the
contents of `str`, including its length byte, to address `dst`.

## Body

We've now reached the crucial point of our process where we can finally answer
the question at the root of this tumbling down adventure, that is, how is the
C source of our `foo()` function compiled and then ran?

    : assert ( f -- ) not if err then ;
    : parseExpression ( -- ) 
      nextt args sfind nextt S" +" expect nextt args sfind ( idx1 idx2 )
      ?swap 1 = assert 0 = assert ( )
      eax esi addr[], ;
    : argscnt ( -- cnt ) 0 args begin dup c@ while swap 1+ swap s) repeat drop ;
    : parseStatement ( -- )
      nextt S" return" expect parseExpression nextt S" ;" expect
      esi argscnt 1- 4 * addri, ret, ;
    : parseBody nextt S" {" expect parseStatement nextt S" }" expect ;
    : cc parseType parseName parseArgs parseBody ;

If you've been using the POSIX VM, now is the time to switch back to `make
pcrun` because the POSIX VM can't run i386 code:

    f<< mycc.fs
    f" foo.c"
    cc
    42 54 foo . \ prints 96
    .S \ shows that there is no PS leak

Code-wise, there's only one word you haven't seen yet, `?swap ( a b -- lo hi )`
which sorts the top two items of `PS`. This allows us to easily check if our
expression contains both argument indexes 0 and 1 regardless of the order.

The rest of the code is usage of the i386 assembler you've written yourself.

So this is it! it's done! *Mission Accomplished!* as they say in America with
much fanfare.

Of course, this parsing code only works for a tiny
subset of the C language. It can't:

1. Have more than one statement in its body.
2. Have a statement other than `return`.
3. Have an empty `return` statement.
4. Have a return type or argument types other than `int`.
5. Parse an expression that isn't a `+` binary operations with argument
references as its two terms.

However, it's not a complete sham and does have a bit of leeway:

1. It checks that the identifiers in the expression are correct. When they're
not, the code is going to raise an error instead of generating the wrong code.
2. The function can have more than two arguments. You can't use them in the
expression, but the `return` handler will still properly consume those
arguments from `PS`.
3. The function can have any name.
4. `b + a` will work too.

But what is more important is that we can already see how we'd go around
improving the compiler.

We'd want to use a third argument in the expression? Add displacement support
to the assembler (to allow references like `[esi+4]`) and conditionally use this
capability in `parseExpression`.

We'd want support for constant numbers? Try `parse ( str -- n f )` on the token
and if it succeeds, conditionally use the `addri,` variant using that number
instead of `addr[],`.

We want to support `-`? Instead of expecting a "+" token, add a condition that
maps `+` to `add` and `-` to `sub`. This of course requires an improved
assembler.

... and so on. But these improvements and the ones that follow are deep subjects
and need separate story arcs.

## Conclusion

We've reached the end of this story arc by answering the initial question, that
is, how can this simple piece of C code be compiled and ran? We've also managed
to generate code that's faster than GCC's or clang's because we're freed from
UNIX's wasteful calling conventions. We also manage to answer the "running"
part of the question by using a system that is so much simpler than UNIX that
it becomes "just call the address!".

Of course, the challenge of keeping up with modern C compilers in terms of
performance quickly becomes much harder as the compiled code becomes larger,
but isn't it refreshing to see that, at the heart of it, the task of compiling
C code in a sound manner isn't impossibly complex?

I hope that this pilot story arc helped you to demystify the subjects of low
level computing, assembly and compilation and motivated you to dig these
subjects further, which we'll do in further story arcs.

[^1]: Some compilers, most even, have separate steps for parsing and code
generation, using an Abstract Syntax Tree representation of the code in between.
We don't (DuskCC doesn't either). We generate code directly as we parse. It
comes with a few drawbacks, but results in a much simpler code.

[srctgz]: https://tumbleforth.hardcoded.net/01-duskcc/11-eye.tar.gz
[prev]: 10-beast.html
[str]: https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/lib/str.txt

A 01-duskcc/11-eye/foo.c => 01-duskcc/11-eye/foo.c +3 -0
@@ 0,0 1,3 @@
int foo(int a, int b) {
    return a + b;
}

A 01-duskcc/11-eye/myasm.fs => 01-duskcc/11-eye/myasm.fs +12 -0
@@ 0,0 1,12 @@
0 const eax
1 const ecx
2 const edx
3 const ebx
4 const esp
5 const ebp
6 const esi
7 const edi

: ret, $c3 c, ;
: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ;
: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ;

A 01-duskcc/11-eye/mycc.fs => 01-duskcc/11-eye/mycc.fs +25 -0
@@ 0,0 1,25 @@
?f<< /lib/str.fs
?f<< myasm.fs
?f<< tok.fs
$100 const ARGSLEN
create args ARGSLEN allot
: err abort" CC error" ;
: assert ( f -- ) not if err then ;
: expect ( tok str -- ) s= not if err then ;
: parseType ( -- ) nextt S" int" expect ;
: parseName ( -- ) sysdict nextt entry ;
: argscnt ( -- cnt ) 0 args begin dup c@ while swap 1+ swap s) repeat drop ;
: parseArgs ( -- )
  args nextt S" (" expect begin ( a )
    parseType nextt over strmove s) nextt dup S" ," s= while ( a tok )
    drop repeat ( a tok )
  S" )" expect 0 swap c! ;
: parseExpression ( -- )
  nextt args sfind nextt S" +" expect nextt args sfind ( idx1 idx2 )
  ?swap 1 = assert 0 = assert ( )
  eax esi addr[], ;
: parseStatement ( -- )
  nextt S" return" expect parseExpression nextt S" ;" expect
  esi argscnt 1- 4 * addri, ret, ;
: parseBody nextt S" {" expect parseStatement nextt S" }" expect ;
: cc parseType parseName parseArgs parseBody ;

A 01-duskcc/11-eye/tok.fs => 01-duskcc/11-eye/tok.fs +17 -0
@@ 0,0 1,17 @@
: isWS? ( c -- f ) SPC <= ;
create symbols ," +(){};,"
: isSym? ( c -- f ) symbols 7 [c]? 0>= ;
: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;

$40 const MAXTOKSZ \ maximum size for tokens
create buf MAXTOKSZ 1+ allot

: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
: nextt ( -- str-or-0 )
  buf 1+ tonws ( a c )
  dup isSym? if swap c! 1 else begin ( a c )
      dup boundary? not while ( a c )
      swap c!+ file :getc repeat ( a c )
    file :putback ( a ) buf - 1- then ( length )
  dup buf c! ( length ) if buf syspad :s, else 0 then ;
: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;

M Makefile => Makefile +2 -1
@@ 9,7 9,8 @@ ARTICLES_WITH_TGZ = \
	01-duskcc/07-babywalk \
	01-duskcc/08-immediate \
	01-duskcc/09-dusktillc \
	01-duskcc/10-beast
	01-duskcc/10-beast \
	01-duskcc/11-eye
ARTICLES = \
	$(ARTICLES_WITH_TGZ)


M index.html => index.html +2 -2
@@ 51,7 51,7 @@ before you receive the bundle.</p>

<h2>Story arcs</h2>

<h3>Buckle up, Dorothy (in progress)</h3>
<h3>Buckle up, Dorothy</h3>

<p>
In my <a href="01-duskcc/01-buckleup.html">“pilot” story arc</a>, we peek in


@@ 81,7 81,7 @@ do that, so I want to keep it openly accessible.
    <li><a href="01-duskcc/08-immediate.html">The Unbearable Immediateness of Compiling</a></li>
    <li>From Dusk Till C</li>
    <li>Feeding the beast</li>
    <li>In the eye of the compiler (not written yet)</li>
    <li>In the Eye of the Compiler</li>
</ol>

<script async