~elektito/starforth

Native code, self-hosting compiler for a stack-based, Forth-like programming language.
Make small style fix
Add a "runtime" section to README

clone

read-only
https://git.sr.ht/~elektito/starforth
read/write
git@git.sr.ht:~elektito/starforth

You can also use your local clone with git send-email.

#starforth

starforth is a stack-based, Forth-like programming language, with a self-hosting compiler. The project is simply an exercise in bootstrapping a very simple compiler out of a very small bootstrap binary, and maybe growing from there. I chose Forth as the base language because it's one of the simplest programming languages imaginable.

#Forth-ness

Starforth has only a passive resemblance to Forth. It does not have an interpreter, not an interactive environment, and no dictionary in the sense Forths have. Latest versions do have immediates but they are more limited than usual Forth immediates.

Starforth does not produce threaded code, but actual machine code, unlike many other Forths.

Another small difference is the truth value. Starforth represents true with 1, unlike many other Forths that use -1. This was chosen because x86 instruction set makes using 1 as the true value slightly simpler.

#The Language

Starforth uses many of the same terminology and traditions as "real" Forth languages. Among them is the term "word" being used for what most other languages might call a "function".

Starforth has two stacks: the parameter stack and the return stack. Whenever we mention "the stack", the parameter stack is meant. Return stack is normally used to keep the return address after a word is called, but it can be used for other purposes too. The values read from/written to the stacks are always full cells (4 bytes).

In the following description, many built-ins are explained by a stack notation. For example ( n1 n2 -- n1+n2 ) means that at run-time two integer values are popped from the stack and another is pushed (which is the result of adding the two values). In the stack notation, n means an integer (or a signed integer if it matters). a means an address. u means an explicitly unsigned integer. b means a boolean. c means a character. Characters are still cell sized on the stack, but only their least significant byte is used.

We might also use the terms "tos" and "nos". "tos" refers to the item on top of the stack, while "nos" is the item directly behind it.

Starforth has two kinds of comments: Comments starting with the \ token expand until the end of the line. Comments starting with the ( token expand until a ) character is encountered.

When the compiler reads a token it first tries to find a definition by that name. If one is found, its compile-time semantics are performed. This is usually to generate code that calls the word at run-time. For immediate words however, the definition is actually executed at compile-time. Constants emit code to push a value at run-time, while variables emit code that pushes an address at run-time.

If no user-defined word is found, we then try the compiler built-ins. These will be compiled to machine code.

If the token is not a user-defined word or a built-in, the compiler then checks if it's a character literal. These are in the form 'x' where x is any ascii character. The emitted code for these is the same as using the ascii code directly. For example, 'A' is the same as 65.

Finally if any of the above does not hold, the compiler tries interpreting the token as a number. This could start with a sign character, followed by a number of digits, and can optionally end with h. If the number ends with h, it is interpreted as a hexadecimal number, otherwise as a decimal number.

After bootstrapping from a very limited number of words, the current version of Starforth has an expanded set of built-ins that are directly compiled to machine code.

#Built-ins

  • : Start a word definition. The token after : is the name of the word.
  • ; End a word definition
  • exit Return from the current definition.
  • immediate Mark current word as an immediate. Must be used right after word name, like in: : foo immediate ... ;. The word body is compiled to a memory buffer. Whenever the word is encountered during compilation, the body is run in the compiler's context. Immediates have access to a number of compile-time built-ins that are not available to non-immediate words. These all begin with a c/ prefix.
  • constant Mark current word as a constant. Must be used right after word name, like in: : foo constant 2 2 + ;. The word body is compiled, and run, and the single value it puts on the stack is inserted as a literal whenever the word is used.
  • variable Mark current word as a variable. Must be used right after word name, like in: : foo variable cell ; The word body is compiled, and run, and the value it pushes on the stack is taken as the number of bytes to be allocated for the variable.
  • " Mark current word as a string. Must be used right after word name. Anything that comes after " and a single whitespace is taken to be part of the string until another " is encountered. A semicolon can also follow though not required. As an example, the definition: : foobar " foo bar" ; creates a string named foobar with the value foo bar. Strings in Starforth are length-prefixed. The first cell in the memory space allocated for the string is the length, which is followed by the number of bytes it specifies.
  • cells ( n -- n*4 ) multiply tos by cell size (4).
  • ps@ ( -- a ) Push the current value of the parameter stack pointer
  • ps! ( a -- ) Write a new address to the parameter stack pointer
  • rs@ ( -- a ) Push the current value of the return stack pointer
  • rs! ( a -- ) Write a new address to the parameter stack pointer
  • @ ( a -- n ) Read the value of a cell (4 bytes) from memory
  • ! ( n a -- ) Write a value to a cell in memory
  • c@ ( a -- c ) Read the value of a single byte from memory
  • c! ( c a -- ) Write a value to a byte in memory. The least significant byte of the value read from stack is used.
  • and ( n1 n2 -- n1&n2 ) Bitwise AND two integers
  • or ( n1 n2 -- n1|n2 ) Bitwise OR two integers
  • xor ( n1 n2 -- n1^n2 ) Bitwise XOR two integers
  • invert ( n -- !n ) Bitwise NOT the value on the stack
  • + ( n1 n2 -- n1+n2 ) Add two integers
  • - ( n1 n2 -- n1-n2 ) Subtract two integers
  • * ( n1 n2 -- n1*n2 ) Multiply two integers. notice that if the results do not fit in a cell, only the least significant cell is kept.
  • /mod ( n1 n2 -- n1%n2 n1/n2 ) Divide two signed integers, leaving remainder (nos) and quotient (tos) on the stack.
  • / ( n1 n2 -- n1/n2 ) Divide two signed integers, leaving the quotient on the stack.
  • mod ( n1 n2 -- n1%n2 ) Divide two signed integers, leaving the remainder on the stack.
  • u/mod ( u1 u2 -- u1%u2 u1/u2 ) Divide two unsigned integers, leaving remainder (nos) and quotient (tos) on the stack.
  • u/ ( u1 u2 -- u1/u2 ) Divide two unsigned integers, leaving the quotient on the stack.
  • umod ( u1 u2 -- u1%u2 ) Divide two unsigned integers, leaving the remainder on the stack.
  • negate ( n -- -n ) Negate tos
  • 1+ ( n -- n+1 ) Increment tos
  • 1- ( n -- n-1 ) Decrement tos
  • cell+ ( n -- n+4 ) Add cell size to tos
  • cell- ( n -- n-4 ) Subtract cell size from tos
  • lsr ( u1 u2 -- u1>>u2 ) Perform logical shift to the right. tos is the number of bits to shift, while nos is the number to be shifted.
  • lsl ( u1 u2 -- u1<<u2 ) Perform logical shift to the left. tos is the number of bits to shift, while nos is the number to be shifted.
  • asr ( n1 n2 -- n1>>n2 ) Perform arithmetic shift to the right. tos is the number of bits to shift, while nos is the number to be shifted.
  • asl ( n1 n2 -- n1>>n2 ) The same as lsl. This was simply added for symmetry!
  • mem ( -- a ) Push the pointer to the start of user memory. This is a block of memory (currently 32K) that is allocated on startup and can be used for any purposes. User variables take up part of this block. The address returned by mem points to the portion of the block after user variables.
  • ?exit ( b -- ) Return from current word if the top of stack is true (non-zero)
  • 0= ( n -- b ) Replace tos with true if it is zero, otherwise with false.
  • 0< ( n -- b ) Replace tos with true if it is negative, otherwise with false.
  • 0> ( n -- b ) Replace tos with true if it is positive, otherwise with false.
  • 0<= ( n -- b ) Replace tos with true if it is zero or negative, otherwise with false.
  • 0>= ( n -- b )Replace tos with true if it is zero or positive, otherwise with false.
  • 0<> ( n -- b )Replace tos with true if it is not zero, otherwise with false. This effectively converts a number to a canonical boolean.
  • = ( n1 n2 -- b ) Consume and compare tos and nos, push true if they are equal, false otherwise.
  • < ( n1 n2 -- b ) Consume and compare tos and nos, push true if nos is smaller than tos, false otherwise.
  • > ( n1 n2 -- b ) Consume and compare tos and nos, push true if nos is greater than tos, false otherwise.
  • <= ( n1 n2 -- b ) Consume and compare tos and nos, push true if nos is less than or equal to tos, false otherwise.
  • >= ( n1 n2 -- b ) Consume and compare tos and nos, push true if nos is greater than or equal to, false otherwise.
  • <> ( n1 n2 -- b ) Consume and compare tos and nos, push true if they are not equal, false otherwise.
  • u< ( u1 u2 -- b ) Like < but interprets nos and tos as unsigned numbers.
  • u> ( u1 u2 -- b ) Like > but interprets nos and tos as unsigned numbers.
  • u<= ( u1 u2 -- b ) Like <= but interprets nos and tos as unsigned numbers.
  • >= ( u1 u2 -- b ) Like >= but interprets nos and tos as unsigned numbers.
  • >r ( u1 u2 -- b ) Move a value from the parameter stack to the return stack.
  • r> ( u1 u2 -- b ) Move a value from the return stack to the parameter stack.
  • dup ( n -- n n ) Duplicate tos
  • swap ( n1 n2 -- n2 n1 ) Swap tos and nos
  • drop ( n -- ) Discard tos
  • over ( n1 n2 -- n1 n2 n1 ) Push a copy of nos on top of the stack
  • tuck ( n1 n2 -- n2 n1 n2 ) Place a copy of tos under nos.
  • rot ( n1 n2 n3 -- n2 n3 n1 ) Rotate top three items on the stack
  • -rot ( n1 n2 n3 -- n3 n1 n2 ) Rotate top three items on the stack backwards
  • 2dup ( n1 n2 -- n1 n2 n1 n2 ) Duplicate top two items on the stack
  • ?exit ( b -- ) Read tos. If true, exit current word.
  • mem= ( a1 a2 n -- ) Compare the two memory regions of size n at a1 and a2.
  • memcpy ( a-dst a-src n -- ) Copy n bytes from the address a-src to address a-dst.
  • syscall0-syscall6 Perform a Linux system call with 0 to 6 arguments. tos is system call number, behind that come the system call arguments, so nos will be the first argument, the item behind it the second argument, and so forth. As an example, in order to exit the program using the "exit" system call (number 1 in i386 ABI), with exit code 42, you can use 42 1 syscall1.
  • tail if used before a non-built-in word, causes the compiler to emit a jmp instruction instead of a call. This can be used to eliminate tail calls and allow functions to recurs without exhausting the stack.
  • breakpoint Emit a breakpoint instruction (int 3).
  • ['] ( -- a ) Read the next word name following ['] at compile-time and emit code to push its address. The word can later be executed using this address by the execute built-in.
  • execute ( a -- ) Read the address on the stack and call it.

The following built-ins are only available to immediates. Notice that "the stack" here means the compiler's parameter stack, not the run-time stack.

  • c/lit: Read a value from the stack and emit code that pushes that value to the stack.
  • c/postpone: Emit code for the token after c/postpone. For example, the following immediate emits code that adds 20 to the number on the stack: : 20+ immediate 20 c/lit c/postpone + ;
  • c/savepoint: push two values on the stack denoting the current point in the output code buffer. This pair of values can later be used to emit a backwards jump to this location.
  • c/emit-jmp: Read two values from the stack denoting a point in the code, as generated by c/savepoint, and emit an unconditional jump instruction to that point.
  • c/emit-jc: Similar to c/emit-jmp but emits a conditional jump that reads a value from the stack at run-time and makes the jump if it is true.
  • c/emit-backfilled-jmp: Emits a dummy unconditional jump to an unknown location. Puts two values on the stack that can be used find and "backfill" the jump target later.
  • c/emit-backfilled-jc: Similar to c/emit-backfilled-jmp, but emits a conditional jump.
  • c/backfill: Reads two values from the stack denoting a dummy jump emitted by c/emit-backfilled-jmp or c/emit-backfilled-jc, and "backfills" the jump to point to the current location in the code buffer.
  • c/>c: Moves a value from "the stack" to the "control stack". The control stack is simply an extra stack available at compile time that can be used to store values. It's mainly used for storing values between two immediates. Since the immediates run in the context of the compiler, it is usually not possible to leave values on the normal stack and expect them to be there in another immediate. The control stack is not used by the compiler itself, so it can be used for whatever purpose you see fit.
  • c/c>: Moves a value from the control stack to the parameter stack.

#Runtime

Every program should have a word named main. This will be the entry point of the program. On startup, some initialization code is run first and then main is called. If main returns, the program would exit with code 0.

The pre-main initialization code consists of the following:

  • Allocate a block of memory. The first part of this block is used by user variables, while the rest are free to be used by the program in whatever way desired. The register "edi" points to the point in the block that the variables end. This is the value that the mem built-in returns. Variables yield addresses with a negative index on edi, for example if the first variable declared is 4 bytes long, its address is edi-4. If there's another variable of the same size after that, it is at edi-8.

  • Allocate a block of memory for the return stack. This is initially stored in the ebp register. The compiler keepts track of the stack in use and emits xchg ebp, esp instruction where required. For example, when we enter main, ebp points to the return stack and esp points to the parameter stack. If the first word is a literal, everything is fine and we won't need an xchg instruction. If after that we need to call another word however, we emit an xchg ebp, esp and then a call instruction. Incidentally, this is the reason built-ins like c/savepoint return two values to describe a point in the code buffer. One of the two values is the address itself, and the other is the current stack at that location.

Apart from this setup procedure and the code that invokes the "exit" system call at the end, there is no run-time environment. The program is responsible for everything from the moment "main" is entered until it terminates or returns.

#Target Environment

The target environment is Linux, and the produced code is 32-bit (although it runs fine in a 64-bit Linux too). The binaries produced by the Starforth compiler are in ELF format. They are statically built, so no shared objects are required, not even libc.

#Compiler Exit Codes

If compilation is successful, the compiles exits with a 0 exit code. The non-zero exit codes each mean an error has happened:

  • 1: error writing to output file
  • 2: unknown word
  • 3: error reading from stdin
  • 4: error creating output file
  • 6: error allocating memory
  • 7: program does not contain a main word

Exit code 5 used to mean error seeking in the output file, but we don't perform seek anymore, so the exit code is no longer in use.

#Bootstrapping

The compiler can be bootstrapped from a small Assembly language compiler. This bootstrapper is only available in the v1 branch in a file named bootstrap.s which can be assembled using nasm. The bootstrap compiler in binary form is only around two kilobytes.

After obtaining the bootstrap binary, you can compile the compiler by feeding it the starforth.f file (via stdin). It would produce a file named a.out which is our "stage 0" compiler. This stage 0 compiler can compile its own source code, resulting in a "stage 1" binary which is (at v1.0 tag at least) exactly the same as the one produced by the bootstrap compiler.

You can then compile the compiler source yet again to get the stage 2 compiler, which is again exactly the same as the other stages (at v1.0). The stage 2 compiler is neither created by the bootstrap compiler not created by code generated from the bootstrap compiler. You can now throw away the assembly language code! :)

There is a verify.sh script located at the tools sub-directory that can bootstrap any of the later versions of the compiler by going back to version one, creating the bootstrap compiler and then compiling each of the later versions one by one, each using the previous version. At each step the test suite is run to make sure the compiler works fine. The stage 1 and stage 2 compilers are also compared at each step, making sure that they are exactly the same.

The verify.sh script creates a sub-directory named out which is used for building the previous versions. The produced binaries are placed in this sub-directory too, named sf-v1, sf-v2, sf-v3, etc.

#Reference Compiler and Interpreter

In order to help with the development and testing of the compiler, I have also written a reference compiler in Python (refc.py) and a reference interpreter in C (refi.c). Both of these are capable of bootstrapping the compiler.

The reference interpreter (refi) is also capable of single stepping though the code and can be useful for debugging.

Notice that these tools are only available in the v1 branch, since they are not capable of compiling any later versions of the language.

#Building

Running make builds the current version using the previous version (which is expected to be in the out sub-directory as created by the verify.sh script). The produced binary is named sf.

#Test Suite

When you have the sf binary, you can run test.sh to run the test suite.

#Debugging

The starforth compiler does not produce debug info. In order to help with debugging, a GDB extension is included with the compiler, named gdbscript.py. If you run gdb from the compiler source directory, the included .gdbinit file will attempt to load the gdb extension. This will by default fail in latest versions of gdb, due to security risks. Follow the instructions in .gdbinit comments to enable loading the extension.

The gdb extension uses the compiler logs as a kind of poor man's debug info. These should be available in a file with the same name as the binary you're debugging, but with .dbginfo suffix added. So if you have a file named foo.f you can compile it by:

./sf <foo.f 2>a.out.dbginfo

This will produce a file named a.out which is the executable, and a file named a.out.dbginfo containing compiler logs/debug info. Now you can run gdb by running gdb a.out.

After launching gdb, assuming the gdb extension is loaded, you can start debugging by running the sf-start command. You must use this command before any other commands. This will load the debug info. If debug info is not found, you will see an error, otherwise the code is run until the beginning of the main word. You can then use any of the following commands:

  • sf-step: Similar to gdb's step command but for starforth words.
  • sf-next: Similar to gdb's next command but for starforth words.
  • sf-vars: If no argument is passed, the list of program variables is printed. If a variable name is passed, the value of the variable is printed.
  • sf-status: Prints the values on the two stacks. This is automatically run after sf-step and sf-next.
  • sf-bt: Prints the current backtrace. Each line is in the form [n] === {current-definition} === {current-word}, where n is the frame number, current-definition is the name of the word definition we're currently inside, while current-word is the word we're about the run in the next step.

The following commands only work properly when debugging the compiler's own binaries:

  • sf-cvars: Print compiler variables with their values.
  • sf-print-dict: Print the contents of compiler's dictionary.

#License

The full source code is provided under the GPL v3 license. A copy of the license text is included in the repository.