Make small style fix
Tweak comments
Add a "runtime" section to README
starforth is a stack-based, Forth-like programming language, with a self-hosting compiler. The project is simply an exercise in bootstrapping a very simple compiler out of a very small bootstrap binary, and maybe growing from there. I chose Forth as the base language because it's one of the simplest programming languages imaginable.
Starforth has only a passive resemblance to Forth. It does not have an interpreter, not an interactive environment, and no dictionary in the sense Forths have. Latest versions do have immediates but they are more limited than usual Forth immediates.
Starforth does not produce threaded code, but actual machine code, unlike many other Forths.
Another small difference is the truth value. Starforth represents true with 1, unlike many other Forths that use -1. This was chosen because x86 instruction set makes using 1 as the true value slightly simpler.
Starforth uses many of the same terminology and traditions as "real" Forth languages. Among them is the term "word" being used for what most other languages might call a "function".
Starforth has two stacks: the parameter stack and the return stack. Whenever we mention "the stack", the parameter stack is meant. Return stack is normally used to keep the return address after a word is called, but it can be used for other purposes too. The values read from/written to the stacks are always full cells (4 bytes).
In the following description, many built-ins are explained by a stack notation.
For example ( n1 n2 -- n1+n2 )
means that at run-time two integer values are
popped from the stack and another is pushed (which is the result of adding the
two values). In the stack notation, n
means an integer (or a signed integer if
it matters). a
means an address. u
means an explicitly unsigned integer. b
means a boolean. c
means a character. Characters are still cell sized on the
stack, but only their least significant byte is used.
We might also use the terms "tos" and "nos". "tos" refers to the item on top of the stack, while "nos" is the item directly behind it.
Starforth has two kinds of comments: Comments starting with the \
token expand
until the end of the line. Comments starting with the (
token expand until a
)
character is encountered.
When the compiler reads a token it first tries to find a definition by that name. If one is found, its compile-time semantics are performed. This is usually to generate code that calls the word at run-time. For immediate words however, the definition is actually executed at compile-time. Constants emit code to push a value at run-time, while variables emit code that pushes an address at run-time.
If no user-defined word is found, we then try the compiler built-ins. These will be compiled to machine code.
If the token is not a user-defined word or a built-in, the compiler then checks
if it's a character literal. These are in the form 'x'
where x is any ascii
character. The emitted code for these is the same as using the ascii code
directly. For example, 'A'
is the same as 65
.
Finally if any of the above does not hold, the compiler tries interpreting the
token as a number. This could start with a sign character, followed by a number
of digits, and can optionally end with h
. If the number ends with h
, it is
interpreted as a hexadecimal number, otherwise as a decimal number.
After bootstrapping from a very limited number of words, the current version of Starforth has an expanded set of built-ins that are directly compiled to machine code.
:
Start a word definition. The token after :
is the name of the word.;
End a word definitionexit
Return from the current definition.immediate
Mark current word as an immediate. Must be used right after word
name, like in: : foo immediate ... ;
. The word body is compiled to a memory
buffer. Whenever the word is encountered during compilation, the body is run
in the compiler's context. Immediates have access to a number of compile-time
built-ins that are not available to non-immediate words. These all begin with
a c/
prefix.constant
Mark current word as a constant. Must be used right after word
name, like in: : foo constant 2 2 + ;
. The word body is compiled, and run,
and the single value it puts on the stack is inserted as a literal whenever
the word is used.variable
Mark current word as a variable. Must be used right after word
name, like in: : foo variable cell ;
The word body is compiled, and run,
and the value it pushes on the stack is taken as the number of bytes to be
allocated for the variable."
Mark current word as a string. Must be used right after word name.
Anything that comes after "
and a single whitespace is taken to be part of
the string until another "
is encountered. A semicolon can also follow
though not required. As an example, the definition: : foobar " foo bar" ;
creates a string named foobar
with the value foo bar
. Strings in
Starforth are length-prefixed. The first cell in the memory space allocated
for the string is the length, which is followed by the number of bytes it
specifies.cells ( n -- n*4 )
multiply tos by cell size (4).ps@ ( -- a )
Push the current value of the parameter stack pointerps! ( a -- )
Write a new address to the parameter stack pointerrs@ ( -- a )
Push the current value of the return stack pointerrs! ( a -- )
Write a new address to the parameter stack pointer@ ( a -- n )
Read the value of a cell (4 bytes) from memory! ( n a -- )
Write a value to a cell in memoryc@ ( a -- c )
Read the value of a single byte from memoryc! ( c a -- )
Write a value to a byte in memory. The least significant byte
of the value read from stack is used.and ( n1 n2 -- n1&n2 )
Bitwise AND two integersor ( n1 n2 -- n1|n2 )
Bitwise OR two integersxor ( n1 n2 -- n1^n2 )
Bitwise XOR two integersinvert ( n -- !n )
Bitwise NOT the value on the stack+ ( n1 n2 -- n1+n2 )
Add two integers- ( n1 n2 -- n1-n2 )
Subtract two integers* ( n1 n2 -- n1*n2 )
Multiply two integers. notice that if the results do
not fit in a cell, only the least significant cell is kept./mod ( n1 n2 -- n1%n2 n1/n2 )
Divide two signed integers, leaving remainder
(nos) and quotient (tos) on the stack./ ( n1 n2 -- n1/n2 )
Divide two signed integers, leaving the quotient on the stack.mod ( n1 n2 -- n1%n2 )
Divide two signed integers, leaving the remainder on the stack.u/mod ( u1 u2 -- u1%u2 u1/u2 )
Divide two unsigned integers, leaving
remainder (nos) and quotient (tos) on the stack.u/ ( u1 u2 -- u1/u2 )
Divide two unsigned integers, leaving the quotient
on the stack.umod ( u1 u2 -- u1%u2 )
Divide two unsigned integers, leaving the remainder on the stack.negate ( n -- -n )
Negate tos1+ ( n -- n+1 )
Increment tos1- ( n -- n-1 )
Decrement toscell+ ( n -- n+4 )
Add cell size to toscell- ( n -- n-4 )
Subtract cell size from toslsr ( u1 u2 -- u1>>u2 )
Perform logical shift to the right. tos is the
number of bits to shift, while nos is the number to be shifted.lsl ( u1 u2 -- u1<<u2 )
Perform logical shift to the left. tos is the
number of bits to shift, while nos is the number to be shifted.asr ( n1 n2 -- n1>>n2 )
Perform arithmetic shift to the right. tos is the
number of bits to shift, while nos is the number to be shifted.asl ( n1 n2 -- n1>>n2 )
The same as lsl
. This was simply added for symmetry!mem ( -- a )
Push the pointer to the start of user memory. This is a block
of memory (currently 32K) that is allocated on startup and can be used for
any purposes. User variables take up part of this block. The address returned
by mem
points to the portion of the block after user variables.?exit ( b -- )
Return from current word if the top of stack is true
(non-zero)0= ( n -- b )
Replace tos with true if it is zero, otherwise with false.0< ( n -- b )
Replace tos with true if it is negative, otherwise with
false.0> ( n -- b )
Replace tos with true if it is positive, otherwise with false.0<= ( n -- b )
Replace tos with true if it is zero or negative, otherwise
with false.0>= ( n -- b )
Replace tos with true if it is zero or positive, otherwise
with false.0<> ( n -- b )
Replace tos with true if it is not zero, otherwise with
false. This effectively converts a number to a canonical boolean.= ( n1 n2 -- b )
Consume and compare tos and nos, push true if they are
equal, false otherwise.< ( n1 n2 -- b )
Consume and compare tos and nos, push true if nos is
smaller than tos, false otherwise.> ( n1 n2 -- b )
Consume and compare tos and nos, push true if nos is
greater than tos, false otherwise.<= ( n1 n2 -- b )
Consume and compare tos and nos, push true if nos is less
than or equal to tos, false otherwise.>= ( n1 n2 -- b )
Consume and compare tos and nos, push true if nos is
greater than or equal to, false otherwise.<> ( n1 n2 -- b )
Consume and compare tos and nos, push true if they are
not equal, false otherwise.u< ( u1 u2 -- b )
Like <
but interprets nos and tos as unsigned numbers.u> ( u1 u2 -- b )
Like >
but interprets nos and tos as unsigned numbers.u<= ( u1 u2 -- b )
Like <=
but interprets nos and tos as unsigned
numbers.>= ( u1 u2 -- b )
Like >=
but interprets nos and tos as unsigned numbers.>r ( u1 u2 -- b )
Move a value from the parameter stack to the return
stack.r> ( u1 u2 -- b )
Move a value from the return stack to the parameter
stack.dup ( n -- n n )
Duplicate tosswap ( n1 n2 -- n2 n1 )
Swap tos and nosdrop ( n -- )
Discard tosover ( n1 n2 -- n1 n2 n1 )
Push a copy of nos on top of the stacktuck ( n1 n2 -- n2 n1 n2 )
Place a copy of tos under nos.rot ( n1 n2 n3 -- n2 n3 n1 )
Rotate top three items on the stack-rot ( n1 n2 n3 -- n3 n1 n2 )
Rotate top three items on the stack
backwards2dup ( n1 n2 -- n1 n2 n1 n2 )
Duplicate top two items on the stack?exit ( b -- )
Read tos. If true, exit current word.mem= ( a1 a2 n -- )
Compare the two memory regions of size n
at a1
and
a2
.memcpy ( a-dst a-src n -- )
Copy n
bytes from the address a-src
to
address a-dst
.syscall0
-syscall6
Perform a Linux system call with 0 to 6 arguments. tos
is system call number, behind that come the system call arguments, so nos
will be the first argument, the item behind it the second argument, and so
forth. As an example, in order to exit the program using the "exit" system
call (number 1 in i386 ABI), with exit code 42, you can use 42 1 syscall1
.tail
if used before a non-built-in word, causes the compiler to emit a
jmp
instruction instead of a call
. This can be used to eliminate tail
calls and allow functions to recurs without exhausting the stack.breakpoint
Emit a breakpoint instruction (int 3
).['] ( -- a )
Read the next word name following [']
at compile-time and
emit code to push its address. The word can later be executed using this
address by the execute
built-in.execute ( a -- )
Read the address on the stack and call it.The following built-ins are only available to immediates. Notice that "the stack" here means the compiler's parameter stack, not the run-time stack.
c/lit
: Read a value from the stack and emit code that pushes that value to
the stack.c/postpone
: Emit code for the token after c/postpone
. For example, the
following immediate emits code that adds 20 to the number on the stack: : 20+ immediate 20 c/lit c/postpone + ;
c/savepoint
: push two values on the stack denoting the current point in the
output code buffer. This pair of values can later be used to emit a backwards
jump to this location.c/emit-jmp
: Read two values from the stack denoting a point in the code, as
generated by c/savepoint
, and emit an unconditional jump instruction to
that point.c/emit-jc
: Similar to c/emit-jmp
but emits a conditional jump that reads
a value from the stack at run-time and makes the jump if it is true.c/emit-backfilled-jmp
: Emits a dummy unconditional jump to an unknown
location. Puts two values on the stack that can be used find and "backfill"
the jump target later.c/emit-backfilled-jc
: Similar to c/emit-backfilled-jmp
, but emits a
conditional jump.c/backfill
: Reads two values from the stack denoting a dummy jump emitted
by c/emit-backfilled-jmp
or c/emit-backfilled-jc
, and "backfills" the
jump to point to the current location in the code buffer.c/>c
: Moves a value from "the stack" to the "control stack". The control
stack is simply an extra stack available at compile time that can be used to
store values. It's mainly used for storing values between two immediates.
Since the immediates run in the context of the compiler, it is usually not
possible to leave values on the normal stack and expect them to be there in
another immediate. The control stack is not used by the compiler itself, so
it can be used for whatever purpose you see fit.c/c>
: Moves a value from the control stack to the parameter stack.Every program should have a word named main
. This will be the entry point of
the program. On startup, some initialization code is run first and then main
is called. If main
returns, the program would exit with code 0.
The pre-main initialization code consists of the following:
Allocate a block of memory. The first part of this block is used by user
variables, while the rest are free to be used by the program in whatever way
desired. The register "edi" points to the point in the block that the
variables end. This is the value that the mem
built-in returns. Variables
yield addresses with a negative index on edi, for example if the first
variable declared is 4 bytes long, its address is edi-4
. If there's another
variable of the same size after that, it is at edi-8
.
Allocate a block of memory for the return stack. This is initially stored in
the ebp
register. The compiler keepts track of the stack in use and emits
xchg ebp, esp
instruction where required. For example, when we enter
main
, ebp
points to the return stack and esp
points to the parameter
stack. If the first word is a literal, everything is fine and we won't need
an xchg
instruction. If after that we need to call another word however, we
emit an xchg ebp, esp
and then a call
instruction. Incidentally, this is
the reason built-ins like c/savepoint
return two values to describe a point
in the code buffer. One of the two values is the address itself, and the
other is the current stack at that location.
Apart from this setup procedure and the code that invokes the "exit" system call at the end, there is no run-time environment. The program is responsible for everything from the moment "main" is entered until it terminates or returns.
The target environment is Linux, and the produced code is 32-bit (although it
runs fine in a 64-bit Linux too). The binaries produced by the Starforth
compiler are in ELF format. They are statically built, so no shared objects are
required, not even libc
.
If compilation is successful, the compiles exits with a 0 exit code. The non-zero exit codes each mean an error has happened:
main
wordExit code 5 used to mean error seeking in the output file, but we don't perform seek anymore, so the exit code is no longer in use.
The compiler can be bootstrapped from a small Assembly language compiler. This
bootstrapper is only available in the v1
branch in a file named bootstrap.s
which can be assembled using nasm. The bootstrap compiler in binary form is only
around two kilobytes.
After obtaining the bootstrap binary, you can compile the compiler by feeding it the starforth.f file (via stdin). It would produce a file named a.out which is our "stage 0" compiler. This stage 0 compiler can compile its own source code, resulting in a "stage 1" binary which is (at v1.0 tag at least) exactly the same as the one produced by the bootstrap compiler.
You can then compile the compiler source yet again to get the stage 2 compiler, which is again exactly the same as the other stages (at v1.0). The stage 2 compiler is neither created by the bootstrap compiler not created by code generated from the bootstrap compiler. You can now throw away the assembly language code! :)
There is a verify.sh
script located at the tools
sub-directory that can
bootstrap any of the later versions of the compiler by going back to version
one, creating the bootstrap compiler and then compiling each of the later
versions one by one, each using the previous version. At each step the test
suite is run to make sure the compiler works fine. The stage 1 and stage 2
compilers are also compared at each step, making sure that they are exactly the
same.
The verify.sh
script creates a sub-directory named out
which is used for
building the previous versions. The produced binaries are placed in this
sub-directory too, named sf-v1
, sf-v2
, sf-v3
, etc.
In order to help with the development and testing of the compiler, I have also written a reference compiler in Python (refc.py) and a reference interpreter in C (refi.c). Both of these are capable of bootstrapping the compiler.
The reference interpreter (refi) is also capable of single stepping though the code and can be useful for debugging.
Notice that these tools are only available in the v1 branch, since they are not capable of compiling any later versions of the language.
Running make
builds the current version using the previous version (which is
expected to be in the out
sub-directory as created by the verify.sh
script).
The produced binary is named sf
.
When you have the sf
binary, you can run test.sh
to run the test suite.
The starforth compiler does not produce debug info. In order to help with
debugging, a GDB extension is included with the compiler, named gdbscript.py
.
If you run gdb
from the compiler source directory, the included .gdbinit
file will attempt to load the gdb extension. This will by default fail in latest
versions of gdb, due to security risks. Follow the instructions in .gdbinit
comments to enable loading the extension.
The gdb extension uses the compiler logs as a kind of poor man's debug info.
These should be available in a file with the same name as the binary you're
debugging, but with .dbginfo
suffix added. So if you have a file named foo.f
you can compile it by:
./sf <foo.f 2>a.out.dbginfo
This will produce a file named a.out
which is the executable, and a file named
a.out.dbginfo
containing compiler logs/debug info. Now you can run gdb by
running gdb a.out
.
After launching gdb, assuming the gdb extension is loaded, you can start
debugging by running the sf-start
command. You must use this command before
any other commands. This will load the debug info. If debug info is not found,
you will see an error, otherwise the code is run until the beginning of the
main
word. You can then use any of the following commands:
sf-step
: Similar to gdb's step
command but for starforth words.sf-next
: Similar to gdb's next
command but for starforth words.sf-vars
: If no argument is passed, the list of program variables is
printed. If a variable name is passed, the value of the variable is printed.sf-status
: Prints the values on the two stacks. This is automatically run
after sf-step
and sf-next
.sf-bt
: Prints the current backtrace. Each line is in the form [n] === {current-definition} === {current-word}
, where n
is the frame number,
current-definition
is the name of the word definition we're currently
inside, while current-word
is the word we're about the run in the next
step.The following commands only work properly when debugging the compiler's own binaries:
sf-cvars
: Print compiler variables with their values.sf-print-dict
: Print the contents of compiler's dictionary.The full source code is provided under the GPL v3 license. A copy of the license text is included in the repository.