Bringing up new hardware systems from scratch, by which I mean systems with little or no pre-existing development tools of any kind, can genuinely be truly daunting. This reinforces barrier-to-entry, and makes it difficult for people to explore opportunities to evolve designs in new and interesting ways going forward. Shoehorn is a cross-compiler intended to reduce the strength of these barriers-to-entry. By providing a cross-compiler that can be customized for a particular processor's instruction set architecture or assembler syntax, by a single developer in only a handful of days (instead of months to years), it should be significantly easier to bring up a new computer environment completely from scratch.
When I started working on the Kestrel-1 prototype back in 2004, I built it using a series of Radio Shack solderless breadboards, a W65C816 processor running at 4MHz, and a whopping 32KB of RAM mapped into the upper half of bank 0, with I/O occupying the lower-half. The I/O consisted of a single W65C22 VIA chip. System software could be loaded into it via a specialized "IPL circuit", which allowed a PC to boot the system via its parallel port. The idea was to use this system as a springboard for more refined, more complete homebrew computer systems. One of my first ideas was to create a kind of 65816-based Jupiter ACE: when you first turned the machine on, the idea was to drop you into a Forth environment. For various reasons, I never got the chance to fulfill my initial vision.
Fast forward to 2021, and I have come to the realization that I only have several more years until its 20th anniversary. It would sure be nice to be able to complete the computer design by then, as a kind of treat to myself and to others who'd love to play with it too. However, it's hard enough to bring up new hardware from scratch; it would be great if creating the first system software image for it could contribute as little as possible to this challenge.
Towards that end, I have created a retargetable Forth cross-compiler called Shoehorn. It accepts as input a Forth program listing, and produces as output an assembly language program listing suitable for consumption by a more conventional assembler.
The version of Shoehorn described in this document was designed to run under GForth 0.7.0 64-bit on a Linux system. While I believe the software is portable to other Forth environments, it has not yet been tested in these systems.
This project initially started out as an experiment named
It stood for subroutine-threaded code test.
As the project evolved to a maturity I was happy with,
I decided to promote the experiment into its own project.
That project needed a unique name that hinted at its intended function.
This project is explicitly a bootstrap compiler, and explicitly not a general purpose compiler. Therefore, the name needed to be related, in some way, to the word bootstrap. Two options came to mind: bootlace and shoehorn. Bootlace works well enough; however, lacing implies linking things together. This compiler doesn't link anything together, but instead is intended to help manufacture something which itself is linked. Shoehorn works better, I think; it is a tool which is used to help don a set of shoes or boots, especially before they are broken in. This is a closer match to the intended use of this project.
So, Shoehorn it is.
Writing a Forth environment from scratch is hard work. It is laborious and error-prone, thanks in large part to the large number of inter-linked data structures comprising the dictionary. You have one record describing each executable word, plus at least one record describing the vocabulary each word appears in (if supported), plus at least one record describing each thread or task (a.k.a. "user", hence the phrase "user variables"), plus at least one record describing each "marker" (a memory management concept introduced with ANS Forth), plus at least one record describing each context for exception handling (if supported), and so on. Some of these records are statically defined in the bootstrap image, others are managed exclusively at run-time, while others still are both referenced during the initial compile phase and during run-time. It's no wonder that people often compare Forth systems to full-blown operating systems; in fact, the first deployed Forth systems were operating systems as well.
And, if at a later time you decide you want to alter the layout of those structures, or the meaning of any existing fields, you just changed the ABI of the system as a whole. You typically must recompile/re-assemble your project from scratch, in the worst cases reusing precious little code in the process.
Ideally, I'd like to write the Forth compiler in Forth itself. This idea dates back to the earliest of Forth systems. Besides being something of a rite of passage, it is also something of a tradition to Forth implementors. It also makes automatically managing all those interconnected data structures a snap.
For the purposes of bootstrapping a new hardware design, for which we assume no established development toolchain beyond an assembler exists, this suggests creating a metacompiler. It would take in a source file containing Forth (or a subset thereof) code and it would produce a corresponding assembly language file.
While the idea of a retargetable cross-compiler is an old idea, historically, these tools often require extensive and time-consuming customization. The simplest Oberon compiler, for example, is documented as being maintainable by a single "motivated" student with several months of effort. Shoehorn manages to be retargetable with only days of effort.
The trick is careful selection of virtual machine primitives so that they are drop-in compatible with a more sophisticated binary interface. You will obviously take a performance hit; however, the goal when bootstrapping a new environment is correctness, not performance. Once the basic system has been debugged, optimization can happen later.
The back-end code generator and the resulting application binary interface mutually interact to influence each other. For explanatory purposes, I will first discuss the interface that all backend code generators must implement, followed by the application binary interface that I settled upon. I will assume we're generating software for the W65C816 microprocessor.
Shoehorn currently defines 11 procedures which must be customized to target a new assembler.
|emit:0branch||L --||Jump to label L if top of stack is zero.|
|emit:call||caddr u --||Invoke subroutine
|emit:close||--||Close assembler's symbol scope.|
|emit:entry||caddr u --||Declare a subroutine
|emit:exit||--||Return from current subroutine.|
|emit:jumpL||L --||Jump unconditionally to label L.|
|emit:label||L --||Declare label L.|
|emit:lit||n --||Push literal
|emit:open||--||Open assembler's symbol scope.|
|emit:str||caddr u --||Push literal string onto the data stack.|
|emit:brkpt||--||Break into a debugger.|
This word compiles code to consume the top of the data stack. If the value consumed is zero, branch to the specified label.
For the W65C816 processor, we emit the following code:
inx inx lda 2,x beq label
This word compiles a subroutine call to the specified Forth word entry label.
Standard prefixes, such as
entry_, are not supplied by the compiler.
Your target adaptation must provide these as appropriate.
Words which have illegal syntax (e.g.,
will already have been
mapped to their assembler-compatible representation
assuming the correct
xname: declarations have been issued ahead of time.
For example, if the following code appears in the source listing:
xname: drop2 2drop
tx: drop2 2drop drop drop ;
then any reference to the symbol
2drop inside a colon definition will emit the following output code:
This word closes a previously opened local naming scope. Some assemblers do not provide explicit syntax for this; however, for those which do, this gives the opportunity to statically demarcate when a scope ends.
xa65 assembler, this maps to the following declaration:
This word demarcates the beginning of a Forth colon (or similar) definition.
For the W65C816 backend,
I prepend the string
entry_ and declare a label.
For example, given the above definition for 2drop,
the following code would be generated:
This word compiles code that returns from the current Forth word. Since the backend builds code for a subroutine-threaded Forth environment, this expands to a simple subroutine return instruction.
This word emits code to unconditionally jump to the specified label number. If the number 123 was supplied as an argument, this expands simply to the following jump instruction.
Correspondingly, this word emits code to declare the existence of a numbered label.
This word emits code to push a single cell wide, numeric literal onto the data stack.
For the W65C816, I decided to use inline subroutine parameter passing to perform the push.
jsr i_literal .word 123
This optimizes for smaller code size. It is also possible to directly inline the push logic, like so:
dex dex lda #123 sta 4,x
The former approach consumes 5 bytes per literal pushed, while the latter takes 7 bytes. The inline push code is also significantly faster as well, saving 12 cycles at a minimum.
This word opens a new local naming scope. Some assemblers do not provide explicit syntax for this; however, for those which do, this gives the opportunity to statically demarcate when a scope begins.
xa65 assembler, this maps to the following declaration:
This word emits code to push a string reference onto the data stack. The result of this is two values pushed onto the stack: the address of the string's contents, and the string's length in bytes.
For the W65C816, I decided to use inline subroutine parameter passing.
jsr i_string .word 20 .asc "Hello from Shoehorn!"
This word emits code that should be intercepted by a machine language or symbolic debugger, whichever is appropriate for your local development environment.
For the W65C816 processor running the IM/F monitor, I store the value $FF00 inline into the output code.
This expands to the bytes $00 followed by $FF, which the processor will interpret as a BRK instruction followed by a $FF signature byte.
Configuring the processor environment to execute code compiled with Shoehorn is a simple task; however, there are some details that need to be kept in mind.
The example discussed in this paper assumes a subroutine-threaded runtime environment.
First, we use the X register as the data stack pointer. Although there's more explicit stack pointer management overhead, it actually results in faster software overall. 
Second, you'll notice that we never reference memory at negative offsets from X. That's because the W65C816 processor only supports unsigned offsets.
you'll notice that the top of data stack sits at offset 4 from X,
and not at offset 0.
This allows us to adjust the X register to pop the data stack,
while retaining the ability to read the former top value off the stack afterwards.
It is vitally important to perform these operations in that order,
so that processor flags are set correctly for, e.g., conditional jumps.
You see this in the code emitted by the
repeated below with commentary for your convenience:
inx ; Pop the data stack. inx lda 2,x ; Retrieve the former top of stack value beq L100 ; If zero, branch to L100.
The four byte offset also gives code enough space to hold a single 24-bit address,
e.g., for words which reference the CPU's entire address space.
There is no long form for direct page indexed indirect addressing (e.g.,
so for long addresses, a word must build a temporary pointer, like so:
; Fetch word from far-address (a b -- n). ; Per usual Forth convention, b is the most-significant ; word of the 32-bit pointer. inx ; Pop one word from stack. inx ; a is at 4,x; b is at 2,x. lda 4,x ; Move big-endian ptr at 2,x sta 0 ; to little-endian ptr at 0,x lda  ; Perform the actual fetch. sta 4,x ; Replace a with n.
When initializing the X register to begin execution of Shoehorn-compiled code, keep in mind that X will generally be decremented or incremented before memory is affected. Also keep in mind the bias offset as well. Thus, given a bias offset of four bytes as discussed above, if your data stack sits in zero-page ($0000-$00FF), X should be initialized to $00FC.
The return stack pointer is kept in the S register, and functions like the normal 6502/65816 stack pointer. Because of this C can call into Forth and vice versa, modulo some parameter marshalling overhead.
For example, if Forth wants to call a C function
it can be done using a dedicated primitive binding:
entry_memcpy: ; ( dst src len -- result ) phx ; Save DSP phd ; Save DP lda 4,x ; len pha lda 6,x ; src pha lda 8,x ; dst pha jsr _memcpy ; Call into C ply ; clean up stack w/o disturbing A ply ply pld ; Restore DP plx ; Restore DSP sta 8,x inx inx inx inx rts
Similarly, if C wants to call into Forth, it must take care to establish the data stack the subroutine will work with, and marshal parameters accordingly.
; WTO - Write To Operator - named after the macro of the ; same name in the IBM System/360 environment. ; This function just invokes the Forth word TYPE. _wto: ; wto(addr, length) phd lda _forth_dp tad ldx _forth_stack lda 7,s sta 6,x lda 5,s sta 4,x jsr entry_type pld rts ; variables holding bank-0 address of Forth's stack ; and direct-page. _forth_dp: .word 0 _forth_stack: .word 0
Sharing the return stack is key in allowing this interoperability to work.
As you can see, most of the quantities worked with are 16-bits wide. Thus, the processor must be running in 65816-native mode, and the M and X flag bits must be clear. It's OK to set M and/or X as needed, provided you take the required precautions for preserving register contents first.
For example, let's look at the code which fetches a word from anywhere in the CPU's address space, only this time let's only grab a single byte:
; Fetch byte from far-address (a b -- c). ; Per usual Forth convention, b is the most-significant ; word of the 32-bit pointer. inx ; Pop one word from stack. inx ; a is at 4,x; b is at 2,x. lda 4,x ; Move big-endian ptr at 2,x sta 0 ; to little-endian ptr at 0,x sep #$20 ; Set 8-bit accumulator lda  ; Perform the actual fetch. rep #$20 ; Set back to 16-bit accumulator and #$00FF ; Clear upper byte of A. sta 4,x ; Replace a with n.
NOTE: Be extremely careful when altering the size of the index registers. As soon as the X bit is cleared, the upper half of the index registers will be set to zero. Therefore, make sure you preserve the values of X and/or Y as required before altering the X bit.
I've described Shoehorn, a cross-compiler for the Forth programming language that is trivial to retarget. Almost anyone familiar with writing assembly language for a processor of their choice is capable of retargeting the compiler in a matter of a couple of days. It generates assembly listings which can be included into a larger assembly language project easily. I've also described the 65816 ABI that I've settled upon for my own work.
Many features commonly found in a Forth environment remain unimplemented.
For example, immediate words like
[CHAR] still need to be written.
However, these will be added into the project on an as-needed basis.
: .n s>d <# #s #> type ; : .L ( L -- ) s>d <# #s [CHAR] L hold #> type ; : indent ( -- ) ." " ; : .entry ( -- ) ." entry_" ; : opc indent type space ; : abs$ ( cap up cao uo - ) opc type cr ; : abs" ( cap up cao uo - ) opc 34 emit type 34 emit cr ; : absE ( cap up cao uo - ) opc .entry type cr ; : abs# ( n cao uo -- ) opc . cr ; : dp,x ( dp -- ) opc .n ." ,x" cr ; : implied ( -- ) opc cr ; : lbl ( L -- ) opc .L cr ; : beqL ( L -- ) s" beq" lbl ; : .byte"" ( caddr u -- ) s" .byte" abs" ; : inx s" inx" implied ; : jmpL ( L -- ) s" jmp" lbl ; : jsr ( caddr u -- ) s" jsr" abs$ ; : jsrE ( caddr u -- ) s" jsr" absE ; : ldad,x ( dp - ) s" lda" dp,x ; : rts ( -- ) s" rts" implied ; : .word ( n -- ) s" .word" abs# ; : >z inx inx 2 ldad,x ; : emit:0branch ( L -- ) >z beqL ; : emit:call ( caddr u - ) jsrE ; : emit:close ( -- ) [char] . emit [char] ) emit cr ; : emit:entry ( caddr u - ) .entry type ." :" cr ; : emit:exit ( - ) rts ; : emit:jumpL ( L - ) jmpL ; : emit:label ( L - ) .L ." :" cr ; : emit:lit ( n -- ) s" i_literal" jsr .word ; : emit:open ( -- ) [char] . emit [char] ( emit cr ; : emit:str ( caddr u -- ) s" i_string" jsr dup .word .byte"" ; : emit:brkpt ( -- ) $FF00 .word ;
: .n s>d <# #s #> type ; : .L ( L -- ) s>d <# #s [CHAR] L hold #> type ; variable strctr 0 strctr ! : .str ( -- ) strctr @ ." STR" .n ; : .nexstr ( -- ) 1 strctr +! .str ; : indent ( -- ) ." " ; : >t0 indent ." ld t0,0(a0)" cr indent ." addi a0,a0,8" cr ; : .entry ( -- ) ." entry_" ; : emit:0branch ( L -- ) >t0 indent ." beq t0,x0," .L cr ; : emit:call ( caddr u - ) indent ." call entry_" type cr ; : emit:close ( -- ) ; : emit:entry ( caddr u - ) .entry type ." :" cr ; : emit:exit ( - ) indent ." jalr x0,0(ra)" cr ; : emit:jumpL ( L - ) indent ." jal x0," .L cr ; : emit:label ( L - ) .L ." :" cr ; : emit:lit ( n -- ) indent ." addi a0,a0,-8" cr indent ." li t0," .n cr indent ." sd t0,0(a0)" cr ; : emit:open ( -- ) ; : emit:str ( caddr u -- ) indent ." addi a0,a0,-16" cr indent ." li t0," dup .n cr indent ." sd t0,0(a0)" cr indent ." la t0," .nexstr cr indent ." sd t0,8(a0)" cr indent ." .data" cr .str ." :" cr indent ." .asc " 34 emit type 34 emit cr indent ." .code" cr ; : emit:brkpt ( -- ) indent ." ebreak" cr ;
 Falvo II, Samuel A. On Subroutine Threading for the W65C816 Processor. 2021 August 18. Accessed 2022 Feb 20. http://sam-falvo.github.io/2021/08/18/on-subroutine-threading-for-the-65816