A Forth subset-to-assembly language (cross-)compiler.
Fix broken DO and ?DO loop counter logic
V2.2: Introduce TCONSTANT support
V2.1: Support for immediate-mode words


browse  log 



You can also use your local clone with git send-email.

Version 2.2


Bringing up new hardware systems from scratch, by which I mean systems with little or no pre-existing development tools of any kind, can genuinely be truly daunting. This reinforces barrier-to-entry, and makes it difficult for people to explore opportunities to evolve designs in new and interesting ways going forward. Shoehorn is a cross-compiler intended to reduce the strength of these barriers-to-entry. By providing a cross-compiler that can be customized for a particular processor's instruction set architecture or assembler syntax, by a single developer in only a handful of days (instead of months to years), it should be significantly easier to bring up a new computer environment completely from scratch.


When I started working on the Kestrel-1 prototype back in 2004, I built it using a series of Radio Shack solderless breadboards, a W65C816 processor running at 4MHz, and a whopping 32KB of RAM mapped into the upper half of bank 0, with I/O occupying the lower-half. The I/O consisted of a single W65C22 VIA chip. System software could be loaded into it via a specialized "IPL circuit", which allowed a PC to boot the system via its parallel port. The idea was to use this system as a springboard for more refined, more complete homebrew computer systems. One of my first ideas was to create a kind of 65816-based Jupiter ACE: when you first turned the machine on, the idea was to drop you into a Forth environment. For various reasons, I never got the chance to fulfill my initial vision.

Fast forward to 2021, and I have come to the realization that I only have several more years until its 20th anniversary. It would sure be nice to be able to complete the computer design by then, as a kind of treat to myself and to others who'd love to play with it too. However, it's hard enough to bring up new hardware from scratch; it would be great if creating the first system software image for it could contribute as little as possible to this challenge.

Towards that end, I have created a retargetable Forth cross-compiler called Shoehorn. It accepts as input a Forth program listing, and produces as output an assembly language program listing suitable for consumption by a more conventional assembler.

The version of Shoehorn described in this document was designed to run under GForth 0.7.0 64-bit on a Linux system. While I believe the software is portable to other Forth environments, it has not yet been tested in these systems.

#About the Name Shoehorn

This project initially started out as an experiment named stc-test. It stood for subroutine-threaded code test. As the project evolved to a maturity I was happy with, I decided to promote the experiment into its own project. That project needed a unique name that hinted at its intended function.

This project is explicitly a bootstrap compiler, and explicitly not a general purpose compiler. Therefore, the name needed to be related, in some way, to the word bootstrap. Two options came to mind: bootlace and shoehorn. Bootlace works well enough; however, lacing implies linking things together. This compiler doesn't link anything together, but instead is intended to help manufacture something which itself is linked. Shoehorn works better, I think; it is a tool which is used to help don a set of shoes or boots, especially before they are broken in. This is a closer match to the intended use of this project.

So, Shoehorn it is.

#The Problem

Writing a Forth environment from scratch is hard work. It is laborious and error-prone, thanks in large part to the large number of inter-linked data structures comprising the dictionary. You have one record describing each executable word, plus at least one record describing the vocabulary each word appears in (if supported), plus at least one record describing each thread or task (a.k.a. "user", hence the phrase "user variables"), plus at least one record describing each "marker" (a memory management concept introduced with ANS Forth), plus at least one record describing each context for exception handling (if supported), and so on. Some of these records are statically defined in the bootstrap image, others are managed exclusively at run-time, while others still are both referenced during the initial compile phase and during run-time. It's no wonder that people often compare Forth systems to full-blown operating systems; in fact, the first deployed Forth systems were operating systems as well.

And, if at a later time you decide you want to alter the layout of those structures, or the meaning of any existing fields, you just changed the ABI of the system as a whole. You typically must recompile/re-assemble your project from scratch, in the worst cases reusing precious little code in the process.

#The Idea

Ideally, I'd like to write the Forth compiler in Forth itself. This idea dates back to the earliest of Forth systems. Besides being something of a rite of passage, it is also something of a tradition to Forth implementors. It also makes automatically managing all those interconnected data structures a snap.

For the purposes of bootstrapping a new hardware design, for which we assume no established development toolchain beyond an assembler exists, this suggests creating a metacompiler. It would take in a source file containing Forth (or a subset thereof) code and it would produce a corresponding assembly language file.

While the idea of a retargetable cross-compiler is an old idea, historically, these tools often require extensive and time-consuming customization. The simplest Oberon compiler, for example, is documented as being maintainable by a single "motivated" student with several months of effort. Shoehorn manages to be retargetable with only days of effort.

The trick is careful selection of virtual machine primitives so that they are drop-in compatible with a more sophisticated binary interface. You will obviously take a performance hit; however, the goal when bootstrapping a new environment is correctness, not performance. Once the basic system has been debugged, optimization can happen later.

#The Details

The back-end code generator and the resulting application binary interface mutually interact to influence each other. For explanatory purposes, I will first discuss the interface that all backend code generators must implement, followed by the application binary interface that I settled upon. I will assume we're generating software for the W65C816 microprocessor.

#Back-End Interface

Shoehorn currently defines 22 procedures which must be customized to target a new assembler.

Procedure Stack effect Purpose Version
emit:0branch L -- Jump to label L if top of stack is zero. 1.0
emit:call caddr u - Invoke subroutine entry_XXX for word XXX. 1.0
emit:close -- Close assembler's symbol scope. 1.0
emit:entry caddr u - Declare a subroutine entry_XXX for word XXX. 1.0
emit:exit - Return from current subroutine. 1.0
emit:jumpL L - Jump unconditionally to label L. 1.0
emit:label L - Declare label L. 1.0
emit:lit n -- Push literal n onto the data stack. 1.0
emit:open -- Open a new assembler symbol scope. 1.0
emit:str caddr u -- Push literal string onto the data stack. 1.0
emit:brkpt -- Break into a debugger. 1.0
emit:prim caddr u - Jump to primitive prim_XXX for word XXX. 2.0
emit:create -- Push data field of a CREATEd word onto the data stack. 2.0
emit:constant -- Push single cell numeric CONSTANT onto the data stack. 2.2
emit:cell n -- Emit raw cell. 2.0
emit:char c -- Emit raw byte. 2.0
emit:litaddr caddr u -- Push address of word onto the data stack. 2.0
emit:do L L -- L L Emit DO-loop boilerplate. 2.0
emit:?do L L -- L L Emit ?DO-loop boilerplate. 2.0
emit:loop L L -- Loop back to a DO-loop's top of body. 2.0
emit:+loop L L -- After incrementing counter, loop back to a ?DO-loop's top of body. 2.0
emit:header caddr u caddr u -- Emit a normal Forth dictionary header structure. 2.0
emit:iheader caddr u caddr u -- Emit an immediate Forth dictionary header structure. 2.1
emit:voclinks -- Emit a Forth dictionary vocabulary header structure. 2.0


This word compiles code to consume the top of the data stack. If the value consumed is zero, branch to the specified label.

For the W65C816 processor, we emit the following code:

lda 2,x
beq label


This word compiles a subroutine call to the specified Forth word entry label. Standard prefixes, such as entry_, are not supplied by the compiler. Your target adaptation must provide these as appropriate.

Words which have illegal syntax (e.g., 2DROP, +, */) will already have been mapped to their assembler-compatible representation assuming the correct xref: and xname: declarations have been issued ahead of time.

For example, if the following code appears in the source listing:

xname: drop2 2drop


tx: drop2 2drop  drop drop  ;

then any reference to the symbol 2drop inside a colon definition will emit the following output code:

jsr entry_drop2


This word closes a previously opened local naming scope. Some assemblers do not provide explicit syntax for this; however, for those which do, this gives the opportunity to statically demarcate when a scope ends.

For the xa65 assembler, this maps to the following declaration:



This word demarcates the beginning of a Forth colon (or similar) definition.

For the W65C816 backend, I prepend the string entry_ and declare a label. For example, given the above definition for 2drop, the following code would be generated:



This word compiles code that returns from the current Forth word. Since the backend builds code for a subroutine-threaded Forth environment, this expands to a simple subroutine return instruction.



This word emits code to unconditionally jump to the specified label number. If the number 123 was supplied as an argument, this expands simply to the following jump instruction.

jmp L123


Correspondingly, this word emits code to declare the existence of a numbered label.



This word emits code to push a single cell wide, numeric literal onto the data stack.

For the W65C816, I decided to use inline subroutine parameter passing to perform the push.

jsr i_literal
.word 123

This optimizes for smaller code size. It is also possible to directly inline the push logic, like so:

lda #123
sta 4,x

The former approach consumes 5 bytes per literal pushed, while the latter takes 7 bytes. The inline push code is also significantly faster as well, saving 12 cycles at a minimum.


This word opens a new local naming scope. Some assemblers do not provide explicit syntax for this; however, for those which do, this gives the opportunity to statically demarcate when a scope begins.

For the xa65 assembler, this maps to the following declaration:



This word emits code to push a string reference onto the data stack. The result of this is two values pushed onto the stack: the address of the string's contents, and the string's length in bytes.

For the W65C816, I decided to use inline subroutine parameter passing.

jsr i_string
.word 20
.asc "Hello from Shoehorn!"


This word emits code that should be intercepted by a machine language or symbolic debugger, whichever is appropriate for your local development environment.

For the W65C816 processor running the IM/F monitor, I store the value $FF00 inline into the output code.

.word 65280

This expands to the bytes $00 followed by $FF, which the processor will interpret as a BRK instruction followed by a $FF signature byte.


As with colon-definitions, using the xref: or xname: word to declare a pre-existing primitive to Shoehorn will create a word header that is linked into the Forth dictionary. A direct jump to a primitive will then be encoded.

Semantically, xref: foo is equivalent to : foo prim_foo ; in Forth, except that prim_foo cannot be directly named from Forth code.

emit:prim is used to encode the unconditional branch to the named primitive. (The name is guaranteed to be assembly-safe.)


Typically emits code similar to JSR i_create so that the address of a buffer may be placed onto the data stack. i_create must be provided in the standard runtime.


Typically emits code similar to JSR i_constant so that the value of a constant may be placed onto the data stack. i_constant must be provided in the standard runtime.


Emits a raw cell. Typically used to populate a buffer or vector made with TCREATE.


Emits a raw byte. Typically used to populate a buffer or vector made with TCREATE.


Emits code that loads the address of the named entrypoint (guaranteed assembly-safe) onto the data stack. Typically used with ['] foo to push the address of entry_foo.


Emits a DO-loop's boilerplate code. See also emit:loop.


Emit ?DO-loop boilerplate. See also emit:+loop.


Loop back to a DO-loop's top of body.


After incrementing counter, loop back to a ?DO-loop's top of body.


Emit a word's dictionary header.


Emit a word's dictionary header, but with settings appropriate for an IMMEDIATE-mode word.

Emits the top-level vocabulary linkage structure for the current vocabulary.

#Application Binary Interface

Configuring the processor environment to execute code compiled with Shoehorn is a simple task; however, there are some details that need to be kept in mind.

The example discussed in this paper assumes a subroutine-threaded runtime environment.

#The Data Stack

First, we use the X register as the data stack pointer. Although there's more explicit stack pointer management overhead, it actually results in faster software overall. [1]

Second, you'll notice that we never reference memory at negative offsets from X. That's because the W65C816 processor only supports unsigned offsets.

Third, you'll notice that the top of data stack sits at offset 4 from X, and not at offset 0. This allows us to adjust the X register to pop the data stack, while retaining the ability to read the former top value off the stack afterwards. It is vitally important to perform these operations in that order, so that processor flags are set correctly for, e.g., conditional jumps. You see this in the code emitted by the emit:0branch definition, repeated below with commentary for your convenience:

inx         ; Pop the data stack.
lda 2,x     ; Retrieve the former top of stack value
beq L100    ; If zero, branch to L100.

The four byte offset also gives code enough space to hold a single 24-bit address, e.g., for words which reference the CPU's entire address space. There is no long form for direct page indexed indirect addressing (e.g., [4,x] vs. (4,x)), so for long addresses, a word must build a temporary pointer, like so:

; Fetch word from far-address (a b -- n).
; Per usual Forth convention, b is the most-significant
; word of the 32-bit pointer.

inx         ; Pop one word from stack.
inx         ; a is at 4,x; b is at 2,x.
lda 4,x     ; Move big-endian ptr at 2,x
sta 0       ;   to little-endian ptr at 0,x
lda [0]     ; Perform the actual fetch.
sta 4,x     ; Replace a with n.

When initializing the X register to begin execution of Shoehorn-compiled code, keep in mind that X will generally be decremented or incremented before memory is affected. Also keep in mind the bias offset as well. Thus, given a bias offset of four bytes as discussed above, if your data stack sits in zero-page ($0000-$00FF), X should be initialized to $00FC.

#The Return Stack

The return stack pointer is kept in the S register, and functions like the normal 6502/65816 stack pointer. Because of this C can call into Forth and vice versa, modulo some parameter marshalling overhead.

For example, if Forth wants to call a C function memcpy, it can be done using a dedicated primitive binding:

entry_memcpy:   ; ( dst src len -- result )
    phx         ; Save DSP
    phd         ; Save DP
    lda 4,x     ; len
    lda 6,x     ; src
    lda 8,x     ; dst
    jsr _memcpy ; Call into C
    ply         ; clean up stack w/o disturbing A
    pld         ; Restore DP
    plx         ; Restore DSP
    sta 8,x

Similarly, if C wants to call into Forth, it must take care to establish the data stack the subroutine will work with, and marshal parameters accordingly.

; WTO - Write To Operator - named after the macro of the
; same name in the IBM System/360 environment.
; This function just invokes the Forth word TYPE.

_wto:           ; wto(addr, length)
    lda _forth_dp
    ldx _forth_stack
    lda 7,s
    sta 6,x
    lda 5,s
    sta 4,x
    jsr entry_type

; variables holding bank-0 address of Forth's stack
; and direct-page.
_forth_dp:      .word   0
_forth_stack:   .word   0

Sharing the return stack is key in allowing this interoperability to work.

#Long Mode

As you can see, most of the quantities worked with are 16-bits wide. Thus, the processor must be running in 65816-native mode, and the M and X flag bits must be clear. It's OK to set M and/or X as needed, provided you take the required precautions for preserving register contents first.

For example, let's look at the code which fetches a word from anywhere in the CPU's address space, only this time let's only grab a single byte:

; Fetch byte from far-address (a b -- c).
; Per usual Forth convention, b is the most-significant
; word of the 32-bit pointer.

inx         ; Pop one word from stack.
inx         ; a is at 4,x; b is at 2,x.
lda 4,x     ; Move big-endian ptr at 2,x
sta 0       ;   to little-endian ptr at 0,x
sep #$20    ; Set 8-bit accumulator
lda [0]     ; Perform the actual fetch.
rep #$20    ; Set back to 16-bit accumulator
and #$00FF  ; Clear upper byte of A.
sta 4,x     ; Replace a with n.

NOTE: Be extremely careful when altering the size of the index registers. As soon as the X bit is set, the upper half of the index registers will be set to zero. Therefore, make sure you preserve the values of X and/or Y as required before altering the X bit.

#Conclusion and Further Work

I've described Shoehorn, a cross-compiler for the Forth programming language that is trivial to retarget. Almost anyone familiar with writing assembly language for a processor of their choice is capable of retargeting the compiler in a matter of a couple of days. It generates assembly listings which can be included into a larger assembly language project easily. I've also described the 65816 ABI that I've settled upon for my own work.

Many features commonly found in a Forth environment remain unimplemented. For example, immediate words like [CHAR] still need to be written. However, these will be added into the project on an as-needed basis.

#Appendix A - W65C816 Back-End

\ This Source Code Form is subject to the terms of the Mozilla Public
\ License, v. 2.0. If a copy of the MPL was not distributed with this
\ file, You can obtain one at https://mozilla.org/MPL/2.0/.

\ Header generation is hashed into one of eight buckets.
\ Unfortunately, we can only work with symbols, not addresses.
\ Thus, instead of a convenient record of eight cells to
\ keep track of the state of the vocabulary linkage, we need
\ to record eight full-length name buffers instead.

64 CONSTANT b/name
: fits ( u -- u )
  b/name 1- OVER U>= 0= ABORT" Name Too Long to Hash" ;
8 CONSTANT #buckets
b/name #buckets * CONSTANT b/vocab
CREATE vocab
  b/vocab ALLOT
: bucket ( n -- addr )
  b/name * vocab + ;
: remember ( caddr u addr -- )
  >R fits R> ( caddr u addr )
  2DUP C!
  1+ SWAP MOVE ;
: recall ( addr -- caddr u )
  count ;
: 0vocab
  #buckets 0 DO S" 00" I bucket remember LOOP ;
: hash ( caddr -- addr )
  C@ 7 AND bucket ;
: name>bucket ( caddr u -- addr )
  DROP hash ;

: .n                        s>d <# #s #> type ;
: .L ( L -- )               s>d <# #s [CHAR] L hold #> type ;

: indent ( -- )             ."   " ;
: .entry ( -- )             ." entry_" ;
: .prim ( -- )              ." prim_" ;
: .hdr ( -- )               ." header_" ;
: opc                       indent type space ;

: abs$ ( cap up cao uo - )  opc type cr ;
\ : abs" ( cap up cao uo - )  opc 34 emit type 34 emit cr ;
: abs" ( cap up cao uo - )  opc BEGIN DUP WHILE OVER C@ .  DUP 1 XOR IF ." ," THEN  1 /STRING REPEAT 2DROP CR ;
: absE ( cap up cao uo - )  opc .entry type cr ;
: absP ( cap up cao uo - )  opc .prim type cr ;
: abs# ( n cao uo -- )      opc . cr ;
: dp,x ( dp -- )            opc .n ." ,x" cr ;
: dp,s ( dp -- )            opc .n ." ,s" cr ;
: implied ( -- )            opc cr ;
: lbl ( L -- )              opc .L cr ;
: name# ( ca u cao uo - )   opc ." #entry_" type cr ;

: beqL ( L -- )             s" beq" lbl ;
: .byte"" ( caddr u -- )    s" .byte" abs" ;
: inx                       s" inx" implied ;
: inc                       s" inc" implied ;
: jmpL ( L -- )             s" jmp" lbl ;
: braL ( L -- )             s" bra" lbl ;
: bneL ( L -- )		    s" bne" lbl ;
: jsr ( caddr u -- )        s" jsr" abs$ ;
: jsrE ( caddr u -- )       s" jsr" absE ;
: jmpP ( caddr u -- )       s" jmp" absP ;
: ldad,x ( dp - )           s" lda" dp,x ;
: adcd,x ( dp - )           s" adc" dp,x ;
: ldad,s ( dp - )           s" lda" dp,s ;
: cmpd,s ( dp - )           s" cmp" dp,s ;
: rts ( -- )                s" rts" implied ;
: .word ( n -- )            s" .word" abs# ;
: .byte ( n -- )            s" .byte" abs# ;
: dex                       s" dex" implied ;
: stad,x ( dp - )           s" sta" dp,x ;
: clc                       s" clc" implied ;
: pha                       s" pha" implied ;
: pla                       s" pla" implied ;

: >z                        inx inx 2 ldad,x ;
: reserve		    dex dex ;
: lda#name		    s" lda" name# ;
: >t                        4 stad,x ;

: emit:0branch ( L -- )     >z beqL ;
: emit:prim ( caddr u - )   jmpP ;
: emit:call ( caddr u - )   jsrE ;
: emit:close ( -- )         ." .)" cr ;
: emit:entry ( caddr u - )  .entry type ." :" cr ;
: emit:exit ( - )           rts ;
: emit:jumpL ( L - )        jmpL ;
: emit:label ( L - )        .L ." :" cr ;
: emit:lit ( n -- )         s" i_literal" jsr  .word ;
: emit:open ( -- )          ." .(" cr ;
: emit:str ( caddr u -- )   s" i_string" jsr  dup .word  .byte"" ;
: emit:brkpt ( -- )         $FF00 .word ;
: emit:create ( -- )        S" i_create" jsr ;
: emit:constant ( -- )      S" i_constant" jsr ;
: emit:cell ( n -- )        .word ;
: emit:char ( c -- )        .byte ;
: emit:litaddr ( caddr u -- ) reserve lda#name >t ;
: do-common                 6 ldad,x pha 4 ldad,x pha inx inx inx inx ;
: emit:do ( L L -- L L )    do-common DUP emit:label ;
: emit:?do ( L L -- L L )   do-common OVER bral DUP emit:label ;
: loop-common ( L L -- )
  SWAP emit:label
  3 cmpd,s
  pla pla ;
: emit:loop ( L L -- )      1 ldad,s inc loop-common ;
: emit:+loop ( L L -- )     inx inx 1 ldad,s clc 2 adcd,x loop-common ;


: emit:header ( caddr u caddr u -- )
              ( `-asm-' `forth'    )
  OVER hash hsh !
  >R >R 2DUP .hdr type ." :" cr
  R> R> DUP DUP >R .byte .byte"" R> .byte 
  0 .word
  S" .word" opc .entry hsh @ recall type cr
  hsh @ remember

: emit:iheader ( caddr u caddr u -- )
               ( `-asm-' `forth'    )
  OVER hash hsh !
  >R >R 2DUP .hdr type ." :" cr
  R> R> DUP DUP >R .byte .byte"" R> .byte 
  $4000 .word
  S" .word" opc .entry hsh @ recall type cr
  hsh @ remember

: emit:voclinks ( -- )
  #buckets 0 DO I bucket count S" .word" absE LOOP 0vocab ;

#Appendix B - RISC-V RV64I Back-End

NOTE: This code only works with V1.0 of Shoehorn, and has not yet been updated to V2.x.

: .n                        s>d <# #s #> type ;
: .L ( L -- )               s>d <# #s [CHAR] L hold #> type ;

variable strctr
0 strctr !
: .str ( -- )               strctr @ ." STR" .n ;
: .nexstr ( -- )            1 strctr +!  .str ;

: indent ( -- )             ."   " ;
: >t0                       indent ." ld t0,0(a0)" cr 
                            indent ." addi a0,a0,8" cr ;
: .entry ( -- )             ." entry_" ;

: emit:0branch ( L -- )     >t0 indent ." beq t0,x0," .L cr ;
: emit:call ( caddr u - )   indent ." call entry_" type cr ;
: emit:close ( -- )         ;
: emit:entry ( caddr u - )  .entry type ." :" cr ;
: emit:exit ( - )           indent ." jalr x0,0(ra)" cr ;
: emit:jumpL ( L - )        indent ." jal x0," .L cr ;
: emit:label ( L - )        .L ." :" cr ;
: emit:lit ( n -- )         indent ." addi a0,a0,-8" cr
                            indent ." li t0," .n cr
                            indent ." sd t0,0(a0)" cr ;
: emit:open ( -- )          ;
: emit:str ( caddr u -- )   indent ." addi a0,a0,-16" cr
                            indent ." li t0," dup .n cr
                            indent ." sd t0,0(a0)" cr
                            indent ." la t0," .nexstr cr
                            indent ." sd t0,8(a0)" cr
                            indent ." .data" cr
                            .str ." :" cr
                            indent ." .asc " 34 emit type 34 emit cr 
                            indent ." .code" cr ;
: emit:brkpt ( -- )         indent ." ebreak" cr ;


[1] Falvo II, Samuel A. On Subroutine Threading for the W65C816 Processor. 2021 August 18. Accessed 2022 Feb 20. http://sam-falvo.github.io/2021/08/18/on-subroutine-threading-for-the-65816