~jakintosh/co

the programming language for the coalescent computer
98bdb315 — Jak Tiano 5 months ago
added (0.2.4): 'co build' uses --source option with implicit 'cofile' instead of required operand
78e77770 — Jak Tiano 5 months ago
added better error output to context.rs, fixed some compiler warnings
ce128595 — Jak Tiano 6 months ago
fixed some issues with parsing absolute padding, cleaned up some small compiler warnings


#Co

A content-addressable assembly language for the Coalescent Computer.

#Overview

Co is a concatenative assembly programming language inspired by Forth and Tal. It compiles to the COINS byte code specification, which is used to drive a virtual stack-machine CPU at the heart of the Coalescent Computer.

Co allows an author to write direct COINS opcodes, which are executed in the order they are written. There are two tools for abstraction: Routines, which are named blocks of code that are stored separately and jumped to at execution time, and Macros, which are named blocks of code that are rendered in place at assembly time.

An example of Co, taking a sip of coffee:

: sip      DUP8 >swallow SWP8 SUB8 ;
: swallow  >extract >absorb ;
: extract  LIT8 4 MUL8 LIT8 10 SWP8 DIV8 ;
: absorb   LIT8 0xC0 SWP8 DVW8 ;

LIT8 250 LIT8 10 >sip

The key feature of Co is that both Routine and Macro symbols are stored in a local symbol library indexed by their content hash. Likewise, when those symbols are exported to the symbol library, the names used in the human-written code are resolved to content hashes in their canonical form. This means that the same hash will always execute the exact same code (given the same system state/input data, of course).
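
To make the idea concrete, here is a minimal Python sketch of name-to-hash canonicalization. It is not how the co toolchain is implemented: SHA-256, hashing the textual tokens rather than assembled bytecode, and the `canonicalize` helper are all assumptions made for illustration.

```python
import hashlib

def content_hash(canonical_tokens):
    """Hash a canonical token list (SHA-256 is an assumption, not the Co spec)."""
    text = " ".join(canonical_tokens)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def canonicalize(name, sources, cache):
    """Return the hash of routine `name`, with every >call replaced by a hash."""
    if name in cache:
        return cache[name]
    canonical = []
    for token in sources[name]:
        if token.startswith(">"):
            # A call is stored by the callee's hash, not by its local name.
            canonical.append(">" + canonicalize(token[1:], sources, cache))
        else:
            canonical.append(token)
    cache[name] = content_hash(canonical)
    return cache[name]

# The coffee routines from the example above, as token lists.
sources = {
    "sip": "DUP8 >swallow SWP8 SUB8".split(),
    "swallow": ">extract >absorb".split(),
    "extract": "LIT8 4 MUL8 LIT8 10 SWP8 DIV8".split(),
    "absorb": "LIT8 0xC0 SWP8 DVW8".split(),
}

# The same bodies under different (hypothetical) local names.
renamed = {
    "drink": "DUP8 >gulp SWP8 SUB8".split(),
    "gulp": ">pull >soak".split(),
    "pull": sources["extract"],
    "soak": sources["absorb"],
}

sip_hash = canonicalize("sip", sources, {})
drink_hash = canonicalize("drink", renamed, {})
print(sip_hash == drink_hash)  # True
```

Because local names never reach the canonical form, two authors who write the same code under different names end up with the same hash.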

#Installation

To get up and running, you'll need Rust installed on your system. Assuming you have that, clone the repo and install like so:

  1. git clone https://git.sr.ht/~jakintosh/co
  2. cd co
  3. cargo install --path .

There are very few dependencies and source files, so this should take only a few seconds.

To run co executables, you'll need a virtual machine as well. You can install the reference implementation in a very similar manner:

  1. git clone https://git.sr.ht/~jakintosh/cohost
  2. cd cohost
  3. cargo install --path .

#Usage

Installing Co will add the co executable to your path, which is the entire interface to the language. The co program manages both assembly of executable ROM files, and the export to and management of the local symbol library. You can explore the CLI by typing co --help, or even just co.

#ROM Assembly

To assemble a Co source file into an executable ROM, use the co assemble <source> <output> command. For example, co assemble coffee.co coffee.rom will read the ./coffee.co source file, assemble it into a ROM, and write that ROM to ./coffee.rom.

#Symbol Import

To import all of the named symbols in a Co source file into the symbol library, use the co library import [-n/--name] <source> command. For example, co library import -n .co.stack stack.co will read the ./stack.co source file and import all of its named symbols into the .co.stack namespace in the symbol library.

#Name Resolving

TODO

#Symbol Library Navigation

To browse the symbol library namespaces, use co library list <namespace>. For example, co library list . will list all names at the root of the library. This includes both named symbols and other namespaces. Using co library list on a namespace table will show another list of names, and using it on a symbol will output the textual bytecode representation of that symbol.

#Build System

TODO

#Executing Programs

The co binary only deals with transforming source code into COINS bytecode. To actually run the generated bytecode, you'll need a virtual machine that executes against the COINS spec. The current reference implementation is called Cohost, and can be found at https://git.sr.ht/~jakintosh/cohost

More instructions can be found there, but once installed, cohost run <rom-file> is all you need. cohost run --debug <rom-file> will also give you a visualization of the CPU internals and let you step through each instruction, which can be very helpful when getting started.

#Rationale

Why build this?

Essentially, content-addressable code seemed like a really powerful idea that was under-explored. My inclination towards minimal systems and permacomputing, along with my aversion to bloated modern tech stacks, drove me to try building something from as far down as I could go without having to build new hardware.

And so what does this project enable?

The intention behind this system is to enable many individual nodes on a network to easily share, verify, and execute code. When this system is fully built out, you would be able to share a single root hash for some routine, and the receiver could then resolve that hash through their local symbol library or via a request to the network. The receiver can then (1) verify that the bytecode received from peers is valid by hashing it themselves, and (2) recursively request any missing hashes referenced in that bytecode.
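
The verify-then-recurse flow can be sketched in a few lines of Python. Everything here is hypothetical: the network layer does not exist yet, so `fetch` stands in for a peer request, SHA-256 stands in for whatever hash the library uses, and blobs are plain text in which `>hash` tokens are references.

```python
import hashlib

def hash_of(blob):
    """Content hash of a blob (SHA-256 is an assumption)."""
    return hashlib.sha256(blob).hexdigest()

def references(blob):
    """Hashes referenced by a blob, in this sketch's made-up text encoding."""
    return [t[1:] for t in blob.decode("utf-8").split() if t.startswith(">")]

def resolve(root, library, fetch):
    """Make sure `root` and everything it references is in `library`."""
    pending = [root]
    while pending:
        wanted = pending.pop()
        if wanted not in library:
            blob = fetch(wanted)  # ask the network for the missing hash
            if hash_of(blob) != wanted:
                # (1) a peer cannot substitute different code for a hash
                raise ValueError("received bytes do not match " + wanted)
            library[wanted] = blob
        # (2) recursively request anything this blob refers to
        pending.extend(h for h in references(library[wanted]) if h not in library)
    return library

# A two-level example: `root` refers to `leaf` by hash.
leaf = b"LIT8 1 ADD8"
leaf_hash = hash_of(leaf)
root = (">" + leaf_hash + " DUP8").encode("utf-8")
root_hash = hash_of(root)

network = {leaf_hash: leaf, root_hash: root}  # stand-in for remote peers
library = resolve(root_hash, {}, network.__getitem__)
print(sorted(library) == sorted([root_hash, leaf_hash]))  # True
```

Sharing `root_hash` alone is enough for the receiver to end up with both blobs, and a peer that returns the wrong bytes is rejected before anything is stored.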

This approach of referring to routines by their content hash also eliminates a number of other tension points common in other computing environments. For instance, since code is identified uniquely at the Routine level, library dependencies are eliminated, and code reuse is fully maximized. A low-level math library that rarely changes can be referenced "statically" by hundreds or thousands of other routines without ever storing the used routines more than once.

In fact, since there's no concept of separate code libraries at all, only the specific routines that are actually used are ever downloaded. Likewise, all routines only have to be loaded into memory once, minimizing the memory footprint of a personal computing system running many programs. Furthermore, static and dynamic linking is no longer a question; all routines are statically linked semantically at assembly time, but can be linked in memory dynamically at run time.

Another exciting consequence is that "delta updates" of larger modules of code will happen automatically. When you make changes to part of your codebase, the Merkle-tree structure of hash references means that only the changed nodes and the nodes above them change their semantic structure (and thus their hash). When downloading a new version of a module, all of the code that hasn't semantically changed will still be in the library, and will not have to be downloaded again.
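
A small Python sketch shows the effect. The routine names and bodies are invented for the example, and hashing textual tokens with SHA-256 is an assumption rather than the toolchain's actual scheme.

```python
import hashlib

def library_hashes(sources):
    """Map each routine name to a hash that also covers its callees."""
    hashes = {}

    def visit(name):
        if name not in hashes:
            parts = [">" + visit(t[1:]) if t.startswith(">") else t
                     for t in sources[name]]
            hashes[name] = hashlib.sha256(" ".join(parts).encode("utf-8")).hexdigest()
        return hashes[name]

    for name in sources:
        visit(name)
    return hashes

v1 = {
    "main": [">left", ">right"],
    "left": [">inc", "DUP8"],
    "right": [">dec", "DUP8"],
    "inc": ["LIT8", "1", "ADD8"],
    "dec": ["LIT8", "1", "SUB8"],
}
v2 = dict(v1, dec=["LIT8", "2", "SUB8"])  # only 'dec' is edited

old, new = library_hashes(v1), library_hashes(v2)
changed = sorted(name for name in v1 if old[name] != new[name])
to_download = set(new.values()) - set(old.values())

print(changed)           # ['dec', 'main', 'right']
print(len(to_download))  # 3
```

Editing `dec` changes its own hash and the hashes of the routines that reach it (`right` and `main`), while `left` and `inc` keep their hashes and never need to be fetched again.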

I could keep going (like how you can quickly check your system for security vulnerabilities by scanning a list of loaded hashes), but this is already enough for an introductory README.

#Writing Co

Co only has three fundamental rules.

First is that tokens are whitespace delimited with either spaces, tabs, or newlines.

Second is that there are five types of tokens:

  1. Runes, which are a specific set of single-character symbols.
  2. Commands, which are a text label prefixed by a single-character symbol.
  3. Names, which are plain-text at the beginning of a symbol definition.
  4. Opcodes, which are the plain-text representations of the machine opcodes.
  5. Number Literals, which are decimal or hex numbers that render to binary.

Third is that there are four types of definitions:

  1. Routines, which are named units of assembly that are jumped to and executed at runtime.
  2. Macros, which are named units of source code that are rendered at assembly time.
  3. Imports, which make routines and macros from the library referenceable in a source file.
  4. Assembly, which is any code that exists outside of the previous definition types.

#Token - Runes

There are 8 runes in total:

  1. + denotes the beginning of an import definition.
  2. : denotes the beginning of a routine definition.
  3. % denotes the beginning of a macro definition.
  4. ; denotes the end of any of the prior definitions.
  5. [ denotes the beginning of a list of macro parameters.
  6. ] denotes the end of a list of macro parameters.
  7. ( denotes the beginning of a comment.
  8. ) denotes the end of a comment.

Remember that all tokens are whitespace delimited, meaning that (invalid comment) will parse as an error, and that ( valid comment ) is correct.

Some examples of Runes in source code:

% push-one LIT8 1 ;                ( Push 8-bit '1' on the stack )
: one-plus-one LIT8 1 DUP8 ADD8 ;  ( Push 1, duplicate it, add )

Again, notice the usage of spaces around the Runes.
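
The whitespace rule is simple enough to sketch in Python. This is only an illustration, not the assembler's actual lexer; `tokenize` is a made-up helper, and treating comments as non-nesting is an assumption.

```python
def tokenize(source):
    """Split Co source on whitespace and drop ( ... ) comments."""
    tokens, in_comment = [], False
    for token in source.split():
        if in_comment:
            # Only a standalone ")" token closes the comment.
            in_comment = token != ")"
        elif token == "(":
            in_comment = True
        else:
            tokens.append(token)
    if in_comment:
        raise SyntaxError("unterminated comment")
    return tokens

line = ": one-plus-one LIT8 1 DUP8 ADD8 ;  ( Push 1, duplicate it, add )"
print(tokenize(line))
# [':', 'one-plus-one', 'LIT8', '1', 'DUP8', 'ADD8', ';']

print(tokenize("LIT8 1 (invalid comment)"))
# ['LIT8', '1', '(invalid', 'comment)']
```

In the second call the parentheses are glued to their neighbours, so no comment is recognized and the two leftover tokens are what a later stage would reject as an error.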

#Token - Commands

A command looks like this: >send. Broken down, it is composed of {marker}{label}; in this example the marker is > and the label is send.

There are 9 types of markers:

  1. > denotes a routine call.
  2. @ denotes the address of a routine.
  3. ~ denotes a macro usage.
  4. ' denotes a macro parameter.
  5. | denotes an absolute padding value.
  6. $ denotes a relative padding value.
  7. # denotes an anchor definition.
  8. * denotes the absolute address of an anchor.
  9. & denotes the relative address of an anchor.

There are 3 rules for labels:

  1. >, @, and ~ must refer to a known symbol.
  2. | and $ must parse to 16-bit unsigned integers.
  3. * and & must refer to a defined anchor in the file.
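
As a rough Python sketch, splitting a command into its marker and label could look like the following. `parse_command` is a hypothetical helper that only checks the shape of the token and the 16-bit range for padding; resolving symbols and anchors (rules 1 and 3) is left out.

```python
MARKERS = {
    ">": "routine call",
    "@": "routine address",
    "~": "macro usage",
    "'": "macro parameter",
    "|": "absolute padding",
    "$": "relative padding",
    "#": "anchor definition",
    "*": "absolute anchor address",
    "&": "relative anchor address",
}

def parse_command(token):
    """Split a command token into (kind, label), parsing padding values."""
    marker, label = token[:1], token[1:]
    if marker not in MARKERS or not label:
        raise ValueError(f"not a command: {token!r}")
    if marker in ("|", "$"):
        # Padding labels are hex literals that must fit in 16 bits.
        if not label.startswith("0x"):
            raise ValueError("padding must be a hex literal")
        value = int(label[2:].replace("_", ""), 16)
        if value > 0xFFFF:
            raise ValueError("padding must fit in 16 bits")
        return MARKERS[marker], value
    return MARKERS[marker], label

print(parse_command(">send"))    # ('routine call', 'send')
print(parse_command("|0x0000"))  # ('absolute padding', 0)
print(parse_command("&start"))   # ('relative anchor address', 'start')
```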

Some examples of commands in action:

% push-one LIT8 1 ;        ( macro: push 1 on stack )
: add-one ~push-one ADD8 ; ( routine: add 1 to byte on top of stack )

|0x0000 #start             ( set padding to 0, create 'start' anchor )
	~push-one              ( push 1 on the stack )
	>add-one               ( call 'add-one' routine )
	&start JPR16           ( jump to the 'start' anchor )

#Token - Import Commands

There is a special set of command markers that are only valid inside an Import definition. There are 3 types of import markers:

  1. . denotes that the label is a Path, and also separates its components.
  2. : denotes that the label is a Routine name.
  3. % denotes that the label is a Macro name.

We will cover how these are used in an import definition later on.

#Token - Names

When creating a new Routine or Macro definition, you must give it a local name so that it can be referenced by other Commands. Name tokens are position dependent, and are required after : and % Runes that mark the beginning of a Routine and Macro, respectively.

Examples of names:

% macro-name ( macro body goes here ) ;
: routine-name ( routine body goes here ) ;

#Token - Opcodes

To actually issue commands to the CPU, you write Opcodes. All of the other features of the language exist to create helpful abstractions around rendering useful sequences of opcodes, but ultimately the opcodes are the only part of the language that makes the CPU do anything at all. Co is designed specifically to work with the "Coalescent Instruction Set", abbreviated to COINS.

COINS Opcodes look like this: LIT8 ADD32 SWP16R DUP64. Each opcode has a 3-letter all-caps identifier (with the exception of OR), followed by an 8, 16, 32, or 64 to specify the bit-width of the instruction. Finally, the stack manipulation opcodes can optionally have an R appended to the end to specify that they operate on the return stack, instead of the data stack.

For a deeper dive on what all the opcodes are, and their function, check out the COINS repository.
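
The naming scheme described above can be captured with a small regular expression. This Python sketch only checks the shape of an opcode token; it does not know which identifiers actually exist in COINS or which opcodes accept the R suffix, and `parse_opcode` is a hypothetical helper.

```python
import re

# identifier (3 capitals, or OR) + bit width + optional return-stack suffix
OPCODE = re.compile(r"(?P<name>[A-Z]{3}|OR)(?P<width>8|16|32|64)(?P<ret>R?)")

def parse_opcode(token):
    """Return (identifier, bit_width, uses_return_stack), or None if malformed."""
    match = OPCODE.fullmatch(token)
    if match is None:
        return None
    return match["name"], int(match["width"]), match["ret"] == "R"

for token in "LIT8 ADD32 SWP16R DUP64 OR8".split():
    print(token, parse_opcode(token))
# LIT8 ('LIT', 8, False)
# ADD32 ('ADD', 32, False)
# SWP16R ('SWP', 16, True)
# DUP64 ('DUP', 64, False)
# OR8 ('OR', 8, False)
```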

#Token - Number Literals

The final type of token is a number literal. There are two types of number literal:

  1. 0 the decimal number literal, which is used with LIT opcodes.
  2. 0x00 the hex number literal, which can be used in all number contexts.

Decimal literals can only be used after a LIT8, LIT16, LIT32, or LIT64 opcode. These literals will automatically adapt to the specified literal size, without requiring any padding. For example, LIT8 1 will render to the bits 00000001 while LIT16 1 will render to 0000000000000001. When used outside of a LIT context, decimal literals will render to a 64-bit unsigned integer, making them of limited use.

Hex literals are prefixed by 0x and must include all padding. They can include _ characters as visual separators. Like decimal literals, they can also be used with LIT opcodes, but must match the opcode with full padding. LIT8 0x01 is valid, but LIT16 0x00 is not; LIT16 0x0001 must be used. Hex literals are also required for padding commands. Furthermore, Hex literals can be used anywhere to render directly to bytes in-place. This can be helpful if you want to define binary data.
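
The two sets of rules can be summarized in a short Python sketch. `render_literal` is a hypothetical helper, and big-endian byte order for decimal literals is an assumption; check the COINS spec for the real byte order.

```python
def render_literal(token, lit_bits=None):
    """Render a number literal to bytes.

    `lit_bits` is the width of the preceding LIT opcode, or None when the
    literal appears outside of a LIT context.
    """
    if token.startswith("0x"):
        digits = token[2:].replace("_", "")  # "_" is only a visual separator
        if len(digits) % 2:
            raise ValueError("hex literals must spell out whole bytes")
        data = bytes.fromhex(digits)
        if lit_bits is not None and len(data) * 8 != lit_bits:
            raise ValueError(f"hex literal must be exactly {lit_bits} bits")
        return data
    # Decimal literals adapt to the LIT width, or fall back to 64 bits.
    width = lit_bits if lit_bits is not None else 64
    return int(token).to_bytes(width // 8, "big")  # byte order is an assumption

print(render_literal("1", 8))         # b'\x01'
print(render_literal("1", 16))        # b'\x00\x01'
print(render_literal("0xC0DE", 16))   # b'\xc0\xde'
print(len(render_literal("1")))       # 8
```

A call such as `render_literal("0x00", 16)` raises `ValueError`, mirroring the rule that LIT16 0x0001 must be written out in full.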

Examples of number literals in use:

|0x0000 #program
	LIT8 0
	LIT16 0xC0DE
	LIT32 1337

#data
	0x00010203_04050607
	0x08090A0B_0C0D0E0F

#Definition - Routines

Routines are the primary way of abstracting executable code in Co. A routine begins with the : rune, followed by its name and a set of source code tokens, and ends with a ; rune. To call a routine from somewhere else in code, use a command composed of the > routine call marker and the routine's name.

Note: Because a routine must render out all of its internal calls to generate a content hash, a routine cannot call itself using the > command. In a future version of the COINS spec, there will be a new Opcode that allows you to push the address of the current routine on the stack, and the Co toolchain will automatically detect that when trying to use > recursively. As of this writing, recursion is not implemented.

Routines cannot contain Commands that use absolute addresses, since a Routine in practice may end up anywhere in memory. It can only use relative addresses.

An example of some routines (shown earlier):

: sip      DUP8 >swallow SWP8 SUB8 ;
: swallow  >extract >absorb ;
: extract  LIT8 4 MUL8 LIT8 10 SWP8 DIV8 ;
: absorb   LIT8 0xC0 SWP8 DVW8 ;

LIT8 250 LIT8 10 >sip

#Definition - Macros

Macros allow the reduction of source code size through their ability to create reusable blocks of source code. A macro definition begins with the % rune, followed by a name and a set of source tokens, and ends with a ; rune. The source code in a macro will be rendered directly in place of a ~ macro use during assembly.

An example of macros being used to reduce while-loop boilerplate:

% while-start #while-start DUP8 LIT8 0 EQU8 &while-end JCR16 ;
% while-end &while-start JPR16 #while-end ;

: interesting-routine
	~while-start
		( do some work )
	~while-end
;

#Parameterized Macros

After declaring the name of a macro, you can also choose to provide a list of named parameters, which you can then interpolate into labels by wrapping the parameter name in { and } inside the label.

These parameterized macros are powerful, but limited in their scope. Since they interpolate strings, they don't get the benefit of their symbol references being linked to content hashes, and so they can't be validated and canonicalized on import. However, they provide interesting and useful flexibility in certain use-cases.

An example of a parameterized macro running an infinite loop based on its input:

% infinite-loop [ macro param ]
	#{macro}-loop-start
		~{macro} '{param}
	*{macro}-loop-start JMP16
;
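
The interpolation step itself is plain string substitution, which this Python sketch imitates. `expand_macro` is a hypothetical helper, and the argument values `blink` and `led` are made up; how arguments are supplied at the ~ use site is not covered here.

```python
import re

def expand_macro(params, body, args):
    """Substitute {param} placeholders in each body token with its argument."""
    if len(params) != len(args):
        raise ValueError("wrong number of macro arguments")
    bindings = dict(zip(params, args))

    def fill(token):
        # An unknown {name} raises KeyError instead of passing through silently.
        return re.sub(r"\{([^{}]+)\}", lambda m: bindings[m.group(1)], token)

    return [fill(token) for token in body]

params = ["macro", "param"]
body = "#{macro}-loop-start ~{macro} '{param} *{macro}-loop-start JMP16".split()

print(expand_macro(params, body, ["blink", "led"]))
# ['#blink-loop-start', '~blink', "'led", '*blink-loop-start', 'JMP16']
```

The output tokens are ordinary commands again, which also shows why these references cannot be canonicalized on import: they only exist after the strings have been filled in.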

#Definition - Import

Earlier, we covered some of the Commands that are unique to the Import definition. An import definition begins with a + rune, followed by a Path and a set of import commands, and ends with a ; rune.

A Path begins with the . marker, and then contains a set of path names separated by more . symbols. The root path is just ., but a deeper namespace might look like this: .co.stack.

An actual import command looks like this: %stash8=stash or %stash8. The structure of this command is {symbol-type}{name}={local-name}, where the ={local-name} is optional if you want to keep the {name} for local use. In the example, the % means we're importing a Macro, stash8 is the name of the symbol being imported, and stash in the first example is the override name to be used in the file.

A full example of an import block with its special commands might look like this:

+ .co.stack
	%stash8=stash %unstash8=unstash
;
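
To spell out the structure of an import definition, here is a small Python sketch that takes the tokens between + and ; and splits them into a namespace and a list of imports. `parse_import` is a hypothetical helper for illustration, not part of the co toolchain.

```python
def parse_import(tokens):
    """Parse the tokens between '+' and ';' of an import definition."""
    path, *commands = tokens
    if not path.startswith("."):
        raise ValueError("an import must start with a Path")
    namespace = [part for part in path.split(".") if part]  # "." gives []

    kinds = {":": "routine", "%": "macro"}
    imports = []
    for command in commands:
        kind = kinds.get(command[:1])
        if kind is None:
            raise ValueError(f"not an import command: {command!r}")
        # {symbol-type}{name}={local-name}, where ={local-name} is optional
        name, _, local = command[1:].partition("=")
        imports.append((kind, name, local or name))
    return namespace, imports

namespace, imports = parse_import(".co.stack %stash8=stash %unstash8".split())
print(namespace)  # ['co', 'stack']
print(imports)
# [('macro', 'stash8', 'stash'), ('macro', 'unstash8', 'unstash8')]
```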

#Example Program

To conclude the crash course of the programming language, here is an example of a small, but not trivial, program that writes out the first 0x40 bytes of its source code to a device port.

For reference, this compiles to a 281-byte ROM, runs through 8 iterations of the primary #loop in the :send routine, and finishes executing in 383 cycles of the virtual CPU.

: send ( dev8 addr16 len16 -- | DEV[dev8] -> [ len16 ] )
	LIT16 0                            ( dev8 addr16 len16 offset16 )
	#loop
		>seek                          ( dev8 addr16 len16 offset16 done8 )
		LIT8 0x80 CPY8 SWP8            ( dev8 addr16 len16 offset16 dev8 done8 )
		>set-flag                      ( dev8 addr16 len16 offset16 slot8 )
		>read                          ( dev8 addr16 len16 offset16 slot8 data64 )
		DVW64                          ( dev8 addr16 len16 offset16 )
		DUP32 LST16
	&loop JCR16
;

: seek ( addr16 len16 offset16 -- .. done8 )
	LIT8 0x05 CPY16 ADD16 ADR16        ( seek memory address to addr16 + offset16 )
	LIT16 8 ADD16                      ( add 8 to offset )
	DUP32 LST16 NOT8                   ( offset >= len? same offset !< len )
;

: set-flag ( dev8 done8 -- slot8 )
	LIT8 0 NEQ8 &done JCR16
	( send ) LIT8 0xC0 OR8 RTN8
	#done LIT8 0x80 OR8
;

: read ( len16 offset16 slot8 -- .. data64 )

	LIT8 0x1E CPY8 SUB16 LIT8 8 SUB8 STH8 ( .. done8 | bytes8 )
	&full-read JCR16

	DUP8R STH8R LIT8 0xFE OR8 NOT8 &check-2 JCR16 LOD8 #check-2
	DUP8R STH8R LIT8 0xFD OR8 NOT8 &check-4 JCR16 LOD16 #check-4
	DUP8R STH8R LIT8 0xFB OR8 NOT8 &check-8 JCR16 LOD32 #check-8
	DUP8R STH8R LIT8 0xF7 OR8 NOT8 &end JCR16 LOD64 #end

	STH8R DUP8 LIT8 8 LST8 &exit JCR16
	( append ) LIT8 8 SUB8 >append-zeroes RTN16
	#exit DRP8 RTN16

	#full-read LOD64 DRP8R
;

: append-zeroes
	#loop
		DUP8 LIT8 0 EQU8 &done JCR16
		LIT8 0 SWP8 LIT8 1 SWP8 SUB8
		&loop JPR16
	#done DRP8
;

|0x0000 #program                       ( program start )
	LIT8 0                             ( dev8 )
	LIT16 0x0000                       ( addr16 )
	LIT16 0x0040                       ( len16 )
	>send