				     DRAFT

		      Binary Application Record Encoding

Binary Application Record Encoding (BARE) is, as the name implies, a simple
binary representation for structured application data.

BARE messages omit type information and are not self-describing. The structure
of a message must be established out of band, generally by prior agreement and
context. For example, if a BARE message is returned from /api/user/info, it
can be inferred from context that the message represents user information, and
the structure of such messages is available in the documentation for this API.

				 MESSAGE FORMAT

A BARE message is a single value of a pre-defined type, which may be an
aggregate type. The encoding of each type is specified as follows:

				 BUILT-IN TYPES

The following primitive data types are supported:

	uint, int
		A variable-length integer. Each octet of the encoded value has
		the most-significant bit set, except for the last octet. The
		remaining bits are the integer value in 7-bit groups,
		least-significant first.

		Signed integers are mapped to unsigned integers using "zig-zag"
		encoding: non-negative values x are written as 2*x + 0, negative
		values are written as 2*(^x) + 1, where ^x is the bitwise
		complement of x; that is, negative numbers are complemented and
		whether to complement is encoded in bit 0.

		The maximum precision of a varint is 64 bits.

	u8, u16, u32, u64
		An unsigned little-endian integer with a fixed length in bits.
		The precision is 8, 16, 32, and 64 bits respectively.

	i8, i16, i32, i64
		A signed two's complement, little-endian integer with a fixed
		length in bits. The precision is 8, 16, 32, and 64 bits
		respectively.

	f32, f64
		A 32- or 64-bit IEEE-754 floating point number, little-endian.

	bool
		A boolean, either true or false, represented respectively by a
		one or a zero encoded as an 8-bit unsigned integer. Any non-zero
		value is interpreted as true.

	enum
		A value from a set of possible values enumerated in advance,
		encoded as a uint.

	string
		A UTF-8 string of text, prefixed by the string's length in bytes
		as a uint.

	data<length>
		Arbitrary binary data with a fixed "length" in bytes, e.g.
		data<16>. The binary data is encoded literally. The length must
		be representable as a u64, but is not encoded into the message.

	data
		Arbitrary binary data of an undefined length. The length in
		bytes is encoded as a uint, followed by the binary data encoded
		literally.

	void
		A type with zero length. It is useful to create user-defined
		types which alias void to create discrete options in a tagged
		union which do not have any underlying storage.
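
The uint and int encodings described above can be sketched in Python as follows.
This is a minimal illustration rather than a reference implementation; the
function names are hypothetical, and rejecting encodings longer than 10 octets
is an assumption derived from the 64-bit precision limit.

```python
from typing import Tuple


def encode_uint(value: int) -> bytes:
    """Encode an unsigned integer in 7-bit groups, least-significant first."""
    if not 0 <= value < 1 << 64:
        raise ValueError("uint must fit in 64 bits")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more octets follow
        value >>= 7
    out.append(value)  # last octet: high bit clear
    return bytes(out)


def decode_uint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a uint starting at pos; return (value, next position)."""
    result = 0
    for shift in range(0, 70, 7):  # at most 10 octets for 64 bits
        if pos >= len(buf):
            raise ValueError("truncated uint")
        octet = buf[pos]
        pos += 1
        result |= (octet & 0x7F) << shift
        if not octet & 0x80:
            if result >= 1 << 64:
                raise ValueError("uint exceeds 64 bits")
            return result, pos
    raise ValueError("uint longer than 10 octets")


def encode_int(value: int) -> bytes:
    """Zig-zag map a signed integer to unsigned, then encode it as a uint."""
    if not -(1 << 63) <= value < 1 << 63:
        raise ValueError("int must fit in 64 bits")
    # 2*x for non-negative x, 2*(~x) + 1 for negative x
    zigzag = 2 * value if value >= 0 else 2 * ~value + 1
    return encode_uint(zigzag)


def decode_int(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a zig-zag encoded int; return (value, next position)."""
    zigzag, pos = decode_uint(buf, pos)
    value = zigzag >> 1
    if zigzag & 1:  # bit 0 records whether the value was complemented
        value = ~value
    return value, pos
```

For instance, under this sketch the uint 300 occupies two octets (0xAC 0x02),
while the ints -1 and 1 each occupy one octet (0x01 and 0x02 respectively).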

Additionally, the following aggregate types are supported:

	optional<type>
		A value of "type" which may or may not be assigned, e.g.
		optional<u32>. Represented either as an 8-bit unsigned integer
		0, indicating that the value is unset; or any nonzero integer to
		indicate that the value is set, followed by the value.

	[length]type
		An array of values of "type" with a fixed "length", e.g.
		[8]string. The encoding of this value is the encoded member
		values concatenated to one another, with no delimiters or length
		prefix.

	[]type
		An array of values of "type" with an undefined length, e.g.
		[]string. The length of the array in values is encoded into the
		message as a uint, followed by the concatenated values.

	map[type A]type B
		A map of values of type B keyed by values of type A, e.g.
		map[u32]string. The encoded representation of a map begins with
		the number of key/value pairs encoded as a uint, followed by the
		key/value pairs concatenated together. Each key/value pair is
		encoded as the encoded key and encoded value concatenated.

		The order of items is undefined, and if a key is repeated, the
		last key/value pair of that key is considered authoritative.

	(type | type | ...)
		A tagged union whose value can be one of any type from a set.
		Each type in the set is assigned a numeric representation,
		starting at zero and incrementing for each type. The value is
		encoded as the selected tag as a uint, followed by the value
		itself encoded as that type.

	struct
		A set of values of arbitrary types, concatenated together in an
		order known in advance.
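
The aggregate encodings above compose the primitive encodings by concatenation.
The following Python sketch illustrates this under stated assumptions: all
function names are hypothetical, each encoder takes the member encoder as a
parameter, and a set optional is always written with the octet 1.

```python
import struct


def encode_uint(value: int) -> bytes:
    """Variable-length unsigned integer, 7-bit groups, least-significant first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string(text: str) -> bytes:
    """UTF-8 bytes prefixed by their length in bytes as a uint."""
    raw = text.encode("utf-8")
    return encode_uint(len(raw)) + raw


def encode_u32(value: int) -> bytes:
    """Fixed-width unsigned 32-bit little-endian integer."""
    return struct.pack("<I", value)


def encode_optional(value, encode_item) -> bytes:
    """One octet: 0 if unset, otherwise 1 followed by the value."""
    if value is None:
        return b"\x00"
    return b"\x01" + encode_item(value)


def encode_fixed_array(values, length: int, encode_item) -> bytes:
    """[length]type: members concatenated, no delimiters or length prefix."""
    if len(values) != length:
        raise ValueError("wrong number of members for fixed-length array")
    return b"".join(encode_item(v) for v in values)


def encode_array(values, encode_item) -> bytes:
    """[]type: member count as a uint, then the concatenated members."""
    return encode_uint(len(values)) + b"".join(encode_item(v) for v in values)


def encode_map(mapping, encode_key, encode_value) -> bytes:
    """map[A]B: pair count as a uint, then each key followed by its value."""
    out = encode_uint(len(mapping))
    for key, value in mapping.items():
        out += encode_key(key) + encode_value(value)
    return out


def encode_union(tag: int, payload: bytes) -> bytes:
    """Tagged union: selected tag as a uint, then the already-encoded value."""
    return encode_uint(tag) + payload
```

A struct needs no helper of its own in this sketch: a hypothetical struct with
a u32 field followed by a string field is simply
encode_u32(7) + encode_string("hi").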

			       USER-DEFINED TYPES

A user-defined type gives a name to a built-in type, or aliases another type.
This creates a distinct type, whose underlying storage is equivalent to the
type it names.

				   INVARIANTS

The following invariants must be upheld in a BARE schema:

1. Any type which is ultimately a void type (either directly or through
   user-defined types) may not be used as an optional type, struct member, array
   member, or map key or value. Void types may only be used as members of the
   set of types in a tagged union.
2. The lengths of fixed-length arrays and data types must be at least 1.
3. Structs must have at least one field.
4. Unions must have at least one type.
5. Map keys must use a primitive type which is not data or data<length>.
6. No two values in the same enum may share the same integer value.

			    MESSAGE SCHEMA LANGUAGE

The use of a schema language is optional. Implementations should support
decoding arbitrary BARE messages without such a document, or with a schema
defined using the native facilities of the language or runtime environment.

However, it may be useful to have a schema language, for use with code
generation, documentation, or interoperability. A domain-specific language is
provided for this purpose.

During lexical analysis, whitespace may be used to separate tokens, and is then
discarded. Additionally, "#" is used for comments; if encountered, the "#"
character and any subsequent characters are discarded until a LF is found. The
syntax of this language is represented by the following ABNF grammar (see
RFC5234):

	schema		= 1*user-type

	user-type	 = "type" user-type-name non-enum-type
	user-type	/= "enum" user-type-name enum-type

	type		= non-enum-type / enum-type
	non-enum-type	= primitive-type / aggregate-type / user-type-name

	user-type-name	= UPPER *(ALPHA / DIGIT) ; First letter is uppercase

	primitive-type	 = "int" / "i8"  / "i16" / "i32" / "i64"
	primitive-type	/= "uint" / "u8"  / "u16" / "u32" / "u64"
	primitive-type	/= "f32" / "f64"
	primitive-type	/= "bool"
	primitive-type	/= "string"
	primitive-type	/= "data" / ("data" "<" integer ">")
	primitive-type	/= "void"

	enum-type	= "{" enum-values "}"
	enum-values	= enum-value / (enum-values enum-value)
	enum-value	= enum-value-name / (enum-value-name "=" integer)
	enum-value-name	= UPPER *(UPPER / DIGIT / "_")

	aggregate-type	 = optional-type
	aggregate-type	/= array-type
	aggregate-type	/= map-type
	aggregate-type	/= union-type
	aggregate-type	/= struct-type

	optional-type	= "optional" "<" type ">"

	array-type	= "[" [integer] "]" type
	integer		= 1*DIGIT

	map-type	= "map" "[" type "]" type

	union-type	= "(" union-members ")"
	union-members	= union-member / (union-members "|" union-member)
	union-member	= type ["=" integer]

	struct-type	= "{" fields "}"
	fields		= field / (fields field)
	field		= 1*ALPHA ":" type

	UPPER		= %x41-5A ; uppercase ASCII letters
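
The lexical rules stated before the grammar (whitespace separates tokens, and
"#" starts a comment that runs to the next LF) can be sketched as a small
tokenizer in Python. The token categories used here (word, integer, punct) are
an assumption made for illustration; the grammar itself does not name them.

```python
import re

# Each alternative is a named group; comments and whitespace are discarded.
TOKEN = re.compile(
    r"""
    (?P<comment>\#[^\n]*)               # comment: runs up to the LF
  | (?P<space>\s+)                      # whitespace between tokens
  | (?P<integer>[0-9]+)                 # 1*DIGIT
  | (?P<word>[A-Za-z][A-Za-z0-9_]*)     # keywords, type names, field names
  | (?P<punct>[{}\[\]()<>|=:])          # single-character punctuation
    """,
    re.VERBOSE,
)


def tokenize(source: str):
    """Split schema text into (kind, text) pairs, dropping comments and spaces."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError("unexpected character %r at offset %d"
                             % (source[pos], pos))
        if match.lastgroup not in ("comment", "space"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

A parser for the grammar above would then consume this token list; validating
that user type names begin with an uppercase letter is left to that later
stage in this sketch.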

Here is a simple example schema using this language:

	type PublicKey data<128>
	type Time string # ISO 8601

	enum Department {
		ACCOUNTING
		ADMINISTRATION
		CUSTOMER_SERVICE
		DEVELOPMENT

		# Reserved for the CEO
		JSMITH = 99
	}

	type Customer {
		name: string
		email: string
		address: Address
		orders: []{
			orderId: i64
			quantity: i32
		}
		metadata: map[string]data
	}

	type Employee {
		name: string
		email: string
		address: Address
		department: Department
		hireDate: Time
		publicKey: optional<PublicKey>
		metadata: map[string]data
	}

	type Person (Customer | Employee)

	type Address {
		address: [4]string
		city: string
		state: string
		country: string
	}

The names of fields and user-defined types are informational: they are not
represented in BARE messages, but they may be used for code generation or to
provide meaningful names for readers of the schema.
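
To illustrate that names do not appear on the wire, the following Python sketch
encodes a value of the Address type from the example schema. The function names
and the sample values in the note below are hypothetical; only the field order
and field types come from the schema.

```python
def encode_uint(value: int) -> bytes:
    """Variable-length unsigned integer, 7-bit groups, least-significant first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string(text: str) -> bytes:
    """UTF-8 bytes prefixed by their length in bytes as a uint."""
    raw = text.encode("utf-8")
    return encode_uint(len(raw)) + raw


def encode_address(lines, city: str, state: str, country: str) -> bytes:
    """Encode the Address struct: [4]string, then three strings, in order."""
    if len(lines) != 4:
        raise ValueError("address must contain exactly 4 strings")
    # [4]string has a fixed length, so no element count is written.
    out = b"".join(encode_string(line) for line in lines)
    # Struct fields are concatenated in schema order; no names are written.
    return out + encode_string(city) + encode_string(state) + encode_string(country)
```

Calling encode_address(["1 Main St", "", "", ""], "X", "Y", "Z") yields 19
octets: four length-prefixed address lines followed by three length-prefixed
strings, with neither "city" nor "Address" appearing anywhere in the output.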

Enum value names are also informational. Values without an assigned integer
are assigned automatically in the order that they appear, starting from zero and
incrementing for each subsequent unassigned value. If an enum value is
explicitly specified, automatic assignment continues from that value plus one
for subsequent enum values.
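
The assignment rule for enum values can be sketched in Python as follows. The
function name and the input representation (a list of name and optional integer
pairs) are assumptions for illustration; the duplicate check reflects the
invariant that values in the same enum must be distinct.

```python
def assign_enum_values(entries):
    """Map each enum value name to its integer.

    entries is a list of (name, explicit_value_or_None) in schema order.
    """
    assigned = {}
    seen = set()
    next_value = 0
    for name, explicit in entries:
        value = next_value if explicit is None else explicit
        if value in seen:
            raise ValueError("duplicate enum value %d for %s" % (value, name))
        seen.add(value)
        assigned[name] = value
        next_value = value + 1  # automatic assignment continues from here
    return assigned
```

Applied to the Department enum from the example schema, this gives ACCOUNTING
= 0, ADMINISTRATION = 1, CUSTOMER_SERVICE = 2, DEVELOPMENT = 3, and JSMITH = 99.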

Union type members are assigned a tag in the order that they appear, starting
from zero and incrementing for each subsequent type. If a tag value is
explicitly specified, automatic assignment continues from that value plus one
for subsequent values.

		     COMPATIBILITY BETWEEN SCHEMA UPGRADES

This section is informative.

The recommended approach for message versioning is with the use of union types.
Adding new types to a union is backwards compatible with previous messages. For
example, the following schema provides several versions of a message:

	type Message (MessageV1 | MessageV2 | MessageV3)

	type MessageV1 {
	    ...
	}

	type MessageV2 {
	    ...
	}

	type MessageV3 {
	    ...
	}

An updated schema which added a MessageV4 would still be able to decode
versions 1, 2, and 3. However, you must make the decision to use versioning in
advance. Replacing a struct type with a union type that contains the same
struct is NOT backwards compatible.

If you later decide to deprecate MessageV1, you may remove it and specify the
initial tag explicitly:

	type Message (MessageV2 = 1 | MessageV3)

	type MessageV2 {
	    ...
	}

	type MessageV3 {
	    ...
	}
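
The effect of the explicit tag can be checked with a short Python sketch of the
tag assignment rule. The function name and input representation are
hypothetical; the point is only that MessageV2 and MessageV3 keep the tags they
had before MessageV1 was removed.

```python
def assign_union_tags(members):
    """Map each union member name to its tag.

    members is a list of (type_name, explicit_tag_or_None) in schema order.
    """
    tags = {}
    next_tag = 0
    for name, explicit in members:
        tag = next_tag if explicit is None else explicit
        tags[name] = tag
        next_tag = tag + 1  # automatic assignment continues from here
    return tags


# Original schema: (MessageV1 | MessageV2 | MessageV3)
old_tags = assign_union_tags(
    [("MessageV1", None), ("MessageV2", None), ("MessageV3", None)]
)

# Updated schema: (MessageV2 = 1 | MessageV3)
new_tags = assign_union_tags([("MessageV2", 1), ("MessageV3", None)])

# Every surviving member keeps its tag, so old V2 and V3 messages still decode.
compatible = all(old_tags[name] == tag for name, tag in new_tags.items())
```

Without the explicit "= 1", MessageV2 would be assigned tag 0 and previously
encoded messages would be decoded as the wrong type.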

			    SECURITY CONSIDERATIONS

Implementations must take care when decoding types with an unbounded length
(e.g. []int, map, data), as a malicious message can declare an excessive length
and expose a naive implementation to denial-of-service attacks, failed
allocations, or other security faults.
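
One possible mitigation, sketched here in Python, is to validate a declared
length before acting on it. The function names and the max_length parameter
with its default of 1 MiB are assumptions for illustration; an appropriate
limit depends on the application.

```python
def decode_uint(buf: bytes, pos: int = 0):
    """Decode a uint starting at pos; return (value, next position)."""
    result = 0
    for shift in range(0, 70, 7):  # at most 10 octets for 64 bits
        if pos >= len(buf):
            raise ValueError("truncated uint")
        octet = buf[pos]
        pos += 1
        result |= (octet & 0x7F) << shift
        if not octet & 0x80:
            if result >= 1 << 64:
                raise ValueError("uint exceeds 64 bits")
            return result, pos
    raise ValueError("uint longer than 10 octets")


def decode_data(buf: bytes, pos: int = 0, max_length: int = 1 << 20):
    """Decode a variable-length data value; return (data, next position).

    The declared length is checked against both an application limit and the
    number of octets actually remaining before any slicing or allocation.
    """
    length, pos = decode_uint(buf, pos)
    if length > max_length:
        raise ValueError("declared length %d exceeds limit" % length)
    if length > len(buf) - pos:
        raise ValueError("declared length exceeds remaining input")
    return buf[pos:pos + length], pos + length
```

The same two checks apply to []type and map, where the declared count should
also be bounded before any container is pre-allocated.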