~happy_shredder/crinkle

Tool for processing MIME documents and storing in a nicer format.

clone

read-only
https://git.sr.ht/~happy_shredder/crinkle
read/write
git@git.sr.ht:~happy_shredder/crinkle

You can also use your local clone with git send-email.

#crinkle

Tool for processing MIME documents (in particular, emails) and storing in a nicer format.

Each section of the document is processed according to its type, e.g. converting quoted-printable into UTF-8 and converting base64 to binary. Each section is then packed into a tarball.

#Usage

Clone then ./configure; make; sudo make install. This will automatically enable pandoc for converting HTML to MD if pandoc is installed. Binary is installed to /usr/local/bin/crinkle.

Raw MIME documents are passed to the binary via stdin or as a filename argument. Output is written to the file given with -o: crinkle [input] -o <output>. Note that /dev/stdin is a valid output so crinkle can be used in an SMTP pipeline.

#Specification

This document specifies an open format for storing emails as files on disk. It provides particular guarantees allowing easy and reproducible email parsing and handling.

This is a living specification and is developed in conjunction with this repo. It is currently beta quality and should stabilise in the future.

#Introduction

Emails are transmitted and recorded in MIME format. This format is specified in RFC2045, 2046, 2047, 4288, 4289, and 2049. Unfortunately, in the real world MIME implementations do not strictly follow the spec. For example, headers may be improperly formatted, date formats are arbitrary, and so on. Even without these issues the format suffers from decades-old poor design: it is being used far beyond what its creators envisioned. Significant issues include:

  • Plain vs multipart. MIME comes in two subtly different flavours that, in practice, must be handled completely differently. This leads to duplicate metadata, ambiguous parsing, and difficult reconstruction.
  • Multipart messages may be recursive, which complicates representation. This recursion is unnecessary.
  • Attachments. These are base64 text blobs. This results in significant performance penalties, and fragility.
  • Minimal security. Email bodies may be GPG encrypted, but in practice this is nearly impossible to do correctly. Metadata cannot be protected.

bastion aims to resolve some of the security issues. bastion relies on crinkle to store messages robustly, simply and reliably. crinkle solves all the above issues. Of course it may be used elsewhere.

crinkle is very simple and is designed to be handled with standard UNIX-like utilities:

  • A MIME message is unpacked into a primary header, one or more sections, and attachments.
  • The primary header is formatted into TOML.
  • Each section is a short TOML header, detailing content type and so on. This header is followed by the body, which may be processed into a readable format.
  • Each attachment is unpacked into raw bytes.
  • The header, all sections, and all attachments are then packed into a tarball.

By using TOML, all issues parsing MIME headers are avoided and instead one may use existing libraries. TOML is a recent, well defined, simple format for storing key/value text. By separating out and processing each section, one can easily view and further handle email bodies. By separating out the attachments one can immediately serve to a user. Packing into a tarball uses another standard format with archival guarantees.

#Overview

There are two types of MIME documents: plain and multipart. Plain messages consist of a sequence of key: value pairs (e.g. from, to, subject fields; the header), at least 2 new lines, and then a body:

key0: value0
key1: value1
# ...more key-value pairs...

Hello, world!

Note about subtleties: Header fields have the form key: value, with value containing all possible characters, including newlines. This means that parsing a header is slightly more complicated than one might think initially. Headers can also contain duplicate fields.
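The subtleties above (values that span several lines, and repeated keys) can be sketched as follows. This is a minimal illustration, not the parser used by crinkle or crumble: the function name parse_headers is hypothetical, and it assumes that a line beginning with a space or tab continues the previous value.

```rust
/// Split a raw header block into (key, value) pairs.
/// A line starting with a space or tab continues the previous value
/// (header folding); duplicate keys are kept, in order of appearance.
/// `parse_headers` is a hypothetical name used for illustration only.
fn parse_headers(raw: &str) -> Vec<(String, String)> {
    let mut headers: Vec<(String, String)> = Vec::new();
    for line in raw.lines() {
        if line.is_empty() {
            break; // a blank line ends the header block
        }
        if line.starts_with(' ') || line.starts_with('\t') {
            // Continuation line: append it to the most recent value.
            if let Some((_, value)) = headers.last_mut() {
                value.push('\n');
                value.push_str(line);
            }
        } else if let Some((key, value)) = line.split_once(':') {
            // Only the first ':' separates key from value.
            headers.push((key.trim().to_lowercase(), value.trim().to_string()));
        }
        // Lines that fit neither form are skipped in this sketch.
    }
    headers
}
```

Keeping the result as an ordered list of pairs, rather than a map, means duplicate fields such as received survive until a later step decides how to store them.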

Multipart messages consist of a header much the same as for plain, and one or more (possibly nested) parts (or sections). All sections have some text type and are delimited by a boundary defined by the content-type in the header, e.g. Content-Type: multipart/mixed; boundary="0000000000008a01e4059229eec0". Each section contains a short header (defining e.g. encoding and other metadata) and some body. For example, an email could have a plain-text encoded section, an HTML-encoded section, and a base64-encoded attachment:

key0: value0
key1: value1
# ...more key-value pairs...
Content-Type: multipart/mixed; boundary="0000000000008a01e4059229eec0"

--0000000000008a01e4059229eec0
Content-Type: multipart/alternative; boundary="0000000000008a01e1059229eebe"

--0000000000008a01e1059229eebe
Content-Type: text/plain; charset="UTF-8"

Hello, world!

--0000000000008a01e1059229eebe
Content-Type: text/html; charset="UTF-8"

<div dir="ltr">Hello, world!<br></div>

--0000000000008a01e1059229eebe--
--0000000000008a01e4059229eec0
Content-Type: image/png; name="Lenna_(test_image).png"
Content-Disposition: attachment; filename="Lenna_(test_image).png"
Content-Transfer-Encoding: base64
Content-ID: <f_k0d8idqy0>
X-Attachment-Id: f_k0d8idqy0

# snip

--0000000000008a01e4059229eec0--

Because multipart messages can be arbitrarily nested, and may contain large blocks of encoded text (e.g. attachments), parsing can be slow and complicated. Further, text is usually encoded in some weird way e.g. quoted-printable, so everything has to be processed to actually be useful to the user.
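To illustrate why nesting complicates parsing, here is a rough sketch of splitting a body on a single boundary. The function name split_multipart is hypothetical and this is not how crumble does it. A nested multipart part comes back as one raw part and has to be split again with its own boundary.

```rust
/// Split a multipart body into raw parts using one boundary string.
/// Each returned part still contains its own short header and body.
/// Text before the first delimiter and after the closing delimiter is
/// ignored; a part that is never terminated is dropped in this sketch.
fn split_multipart<'a>(body: &'a str, boundary: &str) -> Vec<&'a str> {
    let delimiter = format!("--{}", boundary);
    let closing = format!("--{}--", boundary);
    let mut parts = Vec::new();
    let mut start: Option<usize> = None; // byte offset where the current part begins
    let mut offset = 0; // byte offset of the current line
    for line in body.split_inclusive('\n') {
        let marker = line.trim_end();
        if marker == delimiter || marker == closing {
            if let Some(begin) = start {
                // Trailing line breaks before the delimiter are trimmed.
                let part = &body[begin..offset];
                parts.push(part.trim_end_matches(|c: char| c == '\r' || c == '\n'));
            }
            if marker == closing {
                return parts;
            }
            start = Some(offset + line.len());
        }
        offset += line.len();
    }
    parts
}
```

Applied to the example above with the outer boundary, this would yield two parts: the multipart/alternative block (still unsplit) and the image attachment.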

crinkle separates out a MIME document into a TOML-encoded header file, and then each section into separate files. Attachments are stored in their native format. Text sections are stored in some sensible format (HTML, markdown, plaintext; all utf-8). The resultant files are all then stored in a tarball. bastion stores emails in this format; each file is independently encrypted. crinkle adds some metadata to the header to note the structure of (possibly nested) sections.

Hence, our goals are simply to

  • simplify headers for easy parsing and processing
  • separate out multipart sections for easy parsing and processing
  • fix message encoding to sensible defaults (utf-8, pandoc markdown if appropriate)

This repo contains a lib under crinkle, and two binaries under src/bin. The crinkle binary simply takes some MIME as input and outputs a crinkle tarball. uncrinkle does the opposite.

#Data types

Headers are a TOML table, i.e. key/values. Each value may be any valid TOML type. Sections may be plaintext or HTML. There is also an experimental feature which uses pandoc to convert HTML to markdown. All encoding (e.g. quoted-printable, base64) is stripped.

#Strings

These may be multiline for long headers.

Example: subject.

#Numbers (usize)

Example: num_sections.

#Arrays

Example: emailer fields, from, to; attachments.

#Tables

Valid table types are:

#Emailer

{email = "example@example.com", name = "example"}

#Attachment

{filename = "foo.jpg", content-type = 'foo/bar', size = 1}

#ContentType

{...}

#ContentDisposition

#Dates

Currently, rust-toml does not support dates. So for the moment these are stored as a string, in the format 2019-09-01 18:47:17 UTC. In the future a proper date type will be implemented.
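As a rough illustration of the string format above, the sketch below renders a Unix timestamp as 2019-09-01 18:47:17 UTC using only the standard library. Taking a Unix timestamp as input and the name format_utc are assumptions made for this example; the source does not say how crinkle produces the string, and a date library would be the more usual choice.

```rust
/// Format seconds since the Unix epoch as "YYYY-MM-DD HH:MM:SS UTC".
/// `format_utc` is a hypothetical helper, not part of crinkle.
fn format_utc(unix_seconds: i64) -> String {
    let days = unix_seconds.div_euclid(86_400);
    let secs = unix_seconds.rem_euclid(86_400);
    // Convert a day count to a civil date (proleptic Gregorian calendar).
    // The year is treated as starting on 1 March so that the leap day
    // falls at the end of the year.
    let z = days + 719_468; // days since 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day within the cycle
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of the March-based year
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    format!(
        "{:04}-{:02}-{:02} {:02}:{:02}:{:02} UTC",
        year,
        month,
        day,
        secs / 3_600,
        secs % 3_600 / 60,
        secs % 60
    )
}
```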

#Headers

All header values are processed:

  • base64 is decoded.
  • quoted-printable is decoded.
  • all strings should be UTF-8.

#Primary

Headers are deduplicated, to satisfy TOML requirements:

  • MIME messages may contain repeated keys, each with a different value.
  • In these cases, each value is collected, processed, and put into an array.
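A minimal sketch of this deduplication step, assuming the headers have already been parsed into (key, value) pairs. The function name collect_duplicates is hypothetical. A BTreeMap is used here only for deterministic output; whether crinkle sorts keys or preserves their original order is not stated.

```rust
use std::collections::BTreeMap;

/// Group header values by key so that a repeated key becomes one array.
/// Values keep the order in which they appeared in the message.
/// `collect_duplicates` is a hypothetical name used for illustration.
fn collect_duplicates(headers: &[(String, String)]) -> BTreeMap<String, Vec<String>> {
    let mut grouped: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (key, value) in headers {
        grouped
            .entry(key.to_lowercase())
            .or_default()
            .push(value.clone());
    }
    grouped
}
```

A key with a single value can then be written as a plain string, and a key with several values as an array, as in the received example.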

Example:

received = ["""from o3.email.bandcamp.com (o3.email.bandcamp.com [198.21.0.215])
        by example.com (OpenSMTPD) with ESMTPS id a1fa99f4 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO)
        for <example@example.com>;
        Sun, 1 Sep 2019 18:47:11 +0000 (UTC)""", """by filter0116p1iad2.sendgrid.net with SMTP id filter0116p1iad2-543-5D6C1235-13
        2019-09-01 18:47:17.604755642 +0000 UTC m=+429561.889968037""", """from bandcamp.com (ef.82.3da9.ip4.static.sl-reverse.com [169.61.130.239])
        by ismtpd0009p1sjc2.sendgrid.net (SG) with ESMTP id K7NdDks5Rp-hj6zekhaxVQ
        for <example@example.com>; Sun, 01 Sep 2019 18:47:16.923 +0000 (UTC)"""]

Single-line values are stored as single-line TOML strings. Example:

return-path = "<bounces+715912-0519-example=example.com@email.bandcamp.com>"

Multi-line values are stored as multiline TOML strings. Example:

dkim-signature = """v=1; a=rsa-sha1; c=relaxed/relaxed; 
        d=email.bandcamp.com; 
        h=from:to:subject:reply-to:mime-version:content-type; s=smtpapi; 
        bh=GMq8o6onqHwkbKkFgqnrnpHezjc=; b=zdXYpJZGtSuUMv2xRDL3DhRYrQwXZ
        U8crpl/b+TLRc+h/GZcddBH1Mw6kg+FAs5Nuy1npOE7d3zXACBDE95hya/RkcF8Q
        ya7fvewMlGBBfw8ZFRhukaDnTYc9GdyWn/rd5K33a8g7QdlfVpeL++x5sFAcyfVO
        VkPI6RaTPa89DM="""

If there is ambiguity, default to multiline strings.
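The choice between the two string forms could look like the sketch below. The function name toml_string is hypothetical, and it follows the double-quoted form used in the examples above. It only escapes backslashes and double quotes; other control characters, and a value that begins with a newline, are not handled.

```rust
/// Render a header value as a TOML basic string, choosing the
/// multiline form ("""...""") when the value contains a newline.
/// `toml_string` is a hypothetical helper, not crinkle's own code.
fn toml_string(value: &str) -> String {
    // Backslashes and double quotes must be escaped in basic strings.
    let escaped = value.replace('\\', "\\\\").replace('"', "\\\"");
    if value.contains('\n') {
        format!("\"\"\"{}\"\"\"", escaped)
    } else {
        format!("\"{}\"", escaped)
    }
}
```

In practice a TOML library would do this escaping; the sketch only shows where the single-line versus multiline decision is made.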

content-type is represented as a table. Example:

content-type = {boundary = "it_was_only_a_kiss", mime_type = "multipart/alternative"}

An additional num_sections element is added, which counts the number of sections.

An additional attachments element is added. This is an array, with elements of type Attachment.

#Section

---
content-type = {charset = "utf-8", mime_type = "text/plain"}
content-transfer-encoding = "quoted-printable"
section-id = "a187c745-2365-4c9c-b7b2-8f341706e459"
parent-id = "root"
---
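Reading such a section back could be sketched as below. This assumes LF line endings and that each --- delimiter sits on its own line, which the example suggests but the source does not spell out. The name split_section is hypothetical.

```rust
/// Split a stored section into its TOML header text and its body.
/// Returns None if the text does not start with a "---" delimited header.
/// `split_section` is a hypothetical name used for illustration.
fn split_section(text: &str) -> Option<(&str, &str)> {
    // The header starts after the first delimiter line...
    let rest = text.strip_prefix("---\n")?;
    // ...and ends at the next delimiter line; everything after is body.
    let (header, body) = rest.split_once("\n---\n")?;
    Some((header, body))
}
```

The header text can then be handed to any TOML parser, and the body used as is.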

#Sections

Sections have a body and a header. The header is described above. The body is UTF-8 text. It may be:

  • plain text
  • HTML
  • markdown (experimental)

Nested sections are flattened. The tree can be reconstructed based on parent-id fields in the header. Top level sections have parent root.
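Reconstructing the tree from the flattened sections might look like the sketch below. The SectionIds struct and both function names are hypothetical; only the section-id and parent-id fields and the root parent come from the format described here. The walk assumes the ids do not form a cycle.

```rust
use std::collections::HashMap;

/// Minimal view of a section header: only the two ids matter here.
/// This struct is hypothetical, not a type from the crinkle library.
struct SectionIds {
    section_id: String,
    parent_id: String,
}

/// Map each parent id (including "root") to its children, in input order.
fn children_by_parent(sections: &[SectionIds]) -> HashMap<&str, Vec<&str>> {
    let mut tree: HashMap<&str, Vec<&str>> = HashMap::new();
    for s in sections {
        tree.entry(s.parent_id.as_str())
            .or_default()
            .push(s.section_id.as_str());
    }
    tree
}

/// Walk the tree depth-first from `parent`, collecting (depth, id) pairs.
fn walk<'a>(
    tree: &HashMap<&'a str, Vec<&'a str>>,
    parent: &str,
    depth: usize,
    out: &mut Vec<(usize, &'a str)>,
) {
    if let Some(children) = tree.get(parent) {
        for &child in children {
            out.push((depth, child));
            walk(tree, child, depth + 1, out);
        }
    }
}
```

Starting the walk at "root" with depth 0 visits top-level sections first and their nested sections directly after them.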

#Attachments

Attachments are raw bytes. On extracting the tarball, the file is immediately valid.
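Getting from the base64 text in the MIME document to raw bytes can be sketched with the standard library alone. This is an illustrative decoder, not the one crinkle uses: the name decode_base64 is hypothetical, it skips whitespace, stops at the first = padding character, and does not validate padding.

```rust
/// Decode standard base64 text into raw bytes.
/// Whitespace (such as the line breaks in a MIME body) is skipped.
/// Returns None if a character outside the base64 alphabet is found.
/// `decode_base64` is a hypothetical helper for illustration.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = Vec::new();
    let mut buffer: u32 = 0; // bits collected so far
    let mut bits: u32 = 0; // number of valid bits in `buffer`
    for byte in input.bytes() {
        if byte == b'=' {
            break; // padding: no more data follows
        }
        if byte.is_ascii_whitespace() {
            continue;
        }
        let value = ALPHABET.iter().position(|&c| c == byte)? as u32;
        buffer = (buffer << 6) | value;
        bits += 6;
        if bits >= 8 {
            // Emit the top eight bits and keep the remainder.
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1u32 << bits) - 1;
        }
    }
    Some(out)
}
```

The resulting bytes can be written to disk unchanged, which is why an extracted attachment is immediately a valid file.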

#Library structure

  • content_disposition.rs contains some classes for parsing the content-disposition field.
  • content_type.rs contains some classes for parsing the content-type field.
  • emailer.rs contains some classes for parsing the email-related fields. There is room to replace the parser in here with something more robust.
  • lib.rs contains the bulk of the library.
  • section.rs contains classes related to parsing sections. To be general, I use a tagged enum. A Section may be the body from a plain document (i.e. just some text), a section from a multipart document (i.e. some text and a header), an attachment (i.e. some data and metadata), or empty.
  • test.rs/test/ contains tests. Obviously.

#Library implementation

I use crumble to first parse the document. This is a minimal parser that returns the headers, and unprocessed sections. Crinkle takes this data and flattens the sections, processes them appropriately, and then repacks everything as a tarball. It's then straightforward to reverse this (see the uncrinkle function).

There is a large block at the start of the file that is mostly concerned with converting headers into a TOML format. The crinkle function outputs crinkle tarballs, and uncrinkle does the reverse. Everything else is concerned with processing the MIME document.

#Types

#ContentDisposition

A disposition, and a vector of key/value pairs. Dispositions:

Inline,
Attachment,
FormData,
Signal,
Alert,
Icon,
Render,
RecipientListHistory,
Session,
AIB,
EarlySession,
RecipientList,
Notification,
ByReference,
InfoPackage,
RecordingSession,
Empty,
Unknown,

The key/value pairs are parameters, found after the ; in the MIME.
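The sketch below shows how the disposition token could be mapped onto these variants. The variant names come from the list above; the string spellings follow the IANA content disposition registry and are an assumption about what crinkle matches on. Treating a missing value as Empty and an unrecognised one as Unknown is also an assumption, as is the function name parse_disposition.

```rust
#[derive(Debug, PartialEq)]
enum Disposition {
    Inline, Attachment, FormData, Signal, Alert, Icon, Render,
    RecipientListHistory, Session, AIB, EarlySession, RecipientList,
    Notification, ByReference, InfoPackage, RecordingSession,
    Empty, Unknown,
}

/// Map the token before the first ';' onto a Disposition.
/// The string spellings are assumed, not taken from crinkle's source.
fn parse_disposition(field: &str) -> Disposition {
    let token = field.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
    match token.as_str() {
        "" => Disposition::Empty,
        "inline" => Disposition::Inline,
        "attachment" => Disposition::Attachment,
        "form-data" => Disposition::FormData,
        "signal" => Disposition::Signal,
        "alert" => Disposition::Alert,
        "icon" => Disposition::Icon,
        "render" => Disposition::Render,
        "recipient-list-history" => Disposition::RecipientListHistory,
        "session" => Disposition::Session,
        "aib" => Disposition::AIB,
        "early-session" => Disposition::EarlySession,
        "recipient-list" => Disposition::RecipientList,
        "notification" => Disposition::Notification,
        "by-reference" => Disposition::ByReference,
        "info-package" => Disposition::InfoPackage,
        "recording-session" => Disposition::RecordingSession,
        _ => Disposition::Unknown,
    }
}
```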

#ContentType

A MIME type, and a vector of key/value pairs. These are parameters, found after the ; in the MIME.
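As a rough sketch of this shape, the struct and parser below split a content-type field into a MIME type and its parameters. The field names mime_type and params and the function name parse_content_type are hypothetical. Splitting on ; is naive: a quoted parameter value that itself contains ; would be broken up.

```rust
/// A MIME type plus its parameters, in the order they appeared.
/// Field names here are illustrative, not crinkle's actual definition.
#[derive(Debug, PartialEq)]
struct ContentType {
    mime_type: String,
    params: Vec<(String, String)>,
}

/// Parse e.g. `multipart/mixed; boundary="abc"` into a ContentType.
fn parse_content_type(field: &str) -> ContentType {
    let mut pieces = field.split(';');
    // The first piece is the MIME type itself.
    let mime_type = pieces.next().unwrap_or("").trim().to_ascii_lowercase();
    // Remaining pieces are key=value parameters; surrounding quotes are dropped.
    let params = pieces
        .filter_map(|piece| {
            let (key, value) = piece.split_once('=')?;
            Some((
                key.trim().to_ascii_lowercase(),
                value.trim().trim_matches('"').to_string(),
            ))
        })
        .collect();
    ContentType { mime_type, params }
}
```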

#Emailer

A name and an email.
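A naive sketch of turning an address field into Emailer values is shown below. The function names are hypothetical and this is not the parser in emailer.rs. It splits on commas and looks for an address in angle brackets, so a quoted display name containing a comma would be split incorrectly.

```rust
/// A display name and an email address; the name may be empty.
#[derive(Debug, PartialEq)]
struct Emailer {
    name: String,
    email: String,
}

/// Parse one entry such as `Example <example@example.com>` or `bar@example`.
fn parse_emailer(raw: &str) -> Emailer {
    let raw = raw.trim();
    match (raw.rfind('<'), raw.rfind('>')) {
        // "Name <address>": the name is whatever precedes the brackets.
        (Some(open), Some(close)) if open < close => Emailer {
            name: raw[..open].trim().trim_matches('"').to_string(),
            email: raw[open + 1..close].trim().to_string(),
        },
        // Bare address: there is no display name.
        _ => Emailer { name: String::new(), email: raw.to_string() },
    }
}

/// Parse a comma-separated field (e.g. to, from) into a list of Emailers.
fn parse_emailers(field: &str) -> Vec<Emailer> {
    field
        .split(',')
        .map(parse_emailer)
        .filter(|e| !e.email.is_empty())
        .collect()
}
```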

#Informal specification

  • fields that contain emails (e.g. to: foo@example, bar@example) are converted to an array of Emailer objects.
  • dates are stored in ISO 8601 format, but as strings, not the native TOML format. At time of writing, the Rust TOML library doesn't expose the datetime type.
  • text fields that contain newlines, :, or quotes are wrapped in triple quotes. Otherwise, they are wrapped in single quotes.
  • text sections contain a header sandwiched between --- and then the body.
  • plain sections are converted into Section by adding a small header with appropriate metadata.
  • multipart sections inherit the header from the MIME document.
  • each section is given a random UUID. These UUIDs are used to record the structure via parent fields added to each header.
  • a num-sections field is added to the main header.
  • an attachments field is added to the main header.
  • if a section or header is encoded as quoted-printable, this is fixed.
  • if a section is encoded as base64 but is text rather than an attachment, it is decoded to text.
  • if the pandoc feature is enabled, convert HTML to markdown.
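The quoted-printable step in the list above can be sketched as follows. This is an illustrative decoder under the assumption that the decoded bytes are UTF-8; a real message may declare another charset, which would need converting separately. The name decode_quoted_printable is hypothetical.

```rust
/// Value of one hexadecimal digit, or None if the byte is not a hex digit.
fn hex_value(byte: u8) -> Option<u8> {
    (byte as char).to_digit(16).map(|d| d as u8)
}

/// Decode quoted-printable text into a UTF-8 string.
/// "=XX" is a hex-encoded byte; "=" at the end of a line is a soft
/// line break and is removed. Invalid UTF-8 is replaced lossily.
fn decode_quoted_printable(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out: Vec<u8> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] != b'=' {
            out.push(bytes[i]);
            i += 1;
        } else if bytes[i + 1..].starts_with(b"\r\n") {
            i += 3; // soft line break (CRLF)
        } else if bytes[i + 1..].starts_with(b"\n") {
            i += 2; // soft line break (bare LF)
        } else {
            let hi = bytes.get(i + 1).copied().and_then(hex_value);
            let lo = bytes.get(i + 2).copied().and_then(hex_value);
            match (hi, lo) {
                (Some(hi), Some(lo)) => {
                    out.push(hi * 16 + lo);
                    i += 3;
                }
                // Malformed escape: keep the '=' literally.
                _ => {
                    out.push(b'=');
                    i += 1;
                }
            }
        }
    }
    String::from_utf8_lossy(&out).into_owned()
}
```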

#Licence

GPLv3+
