Tool for processing MIME documents (in particular, emails) and storing in a nicer format.
Each section of the document will be processed according to its type, e.g. converting quoted-printable into UTF-8 and converting base64 to binary. Each section is then packed into a tarball.
Clone, then run ./configure; make; sudo make install. This will automatically enable pandoc for converting HTML to MD if pandoc is installed. The binary is installed to /usr/local/bin/crinkle.
Raw MIME documents are passed to the binary via stdin or a filename as an arg. The output file is set with -o: crinkle [input] -o <output>. Note that /dev/stdin is a valid output so crinkle can be used in an SMTP pipeline.
This document specifies an open format for storing emails as files on disk. It provides particular guarantees allowing easy and reproducible email parsing and handling.
This is a living specification and is developed in conjunction with this repo. It is currently beta quality and should stabilise in the future.
Emails are transmitted and recorded in MIME format. This format is specified in RFC 2045, 2046, 2047, 4288, 4289, and 2049. Unfortunately, real-world MIME implementations do not strictly follow the spec. For example, headers may be improperly formatted, date formats are arbitrary, and so on. Even without these issues, the format suffers from decades-old poor design: it is being used far beyond what its creators envisioned. Significant issues include:
bastion aims to resolve some of the security issues. bastion relies on crinkle to store messages robustly, simply and reliably. crinkle solves all the above issues. Of course it may be used elsewhere.
crinkle is very simple and is designed to be handled with standard UNIX-like utilities:
By using TOML, all issues parsing MIME headers are avoided and instead one may use existing libraries. TOML is a recent, well-defined, simple format for storing key/value text. By separating out and processing each section, one can easily view and further handle email bodies. By separating out the attachments, one can immediately serve them to a user. Packing into a tarball uses another standard format with archival guarantees.
There are two types of MIME documents: plain and multipart. Plain messages consist of a sequence of key: value pairs (e.g. from, to, subject fields; the header), at least 2 new lines, and then a body:
key0: value0
key1: value1
# ...more key-value pairs...

Hello, world!
Note about subtleties: Header fields have the form key: value, with value containing all possible characters, including newlines. This means that parsing a header is slightly more complicated than one might think initially. Headers can also contain duplicate fields.
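The two subtleties above (values that span lines, and repeated fields) can be sketched in a few lines of Rust. This is a minimal illustration, not the parser crinkle or crumble actually uses: the function name parse_headers is hypothetical, keys are lower-cased as an assumption, and a line starting with a space or tab is treated as a continuation of the previous value.

```rust
/// Split a raw header block into (key, value) pairs.
///
/// A line starting with a space or tab continues the previous value, so a
/// value may span several lines. Duplicate keys are kept, in order.
/// Hypothetical helper; not crinkle's real API.
fn parse_headers(raw: &str) -> Vec<(String, String)> {
    let mut fields: Vec<(String, String)> = Vec::new();
    for line in raw.lines() {
        if line.is_empty() {
            break; // blank line: the header block is over
        }
        if line.starts_with(' ') || line.starts_with('\t') {
            // Continuation line: append to the most recent value.
            if let Some((_, value)) = fields.last_mut() {
                value.push('\n');
                value.push_str(line.trim_start());
            }
        } else if let Some((key, value)) = line.split_once(':') {
            // Assumption: keys are normalised to lower case.
            fields.push((key.trim().to_lowercase(), value.trim().to_string()));
        }
        // Anything else is malformed and is skipped in this sketch.
    }
    fields
}
```

Calling parse_headers on a block with two Received fields returns two separate entries, which is why a later deduplication step is needed before writing TOML.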
Multipart messages consist of a header much the same as for plain, and one or more (possibly nested) parts (or sections). All sections have some text type and are delimited by a boundary defined by the content-type in the header, e.g. Content-Type: multipart/mixed; boundary="0000000000008a01e4059229eec0". Each section contains a short header (defining e.g. encoding and other metadata) and some body. For example, an email could have a plain-text encoded section, an HTML-encoded section, and a base64-encoded attachment:
key0: value0
key1: value1
# ...more key-value pairs...
Content-Type: multipart/mixed; boundary="0000000000008a01e4059229eec0"

--0000000000008a01e4059229eec0
Content-Type: multipart/alternative; boundary="0000000000008a01e1059229eebe"

--0000000000008a01e1059229eebe
Content-Type: text/plain; charset="UTF-8"

Hello, world!
--0000000000008a01e1059229eebe
Content-Type: text/html; charset="UTF-8"

<div dir="ltr">Hello, world!<br></div>
--0000000000008a01e1059229eebe--
--0000000000008a01e4059229eec0
Content-Type: image/png; name="Lenna_(test_image).png"
Content-Disposition: attachment; filename="Lenna_(test_image).png"
Content-Transfer-Encoding: base64
Content-ID: <f_k0d8idqy0>
X-Attachment-Id: f_k0d8idqy0

# snip
--0000000000008a01e4059229eec0--
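To make the boundary mechanics concrete, here is a rough Rust sketch that splits one level of a multipart body into its parts. It is an illustration under simplifying assumptions, not how crumble does it: split_parts is a hypothetical name, boundary lines are matched exactly (no trailing whitespace), and nested parts are left untouched inside their parent so the function can be called again with the inner boundary.

```rust
/// Split one level of a multipart body on `--boundary` lines.
///
/// Text before the first delimiter (the preamble) and after the closing
/// `--boundary--` line (the epilogue) is ignored. Hypothetical helper.
fn split_parts(body: &str, boundary: &str) -> Vec<String> {
    let delimiter = format!("--{}", boundary);
    let terminator = format!("--{}--", boundary);
    let mut parts: Vec<String> = Vec::new();
    // Lines of the part currently being collected, if any.
    let mut current: Option<Vec<&str>> = None;
    for line in body.lines() {
        if line == delimiter || line == terminator {
            // A boundary closes the part in progress.
            if let Some(lines) = current.take() {
                parts.push(lines.join("\n"));
            }
            if line == terminator {
                break;
            }
            current = Some(Vec::new());
        } else if let Some(lines) = current.as_mut() {
            lines.push(line);
        }
    }
    parts
}
```

Each returned part still starts with its own short header, so it can be fed to a header parser and, if its content-type is itself multipart, split again.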
Because multipart messages can be arbitrarily nested, and may contain large blocks of encoded text (e.g. attachments), parsing can be slow and complicated. Further, text is usually encoded in some weird way e.g. quoted-printable, so everything has to be processed to actually be useful to the user.
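As an example of the processing involved, quoted-printable decoding can be sketched as below. This is a minimal version written for illustration and is not taken from crinkle: the function names are hypothetical, and malformed escapes are passed through unchanged, which is an assumption about how to be lenient.

```rust
/// Decode quoted-printable text: "=XX" is a hex-escaped byte, and "=" at the
/// end of a line is a soft line break that is dropped. Hypothetical helper.
fn decode_quoted_printable(input: &str) -> Vec<u8> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] != b'=' {
            out.push(bytes[i]);
            i += 1;
        } else if bytes[i + 1..].starts_with(b"\r\n") {
            i += 3; // soft break followed by CRLF
        } else if bytes[i + 1..].starts_with(b"\n") {
            i += 2; // soft break followed by a bare LF
        } else if let Some(byte) = hex_pair(bytes.get(i + 1..i + 3)) {
            out.push(byte);
            i += 3;
        } else {
            out.push(b'='); // malformed escape: keep the "=" literally
            i += 1;
        }
    }
    out
}

/// Turn two ASCII hex digits into one byte, if both are valid.
fn hex_pair(pair: Option<&[u8]>) -> Option<u8> {
    let pair = pair?;
    let hi = (pair[0] as char).to_digit(16)?;
    let lo = (pair[1] as char).to_digit(16)?;
    Some((hi * 16 + lo) as u8)
}
```

The result is raw bytes; turning them into UTF-8 text still depends on the charset named in the section's content-type.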
crinkle separates out a MIME document into a TOML-encoded header file, and then each section into separate files. Attachments are stored in their native format. Text sections are stored in some sensible format (HTML, markdown, plaintext; all utf-8). The resultant files are all then stored in a tarball. bastion stores emails in this format; each file is independently encrypted. crinkle adds some metadata to the header to note the structure of (possibly nested) sections.
Hence, our goals simply are to
This repo contains a lib under crinkle, and two binaries under src/bin. The crinkle binary simply takes some MIME as input and outputs a crinkle tarball. uncrinkle does the opposite.
Headers are a TOML table, i.e. key/values. Each value may be any valid TOML type. Sections may be plaintext or HTML. There is also an experimental feature which uses pandoc to convert HTML to markdown. All encoding (e.g. quoted-printable, base64) is stripped.
These may be multiline for long headers.
Example: subject.
Example: num_sections.
Example: emailer fields, from, to; attachments.
Valid table types are:
{email = "example@example.com", name = "example"}
{filename = "foo.jpg", content-type = 'foo/bar', size = 1}
{...}
Currently, rust-toml does not support dates, so for the moment these are stored as a string in the format 2019-09-01 18:47:17 UTC. In the future a proper date type will be implemented.
All header values are processed:
Headers are deduplicated, to satisfy TOML requirements. MIME messages may contain repeated keys, each with a different value. In these cases, each value is collected, processed, and put into an array.
Example:
received = ["""from o3.email.bandcamp.com (o3.email.bandcamp.com [198.21.0.215])
by example.com (OpenSMTPD) with ESMTPS id a1fa99f4 (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256:NO)
for <example@example.com>;
Sun, 1 Sep 2019 18:47:11 +0000 (UTC)""", """by filter0116p1iad2.sendgrid.net with SMTP id filter0116p1iad2-543-5D6C1235-13
2019-09-01 18:47:17.604755642 +0000 UTC m=+429561.889968037""", """from bandcamp.com (ef.82.3da9.ip4.static.sl-reverse.com [169.61.130.239])
by ismtpd0009p1sjc2.sendgrid.net (SG) with ESMTP id K7NdDks5Rp-hj6zekhaxVQ
for <example@example.com>; Sun, 01 Sep 2019 18:47:16.923 +0000 (UTC)"""]
Single-line values are stored as single-line TOML strings. Example:
return-path = "<bounces+715912-0519-example=example.com@email.bandcamp.com>"
Multi-line values are stored as multiline TOML strings. Example:
dkim-signature = """v=1; a=rsa-sha1; c=relaxed/relaxed;
d=email.bandcamp.com;
h=from:to:subject:reply-to:mime-version:content-type; s=smtpapi;
bh=GMq8o6onqHwkbKkFgqnrnpHezjc=; b=zdXYpJZGtSuUMv2xRDL3DhRYrQwXZ
U8crpl/b+TLRc+h/GZcddBH1Mw6kg+FAs5Nuy1npOE7d3zXACBDE95hya/RkcF8Q
ya7fvewMlGBBfw8ZFRhukaDnTYc9GdyWn/rd5K33a8g7QdlfVpeL++x5sFAcyfVO
VkPI6RaTPa89DM="""
If there is ambiguity, default to multiline strings.
content-type is represented as a table. Example:
content-type = {boundary = "it_was_only_a_kiss", mime_type = "multipart/alternative"}
An additional num_sections element is added, which counts the number of sections.
An additional attachments element is added. This is an array, with elements of type Attachment.
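The deduplication and string rules above can be sketched without any TOML library. This is a simplified illustration, not crinkle's code: toml_string and headers_to_toml are hypothetical names, "ambiguity" is reduced to "contains a newline", only backslashes and double quotes are escaped, and tables such as content-type are not handled.

```rust
use std::collections::BTreeMap;

/// Quote a value as a TOML basic string, using the multiline form when the
/// value contains a newline. Hypothetical helper.
fn toml_string(value: &str) -> String {
    // Escape backslashes first, then double quotes.
    let escaped = value.replace('\\', "\\\\").replace('"', "\\\"");
    if value.contains('\n') {
        format!("\"\"\"{}\"\"\"", escaped)
    } else {
        format!("\"{}\"", escaped)
    }
}

/// Collect repeated keys, then render each key as a single string or as an
/// array of strings. Keys are lower-cased and sorted (an assumption).
fn headers_to_toml(fields: &[(&str, &str)]) -> String {
    let mut grouped: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (key, value) in fields {
        grouped
            .entry(key.to_lowercase())
            .or_default()
            .push(toml_string(value));
    }
    let mut out = String::new();
    for (key, values) in &grouped {
        if values.len() == 1 {
            out.push_str(&format!("{} = {}\n", key, values[0]));
        } else {
            out.push_str(&format!("{} = [{}]\n", key, values.join(", ")));
        }
    }
    out
}
```

With two Received fields and one Subject field as input, this yields one received array and one subject string, matching the shape of the examples above.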
---
content-type = {charset = "utf-8", mime_type = "text/plain"}
content-transfer-encoding = "quoted-printable"
section-id = "a187c745-2365-4c9c-b7b2-8f341706e459"
parent-id = "root"
---
Sections have a body and a header. The header is described above. The body is UTF-8 text. It may be:
Nested sections are flattened. The tree can be reconstructed based on parent-id fields in the header. Top-level sections have parent root.
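Reconstructing the tree from the flat list is a matter of grouping sections by parent-id and walking down from root. The Rust sketch below shows one way to do it; the function names are hypothetical, the input is reduced to (section-id, parent-id) pairs, and it assumes the ids are well formed (no cycles).

```rust
use std::collections::BTreeMap;

/// Group section ids by their parent id. Top-level sections have parent "root".
fn children_by_parent(sections: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut children: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (section_id, parent_id) in sections {
        children
            .entry(parent_id.to_string())
            .or_default()
            .push(section_id.to_string());
    }
    children
}

/// Walk the tree depth first from `parent`, pushing one indented line per
/// section so the nesting is visible.
fn render_tree(
    children: &BTreeMap<String, Vec<String>>,
    parent: &str,
    depth: usize,
    out: &mut Vec<String>,
) {
    if let Some(ids) = children.get(parent) {
        for id in ids {
            out.push(format!("{}{}", "  ".repeat(depth), id));
            render_tree(children, id, depth + 1, out);
        }
    }
}
```

Sections keep the order in which they appear in the flat list, so siblings come out in their original order.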
Attachments are raw bytes. On extracting the tarball, the file is immediately valid.
content_disposition.rs contains some classes for parsing the content-disposition field.
content_type.rs contains some classes for parsing the content-type field.
emailer.rs contains some classes for parsing the email-related fields. There is room to replace the parser in here with something more robust.
lib.rs contains the bulk of the library.
section.rs contains classes related to parsing sections. To be general, I use a tagged enum. A Section may be the body from a plain document (i.e. just some text), a section from a multipart document (i.e. some text and a header), an attachment (i.e. some data and metadata), or empty.
test.rs and /test/ contain tests. Obviously.

I use crumble to first parse the document. This is a minimal parser that returns the headers, and unprocessed sections. Crinkle takes this data and flattens the sections, processes them appropriately, and then repacks everything as a tarball. It's then straightforward to reverse this (see the uncrinkle function).
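For readers unfamiliar with tagged enums, the four cases listed for section.rs could look roughly like this in Rust. The variant and field names here are guesses made for illustration; the real Section type in section.rs may be shaped differently.

```rust
/// Hypothetical variants following the four cases described for section.rs.
#[derive(Debug)]
enum Section {
    /// Body of a plain document: just some text.
    Plain { body: String },
    /// Section of a multipart document: some text and a header.
    Multipart { header: Vec<(String, String)>, body: String },
    /// Attachment: some data and metadata.
    Attachment { filename: String, data: Vec<u8> },
    Empty,
}

/// A match must cover every variant, so adding a new kind of section forces
/// every handler like this one to be updated.
fn describe(section: &Section) -> String {
    match section {
        Section::Plain { body } => format!("plain body, {} bytes", body.len()),
        Section::Multipart { header, body } => format!(
            "section with {} header fields, {} bytes",
            header.len(),
            body.len()
        ),
        Section::Attachment { filename, data } => {
            format!("attachment {}, {} bytes", filename, data.len())
        }
        Section::Empty => "empty".to_string(),
    }
}
```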
There is a large block at the start of the file that is mostly concerned with converting headers into a TOML format. The crinkle function outputs crinkle tarballs, and uncrinkle does the reverse. Everything else is concerned with processing the MIME document.
ContentDisposition
A disposition, and a vector of key/value pairs. Dispositions:
Inline,
Attachment,
FormData,
Signal,
Alert,
Icon,
Render,
RecipientListHistory,
Session,
AIB,
EarlySession,
RecipientList,
Notification,
ByReference,
InfoPackage,
RecordingSession,
Empty,
Unknown,
The key/value pairs are parameters, found after the ; in the MIME.
ContentType
A MIME type, and a vector of key/value pairs. These are parameters, found after the ; in the MIME.
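A rough Rust sketch of this shape, for illustration only: the struct below reuses the name ContentType, but its field names and the parse_content_type function are assumptions, not the definitions in content_type.rs. It splits naively on ;, so a quoted parameter value that itself contains a ; would be mis-parsed.

```rust
/// A MIME type plus its parameters, in the order they appeared.
/// Field names are assumptions made for this sketch.
#[derive(Debug, PartialEq)]
struct ContentType {
    mime_type: String,
    params: Vec<(String, String)>,
}

/// Parse a content-type value such as `text/plain; charset="UTF-8"`.
fn parse_content_type(value: &str) -> ContentType {
    let mut parts = value.split(';');
    // Everything before the first ";" is the MIME type itself.
    let mime_type = parts.next().unwrap_or("").trim().to_lowercase();
    let mut params = Vec::new();
    for part in parts {
        if let Some((key, val)) = part.split_once('=') {
            // Strip optional surrounding double quotes from the value.
            let val = val.trim().trim_matches('"');
            params.push((key.trim().to_lowercase(), val.to_string()));
        }
    }
    ContentType { mime_type, params }
}
```

The same approach would work for content-disposition, with the disposition name taking the place of the MIME type.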
Emailer
A name and an email.
to: foo@example, bar@example) are converted to an array of Emailer objects.
:, or quotes are wrapped in triple quotes. Otherwise, they are wrapped in single quotes.
--- and then the body.
Section by adding a small header with appropriate metadata.
parent fields added to each header.
num-sections field is added to the main header.
attachments field is added to the main header.
GPLv3+