~ivilata/gwit-spec

gwit - Web sites over Git (specification)

#gwit - Web sites over Git

This specification is part of an experiment in turning Git repositories into a minimalist Web of replicated, verifiable, durable sites. It defines how to link to content in any gwit site regardless of where it is hosted, and it provides a means for a site to help discover other gwit sites and access them for the first time. See the gwit project page for a high-level introduction, design goals and other documentation.

gwit inherits many of Git's distributed properties:

  • Someone who accesses a gwit site for the first time gets a full copy (a Git clone) of it, including previous versions.

    The whole site thus becomes available without further network access, enabling offline reading and search.

  • Updates to the site may be fetched from other locations which host copies of the site (Git remotes).

    A location may be some local external media (like a USB drive), allowing sneakernet scenarios. Also, a local copy of a gwit site may be made available to others, thus increasing site availability (also for archival and censorship circumvention purposes).

  • To verify the authenticity of site content coming from diverse locations, gwit makes use of Git's support of PGP signatures over commits.

gwit is intended for static lightweight sites, with a majority of textual content, and for existing Git repositories. It does not try to cover every possible use case, so as to stay as simple as possible while being useful enough.

This project is heavily inspired by Solderpunk's article Low budget P2P content distribution with git (Gemini link, Web link). We recommend reading that article to understand the reasons and decisions behind the project.

The specification is waived into the Public Domain by its authors under the Creative Commons CC0 1.0 Universal license.

Note: This specification is a work in progress, and it may change in backwards-incompatible ways. Please take that into account for any implementation based on it. If you want to participate in the evolution of the specification, please check the gwit-spec mailing list.

Note: The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 RFC2119 RFC8174 when, and only when, they appear in all capitals, as shown here.

Note: The text below contains some POSIX shell commands. They are only provided for illustrative and clarification purposes, so implementations may choose other approaches (for instance, they may use Git working trees instead of bare repositories as in the example commands below).

#About the name "gwit"

The word "gwit" (pronounced [gu̯it]) reflects the "Web in Git" concept, but it is also a light pun on the recurring mispronunciation of other project names like GNU Guix and guifi·net.

#Basic concepts

A gwit site is just a Git repository branch associated with a PGP public key whose private keys are used to sign commits in that branch. While that site key is public, its private keys are only owned by the site author. Since that site branch can be fetched from different locations managed by people other than the author, a location is not used to identify the site. Instead, the full fingerprint of the site key is used as its identifier (usually represented as a string of 40 hexadecimal digits, lower or upper case).

This means that the relation between site and site key is one-to-one: if an author wants to create another site, then a new site key MUST be created. Thus, using one's day-to-day PGP key as a site key is NOT RECOMMENDED. The mechanisms to relate a site (and its key) to a particular identity outside of gwit are out of the scope of this specification.

As gwit site identifiers are neither meaningful nor memorable to humans, some support is provided to allow using petnames for sites. This specification uses the concepts of petname, edge name, and (self-)proposed name from the paper Petnames: A humane approach to secure, decentralized naming.

To get content from a site, one needs to access an existing copy of it. That copy MUST be a Git repository, and its location MUST be expressed as a URL allowed by Git for a remote. Though the URL is opaque to gwit (e.g. it may use whatever Git-supported protocol), the associated remote SHOULD be accessible without external credentials like passwords.

Since gwit is based on Git, a gwit site is made up of static files and directories. Except for a few files with site metadata (described below), the specification does not mandate any structure or file types.

gwit defines a URI format (described further below) for referring (linking) to a file or directory by its path in a site indicated by its identifier. The URI may optionally target a specific version of the site.

Note: As an example of how to bind the author's day-to-day key to a particular site, the latter may include some statement, signed by the author's day-to-day key, claiming ownership of the site key by its fingerprint. Or following the Ariadne Identity Specification, the day-to-day key may include an identity claim with a gwit URI pointing to a file in the site that contains an identity proof for the key.

Note: gwit only supports one key per site. That limitation means that the site's private keys must be shared between different people if the site is to have multiple authors (PGP signing subkeys may be used to avoid sharing the primary private key, but they lack their own user IDs). Though there are systems (like Guix' secure updates) which support multiple authorized keys in a safe manner, they would make gwit much more complex in aspects like content authentication or revocation of past commits by the site author.

#Site requirements

The site branch associated with a gwit site MUST be named gwit-0x........, where the dots represent the lower case short ID of the PGP site key, i.e. the last 8 hexadecimal digits of the site identifier in lower case. Example: gwit-0x76543210.
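
As an illustration, the branch name can be derived from the site identifier with a few lines of POSIX shell. This is only a sketch: site_branch is a hypothetical helper name, and the check for exactly 40 hexadecimal digits assumes the usual fingerprint length mentioned above.

```sh
# Hypothetical helper: print the gwit site branch name for a site identifier.
# $1: full site key fingerprint, with or without a 0x/0X prefix.
site_branch() {
    # Drop the optional prefix and lower-case the hexadecimal digits.
    fpr=$(printf '%s' "$1" | sed 's/^0[xX]//' | tr 'A-F' 'a-f')
    # Assumption: a full fingerprint has exactly 40 hexadecimal digits.
    case $fpr in
        *[!0-9a-f]*) return 1 ;;
    esac
    [ "${#fpr}" -eq 40 ] || return 1
    # The short ID is the last 8 digits of the fingerprint.
    printf 'gwit-0x%s\n' "$(printf '%s' "$fpr" | sed 's/^.*\(........\)$/\1/')"
}
```

For example, site_branch 0xFEDCBA98765432100123456789ABCDEF76543210 prints gwit-0x76543210.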

This means that the same Git repository may hold different but related gwit sites, each one with its own branch and key. For instance, while the master or main branch may contain common sources for a static site generator, generated Gemini and Web files may go to separate gwit site branches.

Given the commit at the head of a site branch in a Git repository, its top directory MUST include a .gwit directory (dot gwit), which in turn:

  • MUST include a self.key file containing the site key (and any signing subkeys) in OpenPGP public key format, ASCII-armored or not (e.g. the output of gpg --export [--armor] <SITE-KEY>). Although the primary key itself SHOULD NOT change, subsequent updates to .gwit/self.key MAY add new subkeys, identities, signatures, revocations and other metadata.
  • SHOULD include a self.ini file with the site configuration. Its contents are described further below.

Also, the commit MUST be signed by the private key associated with the site key, or by a signing subkey of it.

The head commit of a site branch and its ancestors constitute the different versions in the known history of the site. Non-head ancestor commits are not required to meet the conditions of the head commit: for instance, they may be commits previous to the creation of the site branch, intermediate work commits, or commits merged from other branches into the site branch (in the latter case, the author may use the merge commit to fulfil head commit requirements).

No restrictions are placed upon content files themselves, but it is RECOMMENDED that each file name includes an extension that helps programs recognize the type of the file (as in document.txt, page.html or image.jpeg).

#Site configuration file

.gwit/self.ini has the same format as Git configuration files, which can be summarized as an INI file where subsection definitions have a [section-name "subsection-name"] format. It MUST be encoded using UTF-8, all its values MUST be considered as simple strings (i.e. no special parsing of integers or pathnames), and includes MUST be disabled. Each single encoded value MUST NOT exceed 1000 bytes (unless otherwise stated below), values with multiple occurrences MUST NOT have more than 10 single values, and the whole file MUST NOT exceed 65536 bytes.

Recognized sections and values are described below, and unknown ones SHOULD be ignored. If a value marked as "single" is assigned more than once in the file, then the last assignment is used.

The [site "<ID>"] subsection of .gwit/self.ini contains some basic information about the gwit site, meant for its readers except where otherwise noted. Its <ID> MUST be the identifier of the site itself, encoded as 0x plus the lower case hexadecimal digits of the full fingerprint of the PGP site key. Example: [site "0xfedcba98765432100123456789abcdef76543210"]. Values recognized in the subsection are:

  • name (single, recommended): A short name or handle for the site. It MUST NOT (i) be empty or consist only of whitespace characters, (ii) contain newline or control characters, (iii) start with 0x or 0X. Example: "Foo Bar".

    A gwit client MAY regard the site name as a self-proposed name for the site; as such, the client SHOULD allow configuring a petname value that overrides it along with other proposed names for the site.

  • title (single, optional): A short text (in an unspecified language) to identify the site. It MUST NOT contain newline characters. Example: "Foo Bar: the Bar for all your Foos".

  • title-<LANGUAGE> (single, optional): A language-specific title for the site, with the same characteristics as title. <LANGUAGE> MUST be a two-letter ISO 639-1 language code. Example for title-fr (French): "Le Bar de Foo : le Bar pour tous vos Foos".

  • desc (single, optional): A longer text (in an unspecified language) describing the site, maybe over several lines or paragraphs. Its encoding MUST NOT exceed 4000 bytes.

  • desc-<LANGUAGE> (single, optional): A language-specific description for the site, with the same characteristics as desc. <LANGUAGE> MUST be a two-letter ISO 639-1 language code.

  • license (single, recommended): A short text hinting about the legal terms of use for the site, if meaningful. It MUST NOT contain newline characters. Example: "CC-BY-4.0" (meaning "Creative Commons Attribution 4.0 International" as per the SPDX License List).

  • root (single, optional): A directory to be used as the site's root directory instead of, and relative to, the commit's top directory. If missing, it defaults to that top directory. It MUST consist of one or more non-empty path components separated by a single forward slash (/). It MUST NOT contain . or .. path components. This is convenient when using a static site generator that writes its output to a directory. Example (for such a generated site): output.

  • index (single, optional): The name of the index file. It MUST NOT be empty, . or .., or contain slash characters (/). When a gwit client is told to retrieve a directory, and it contains a file named as the index file, the contents of the file SHOULD be produced instead of a directory listing. Example (for a site containing Gemini files): index.gmi.

  • remote (multiple, recommended): A location recommended by the author for retrieving the site, the URL of a Git remote. Multiple such locations may be given (for increased availability), each as a different remote value, which a client MAY consider in order of appearance. Example: https://git.example.net/foo/bar-site.git.

  • alt (multiple, optional): If given, the prefix for this site's URIs in a publication system other than gwit. The gwit client MAY interpret links in this site using those prefixes as if they began with a single slash (/) instead of the prefix and subsequent slashes. This enables reusing site contents in gwit without needing to adapt local absolute links. Multiple such prefixes may be given, each as a different alt value. Example: https://foo.example.net/bar/ enables rewriting https://foo.example.net/bar//page.html to /page.html.
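
The alt rewriting described in the last item could be sketched as follows. rewrite_alt_link is a hypothetical helper name, and leaving non-matching links untouched is an assumption of this sketch rather than a requirement of the specification.

```sh
# Hypothetical helper: rewrite a link that starts with an alt prefix.
# $1: link found in the site, $2: one alt value from the site configuration.
rewrite_alt_link() {
    link=$1
    prefix=$2
    case $link in
        "$prefix"*)
            # Remove the prefix (quoted, so it is matched literally).
            rest=${link#"$prefix"}
            # Remove the slashes that follow the prefix, if any.
            while [ "${rest#/}" != "$rest" ]; do
                rest=${rest#/}
            done
            # Make the link absolute within the gwit site.
            printf '/%s\n' "$rest" ;;
        *)
            # Not under this prefix: print the link unchanged.
            printf '%s\n' "$link" ;;
    esac
}
```

With the example above, rewrite_alt_link https://foo.example.net/bar//page.html https://foo.example.net/bar/ prints /page.html.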

The scope of the different site configuration values is described below:

  • site.<ID>.name: The value from the head of the site branch, if defined, SHOULD be read on initial site retrieval, then applied to all versions of the site (past and future, until manually overridden).
  • site.<ID>.title(-*), site.<ID>.desc(-*): The value from the head of the site branch, if defined, MAY be applied to previous versions as well, though the value in a specific version, if defined, SHOULD be applied to that version.
  • site.<ID>.license, site.<ID>.root, site.<ID>.index, site.<ID>.alt: The value in a specific version, if defined, SHOULD be applied only to that version.
  • site.<ID>.remote: The value from the head of the site branch, if defined, MAY be applied on initial site retrieval and updates. The handling of old values is at the discretion of the gwit client.

This is a sample .gwit/self.ini file using all sections and values:

[site "0xfedcba98765432100123456789abcdef76543210"]
name = Foo Bar
title = Foo Bar: the Bar for all your Foos
title-fr = Le Bar de Foo : le Bar pour tous vos Foos
desc = "It's the Foo Bar!\n\nFind your best Foos here."
desc-fr = "C'est le Bar de Foo !\n\nTrouvez vos meilleurs Foos ici."
license = CC-BY-4.0
root = output
index = index.gmi
remote = https://git.example.net/foo/bar-site.git
remote = https://lab.example.org/foo-mirror/bar-site.git
remote = https://hut.example.org/foo-mirror/bar-site
alt = https://example.net/~foo/bar/
alt = https://foo.example.net/bar/
alt = gemini://foo.example.net/bar-site/
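
Since the file uses the format of Git configuration files, one possible way to read it is Git's own parser. The sketch below assumes a local copy of the file; the helper names are hypothetical, and only the file size and value count limits are enforced (the per-value byte limits are left out for brevity).

```sh
# Hypothetical helpers that read a site configuration file with Git's parser.
# $1: path to the file, $2: site ID as used in the subsection name,
# $3: value name (e.g. name, index, remote).

# Succeed if the file respects the 65536-byte limit.
config_size_ok() {
    [ "$(( $(wc -c < "$1") ))" -le 65536 ]
}

# Print a "single" value; with several assignments, --get reports the last.
site_value() {
    config_size_ok "$1" || return 1
    git config --file "$1" --no-includes --get "site.$2.$3"
}

# Print a "multiple" value, one occurrence per line (at most 10 allowed).
site_values() {
    config_size_ok "$1" || return 1
    values=$(git config --file "$1" --no-includes --get-all "site.$2.$3") || return 1
    [ "$(( $(printf '%s\n' "$values" | wc -l) ))" -le 10 ] || return 1
    printf '%s\n' "$values"
}
```

Run against the sample file above, site_value self.ini 0xfedcba98765432100123456789abcdef76543210 index prints index.gmi, and site_values with remote prints the three remote URLs in order of appearance. A client working on a bare clone may prefer git config --blob <HEAD-COMMIT>:.gwit/self.ini over --file.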

#Site introductions

A site's .gwit directory may also contain site introductions, which allow the site author to provide the information needed for the retrieval of other gwit sites. This is the main means of content discovery in gwit, thus site authors SHOULD provide such introductions for the sites that they link to.

An introduction for a given site MUST be contained in the file .gwit/<ID>.ini, where <ID> is the identifier of the introduced site, encoded as 0x plus the lower case hexadecimal digits of the full fingerprint of the PGP site key. Example: .gwit/0x0123456789abcdef0123456789abcdeffedcba98.ini.

The format and features of a site introduction file are those of a site configuration file (see further above). For introducing a site with identifier <ID>, the introduction file MUST contain a [site "<ID>"] subsection (the introduction proper), which MUST define at least one site.<ID>.remote value. The site identifier in the file name .gwit/<ID>.ini MUST match that in the file's [site "<ID>"] subsection.

While the value of site.<ID>.remote may be used for retrieving the introduced site, the rest of the values may be considered as mere hints (since there is no guarantee that they come from that site's author), and they SHOULD be overridden by the client with the equivalent values of the actual site configuration file, once available locally.

Also note that a gwit client MAY regard an introduction's site.<ID>.name as this site author's proposed name for that site (its edge name); as such, the client SHOULD allow configuring a petname value that overrides it along with other proposed names for the site.

This is a sample introduction, stored in the .gwit/0x0123456789abcdef0123456789abcdeffedcba98.ini file:

[site "0x0123456789abcdef0123456789abcdeffedcba98"]
name = Someone's site
desc = The site that Someone published while studying at the University.
remote = https://hub.example.com/someone/my-gwit-site.git
remote = https://lab.example.org/s.one/gwit-site.git

#Site retrieval and content authentication

#Initial retrieval

If someone wants to use a client program to retrieve a gwit site for the first time, then the client MUST know:

  • The site identifier, i.e. the site key fingerprint.
  • The location of an accessible copy of the site.

These ID/location pairs may be conveyed to the client via different methods (like person-to-person, search engines, or site directories); however, this specification only covers the discovery mechanism described further above, where each site can provide a number of introductions for other sites with their respective ID and locations. At any rate, the choice among a variety of available locations for the initial retrieval of a particular site is up to the implementation.

To retrieve a site for the first time, given <SITE-ID> as its identifier (a string of hexadecimal digits), <SITE-LOCATION> as its location (the URL of a Git remote), and <SITE-BRANCH> as its branch (derived from <SITE-ID> as described further above), the gwit client MUST clone the Git repository at <SITE-LOCATION> and verify that the head of the site branch is signed by the key matching <SITE-ID>. An implementation may follow the steps below, or some others with equivalent results:

  1. Clone the Git repository from the given location into temporary storage (e.g. git clone --bare --branch <SITE-BRANCH> <SITE-LOCATION> <TEMP-REPO> && cd <TEMP-REPO>).
  2. Get the commit at the head of the site branch as <HEAD-COMMIT> (e.g. git show-ref --verify --hash refs/heads/<SITE-BRANCH>).
  3. Check that .gwit/self.key exists as a file (blob) in <HEAD-COMMIT> (e.g. git ls-tree --format='%(objecttype) %(objectname)' <HEAD-COMMIT> .gwit/self.key reports blob <KEY-FILE-HASH>).
  4. Check that the fingerprint of the primary PGP key in .gwit/self.key is equal to <SITE-ID> (case-insensitively) (e.g. git cat-file blob <KEY-FILE-HASH> | gpg --show-keys --with-fingerprint --with-colons | grep -A1 '^pub:' | grep -qiE '^fpr:+<SITE-ID>:$').
  5. Import .gwit/self.key into the client's keyring (e.g. git cat-file blob <KEY-FILE-HASH> | gpg --homedir <CLIENT-GPG-DIR> --import).
  6. Check that <HEAD-COMMIT> has a valid signature by the key that matches <SITE-ID> (case-insensitively), or by a subkey of it (e.g. env GNUPGHOME=<CLIENT-GPG-DIR> git verify-commit --raw <HEAD-COMMIT> 2>&1 | sed -nE 's/^\[GNUPG:\] VALIDSIG .*\b(\S+)$/\1/p' reports <SITE-ID>).
  7. Save the temporary clone into persistent client storage.

Any error or failed check in the previous steps would cause the process to stop at the current step, discard any temporary data, and report an error.
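
As a variation on the check in step 4, the fingerprint comparison can be done on GnuPG's colon-separated listing, where the fingerprint is the tenth field of an fpr record. The helper names below are hypothetical, and the sketch assumes that the site key is the first primary key in the file.

```sh
# Hypothetical helper: print the fingerprint of the first primary key
# found in `gpg --with-colons` output read from standard input.
primary_fingerprint() {
    awk -F: '
        $1 == "pub" { seen = 1; next }
        seen && $1 == "fpr" { print $10; exit }
    '
}

# Hypothetical helper: succeed only if that fingerprint equals the site
# identifier in $1 (with or without 0x prefix), case-insensitively.
key_matches_site() {
    want=$(printf '%s' "$1" | sed 's/^0[xX]//' | tr 'a-f' 'A-F')
    got=$(primary_fingerprint | tr 'a-f' 'A-F')
    [ -n "$got" ] && [ "$got" = "$want" ]
}
```

It could be used as git cat-file blob <KEY-FILE-HASH> | gpg --show-keys --with-colons | key_matches_site <SITE-ID>, stopping the retrieval when the command fails.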

After the previous steps, the client MAY access the .gwit/self.ini file in the head of the site branch (e.g. git cat-file blob <HEAD-COMMIT>:.gwit/self.ini) and apply any relevant configuration values (see further above).

Note: Example commands using git verify-commit --raw <COMMIT> report the fingerprint of the primary key of the key used to sign the commit. An alternative approach would be to get the signing key (e.g. git show --no-patch --format=format:%GK <COMMIT> as <SIG-KEY>), check that it is (a subkey of) the key that matches <SITE-ID> (e.g. gpg --homedir <CLIENT-GPG-DIR> --list-keys --with-fingerprint --with-colons <SIG-KEY> | grep -A1 '^pub:' | grep -qiE '^fpr:+<SITE-ID>:$'), then just run env GNUPGHOME=<CLIENT-GPG-DIR> git verify-commit <COMMIT>.

#Site updates

If someone wants to retrieve updates to a gwit site identified by <SITE-ID> for which they already have a Git clone in persistent client storage, then the gwit client MUST choose one of its remotes <REMOTE>, fetch new items from it (including site key updates), verify that the new head of the site branch <SITE-BRANCH> (derived from <SITE-ID> as described further above) is signed by the key matching <SITE-ID> and a successor of its current head, and then point the site branch to its new head. An implementation may follow the steps below, or some others with equivalent results:

  1. Get the commit hash of the current head of <SITE-BRANCH> as <OLD-HEAD> (e.g. git show-ref --verify --hash refs/heads/<SITE-BRANCH>).
  2. Try to fetch new objects from <REMOTE> (e.g. git fetch --atomic --no-write-fetch-head <REMOTE> '+refs/heads/*:refs/remotes/<REMOTE>/*'; this preserves all fetch heads for each remote).
  3. Get the commit hash of the new head as <NEW-HEAD> (e.g. git show-ref --verify --hash refs/remotes/<REMOTE>/<SITE-BRANCH>).
  4. Check that <NEW-HEAD> is not an ancestor of the current head (e.g. not git merge-base --is-ancestor <NEW-HEAD> <OLD-HEAD>). If it is, then <REMOTE> does not contain newer content.
  5. Update the site key in the client's keyring (e.g. to allow new signing subkeys) from the .gwit/self.key file in <NEW-HEAD> (e.g. git cat-file blob <NEW-HEAD>:.gwit/self.key | gpg --homedir <CLIENT-GPG-DIR> --import-options merge-only --import).
  6. Check that <NEW-HEAD> has a valid signature by the key that matches <SITE-ID> (case-insensitively), or by a subkey of it (e.g. env GNUPGHOME=<CLIENT-GPG-DIR> git verify-commit --raw <NEW-HEAD> 2>&1 | sed -nE 's/^\[GNUPG:\] VALIDSIG .*\b(\S+)$/\1/p' reports <SITE-ID>).
  7. If the current head is not an ancestor of <NEW-HEAD> (e.g. not git merge-base --is-ancestor <OLD-HEAD> <NEW-HEAD>), then <REMOTE> contains a site history rewrite. This scenario is supported by the specification, and this step may or may not succeed depending on different conditions (see further below).
  8. Update the head of <SITE-BRANCH> in the clone to <NEW-HEAD> (e.g. git update-ref refs/heads/<SITE-BRANCH> <NEW-HEAD>).

Any error or failed check in the previous steps would cause the process to stop at the current step, discard any temporary data, and report an error. If the Git clone includes additional remotes, the client MAY choose to repeat the procedure with another one in case of error, or to look for newer content.
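
The two ancestry checks (steps 4 and 7) determine which kind of update a remote offers. The sketch below combines them; classify_update and the three words it prints are hypothetical names, and it must run inside the site's clone.

```sh
# Hypothetical helper: classify an update as seen from the local clone.
# $1: <OLD-HEAD>, $2: <NEW-HEAD>
classify_update() {
    # Step 4: is the new head already part of the current history?
    st=0
    git merge-base --is-ancestor "$2" "$1" || st=$?
    case $st in
        0) echo up-to-date; return 0 ;;
        1) ;;                      # not an ancestor, keep checking
        *) return "$st" ;;         # Git error (e.g. unknown commit)
    esac
    # Step 7: is the current head part of the new history?
    st=0
    git merge-base --is-ancestor "$1" "$2" || st=$?
    case $st in
        0) echo fast-forward ;;    # regular update
        1) echo rewrite ;;         # site history rewrite
        *) return "$st" ;;
    esac
}
```

Exit statuses other than 0 and 1 from git merge-base are treated as errors, so that a missing commit is not mistaken for a history rewrite.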

After the previous steps, the client MAY access the .gwit/self.ini file in the <NEW-HEAD> commit (e.g. git cat-file blob <NEW-HEAD>:.gwit/self.ini) and apply any relevant configuration values (see further above). In particular, a change in site.<SITE-ID>.remote MAY trigger another update with the new value (e.g. after git remote set-url origin <NEW-REMOTE>).

#Site history rewrites

When an author updates a gwit site with new Git commits, these are added to the site's Git repository while keeping existing commits intact. This means that content modified or removed by site updates is still available from previous commits to those who retrieve the site. However, the author may have legitimate reasons to remove some content from previous versions of the site and to ask others not to propagate that content (e.g. to comply with some law or to avoid the diffusion of sensitive information included by accident).

gwit offers some support for this use case, based on good faith and good citizenship (as technical enforcement would add much complexity for dubious gain). The site author may publish a rewritten site history (e.g. using a forced Git push) as an alternate set of Git commits that avoid the undesired content, while still carrying valid signatures for that site. While retrieving the site anew from that repository or some clone of it will only yield the new history, updating a local copy which already contains the old history will offer the option to accept or decline the new history.

As observed in the previous section, a gwit client retrieving updates for a site branch may detect that its current head is not an ancestor of the remote branch's new head. In that case, the client SHOULD allow choosing whether to accept the remote's commit as the new branch head (and thus the history rewrite), or to discard it and keep the current one. Furthermore, in the first case the client SHOULD offer the option to clean up no longer reachable objects from the clone to also remove the undesired commits from client storage (e.g. with git gc, once the site branch has been updated).

Although site history rewrites (and subsequent cleanups) should be accepted in the general case as a deference to site authors, there may be legitimate reasons not to do so (e.g. for archival or investigative purposes). Besides, rewrites have other issues: they can break permanent links (see below) and workflows which depend on tracking updates to a site; also, comparing old and new history after a rewrite can help reveal the undesired data. Acceptable use of a site's history is out of the scope of this specification and up to the communities using gwit sites.

#Security considerations

  • As a general protection measure, a gwit client SHOULD retrieve content from other clones using the mechanisms described above (instead of copying their content straight into its own storage), as they may contain malicious hooks, tags, branches and others.

  • OpenPGP implementations like GnuPG require that keys be imported into a keyring before using them to verify signatures, which means that .gwit/self.key must be imported before verifying its own authenticity on initial site retrieval and updates. A gwit client MAY perform extra verifications on .gwit/self.key (e.g. with gpg --show-keys) before importing it, or it MAY set a temporary keyring (e.g. via GnuPG's GNUPGHOME environment variable) to import .gwit/self.key and verify commit signatures (initial retrieval steps 5-6 and site update steps 5-6), then import .gwit/self.key again into the client's keyring if the verification succeeded.

  • Depending on the implementation of Git, some operations expecting a commit or object name (hash) may instead act upon a tag or branch with the same name. This behavior may allow certain attacks, e.g. the site author may craft signed tags to avoid history rewrite detection in a client when retrieving site updates, or to trick a client into importing a .gwit/self.key file in a commit different from the head of the site branch on initial site retrieval; other attackers may insert unsigned tags or branches in their public clones that cause errors in clients using them as remotes.

    As a way to fend off these attacks, clients SHOULD warn about and remove Git tags and branches with names matching the format of hashes used by the repository (either 40 or 64 hexadecimal digits for SHA-1 or SHA-256, lower or upper case) right after cloning it (initial retrieval step 1) or fetching new objects (site update step 2), as those tags and branches are certainly malicious.
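
Detecting such names can be as simple as a filter over the list of references. In this sketch, hashlike_refs is a hypothetical helper name, and matching on the last component of the full reference name is a deliberately broad assumption.

```sh
# Hypothetical helper: read full ref names from standard input and print
# those whose last component has the format of a SHA-1 or SHA-256 hash.
hashlike_refs() {
    grep -E '(^|/)([0-9A-Fa-f]{40}|[0-9A-Fa-f]{64})$'
}
```

A client might run git for-each-ref --format='%(refname)' | hashlike_refs after cloning or fetching, warn about each reported reference, and delete it (e.g. with git update-ref -d <REF>).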

#URI format

A gwit URI refers to a target file or directory in a given gwit site. It follows the syntax specified in RFC3986 (except for an ad-hoc authority part) and has the general format

gwit://[<VERSION>@]<SITE><PATH>[#<FRAGMENT>]

with parts in square brackets being optional, and where

  • <SITE> indicates the target gwit site. It is the site identifier, encoded as a string of hexadecimal digits (case-insensitive) prefixed with 0x or 0X. Shortened variants of site key fingerprints (as accepted as key identifiers by some PGP implementations) MUST NOT be allowed, as they would weaken site authentication and open up attack vectors (esp. on initial retrieval).

    Links found inside of a gwit site may also use the string self (case-insensitive) for <SITE>, which allows the site to easily link to a particular version of itself (i.e. gwit://<VERSION>@self<PATH>…). When parsing such a URI, a gwit client MUST first replace self with the site identifier as described above. A URI using self MUST NOT be allowed outside of a site, and a gwit client SHOULD replace self with the site identifier when exporting it (e.g. when copying the URI to the clipboard).

  • <VERSION> specifies a particular version of the target site. It MUST be percent-encoded if it contains reserved characters as per Sections 2.1 and 2.2 of RFC3986.

    When missing or empty, the URI refers to the most recent site version which is known to a client when it accesses the site (i.e. the head of the site branch in the client's Git clone of the site).

    <VERSION> may be the object name (hash) of a Git commit in the site's history, encoded as a string of hexadecimal digits (case-insensitive), in which case the URI refers to that commit. The name may be shortened by removing characters from its end, but this may cause content retrieval to fail if the client's Git clone of the site contains several commits with that same shortened name.

    <VERSION> may also be a Git tag, branch or other revision pointing to a commit in the client's Git clone of the site, in which case the URI refers to that commit.

    A tag signed by the site key may be used to succinctly convey a relevant point in the history of a site (like a release name). In contrast, branches and other tags, as well as other Git revision expressions cannot be authenticated and their names may vary between clients, thus URIs using them SHOULD be regarded as unsafe in the general case. gwit clients MAY still support them as they can be useful for site debugging or authoring.

  • <PATH> is the absolute path of a file or directory in the site version referenced by the previous parts of the URI. The root of the path is the site's root directory (as per site configuration) in the Git commit corresponding to the desired site version, so the path maps to a Git blob object or tree object reachable from it.

    Components in <PATH> are separated by a forward slash (/), with the root directory being the empty string. A series of contiguous forward slashes is equivalent to a single slash. When <PATH> refers to a directory it is RECOMMENDED to append a forward slash (to avoid issues with relative links). Thus both an empty <PATH> and a <PATH> consisting of a single forward slash refer to the root directory.

    Except for the root directory, a component in <PATH> is a non-empty sequence of bytes that map directly to the bytes of its associated path name in a Git tree object. Bytes in the component corresponding to non-ASCII characters or reserved URI characters MUST be percent-encoded (as per Sections 2.1 and 2.2 of RFC3986). For instance, the Git path name foo+f\xFCr+bar would be encoded as foo%2Bf%FCr%2Bbar in a gwit URI. Clients SHOULD NOT further encode or decode the byte sequences with other encodings (like UTF-8 to get a Unicode string).

  • <FRAGMENT>, when present and not empty, indicates some secondary resource inside of the primary resource referenced by the rest of the URI. Its interpretation is up to the gwit client, agreeing with standards applicable to the particular media type.
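
The general format can be split into its parts with plain POSIX parameter expansion. The following is a rough sketch, not a complete RFC3986 parser: parse_gwit_uri is a hypothetical helper name, percent-decoding is not performed, and requiring at least 40 hexadecimal digits for a site identifier is an assumption based on the usual fingerprint length.

```sh
# Hypothetical helper: split a gwit URI and print its parts, one per line.
parse_gwit_uri() {
    case $1 in
        [gG][wW][iI][tT]://*) rest=${1#*://} ;;
        *) return 1 ;;
    esac
    # Split off the optional fragment at the first "#".
    case $rest in
        *'#'*) fragment=${rest#*'#'}; rest=${rest%%'#'*} ;;
        *) fragment= ;;
    esac
    # The authority part ends at the first slash; the rest is the path.
    authority=${rest%%/*}
    path=${rest#"$authority"}
    # Split the optional version from the site at the last "@".
    case $authority in
        *@*) version=${authority%@*}; site=${authority##*@} ;;
        *) version=; site=$authority ;;
    esac
    # Accept "self" or 0x plus a full fingerprint; reject shortened IDs.
    case $site in
        [sS][eE][lL][fF]) ;;
        0[xX]*)
            hex=${site#0[xX]}
            case $hex in
                *[!0-9A-Fa-f]*) return 1 ;;
            esac
            [ "${#hex}" -ge 40 ] || return 1 ;;
        *) return 1 ;;
    esac
    printf 'version=%s\nsite=%s\npath=%s\nfragment=%s\n' \
        "$version" "$site" "$path" "$fragment"
}
```

A client would still need to replace self with the identifier of the containing site and to reject it outside of a site, as described above.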

A link consisting of a URI with both a site identifier and a full commit hash as used in the particular Git repository (either 40 or 64 hexadecimal digits for SHA-1 or SHA-256, lower or upper case) is called a permanent link, as it uniquely identifies an exact, immutable object in a site regardless of subsequent updates to it. These links may be preferred for certain applications like long-term archival or citation. However, a site history rewrite (see further above) may render such a link unavailable to others, so applications relying on permanent links may want to handle rewrites in a special manner.

Some URI examples:

  • gwit://0x0123456789abcdef0123456789abcdeffedcba98/ links to the root directory of the most recent known site version.
  • gwit://0x0123456789abcdef0123456789abcdeffedcba98/posts.html#latest links to the HTML element with ID latest in the file posts.html of the most recent known site version.
  • gwit://9c359d88d4882d17d673a7fb89c9af8349a4fb7c@0x0123456789abcdef0123456789abcdeffedcba98/breaking-news.gmi is a permanent link to the file breaking-news.gmi of version (Git commit) 9c359d88d4882d17d673a7fb89c9af8349a4fb7c of the site (whose repository uses SHA-1 hashes).
  • gwit://9c359d88@0x0123456789abcdef0123456789abcdeffedcba98/tag/cats/ links to the directory tag/cats in the same site version as above (in shortened notation, thus not a permanent link).
  • gwit://v1.0@0x0123456789abcdef0123456789abcdeffedcba98/NEWS.txt, with v1.0 being a Git tag signed by the site key, links to the file NEWS.txt in the commit pointed to by that tag.
  • gwit://v1.0@self/NEWS.txt (or //v1.0@self/NEWS.txt) is the same link as above, but only when found inside of that same site.
  • gwit://my-colleague%2fprototype@0x0123456789abcdef0123456789abcdeffedcba98/new-stuff.gmi, with my-colleague/prototype being some remote-tracking Git branch of the site known to the gwit client, links to the file new-stuff.gmi in the head of that branch.
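
As a rough illustration of how the examples above break down into components, the following Python sketch splits an absolute gwit URI and tests whether it is a permanent link. The names `GwitURI`, `parse_gwit_uri` and `is_permanent` are hypothetical, the site part is not validated, and treating `self` as not being a site identifier for permanent links is an assumption of this sketch.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit

# 40 hexadecimal digits (SHA-1) or 64 (SHA-256), lower or upper case.
FULL_HASH = re.compile(r"[0-9a-fA-F]{40}|[0-9a-fA-F]{64}")


class GwitURI(NamedTuple):
    version: Optional[str]  # percent-decoded; None when there is no "@"
    site: str               # as written, e.g. "0x0123..." or "self"
    path: str               # still percent-encoded
    fragment: str


def parse_gwit_uri(uri: str) -> GwitURI:
    """Split an absolute gwit URI into its parts (no site validation)."""
    parts = urlsplit(uri)
    if parts.scheme != "gwit":
        raise ValueError("not an absolute gwit URI: %r" % uri)
    version, at, site = parts.netloc.rpartition("@")
    return GwitURI(unquote(version) if at else None, site,
                   parts.path, parts.fragment)


def is_permanent(uri: GwitURI) -> bool:
    """True for a full commit hash combined with an explicit site ID."""
    # Assumption: "self" does not count as a site identifier here.
    return (uri.version is not None
            and FULL_HASH.fullmatch(uri.version) is not None
            and uri.site.lower().startswith("0x"))
```

With this sketch, the third example above is reported as permanent, while the shortened `9c359d88` form and the `v1.0` tag form are not. Scheme-relative references such as `//v1.0@self/NEWS.txt` are rejected by `parse_gwit_uri` and would need to be resolved against a base URI first.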

Resolving references (i.e. links) relative to a given base gwit URI works as specified in Section 5 of RFC3986, for example:

gwit://<SITE>/foo/bar + baz (or ./baz) => gwit://<SITE>/foo/baz
gwit://<SITE>/foo/bar/ + baz (or ./baz) => gwit://<SITE>/foo/bar/baz
gwit://<SITE>/foo/bar + ../baz => gwit://<SITE>/baz
gwit://<SITE>/foo/bar/ + ../baz => gwit://<SITE>/foo/baz
gwit://<SITE>/foo/bar + /baz => gwit://<SITE>/baz
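
The resolutions listed above can be reproduced with Python's standard library, as in the sketch below. Note the assumption involved: `urljoin` only resolves relative references for schemes it knows, so the sketch registers `gwit` in the module-level lists `uses_relative` and `uses_netloc`, which are not part of the documented API. A client may prefer to implement Section 5.2 of RFC3986 directly. The `resolve` helper and the `SITE` constant are illustrative names.

```python
import urllib.parse

# Assumption: make urljoin() treat "gwit" like a hierarchical scheme by
# registering it in urllib's (undocumented) scheme lists.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gwit" not in registry:
        registry.append("gwit")

# Example site ID, taken from the URI examples in this section.
SITE = "0x0123456789abcdef0123456789abcdeffedcba98"


def resolve(base: str, reference: str) -> str:
    """Resolve a reference against a base gwit URI (RFC 3986, Section 5)."""
    return urllib.parse.urljoin(base, reference)


for base_path, ref in [("/foo/bar", "baz"), ("/foo/bar/", "baz"),
                       ("/foo/bar", "../baz"), ("/foo/bar/", "../baz"),
                       ("/foo/bar", "/baz")]:
    print(base_path, "+", ref, "=>", resolve("gwit://" + SITE + base_path, ref))
```

The same helper also turns a scheme-relative reference like `//v1.0@self/NEWS.txt` into `gwit://v1.0@self/NEWS.txt` when resolved against any gwit base URI.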

As site identifiers can make gwit URIs quite long, authors may take advantage of such relative links to shorten internal site references (instead of using absolute URIs); relative links may also make site content more portable between publication systems.

Normalizing and comparing gwit URIs also works as specified in Section 6 of RFC3986. For comparison purposes, a shortened commit hash is only considered equal to itself (case-insensitively), that is:

gwit://abcdef01@<SITE>/foo == gwit://ABCDEF01@<SITE>/foo
gwit://abcdef01@<SITE>/foo != gwit://abcdef012345@<SITE>/foo
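
A minimal sketch of such a comparison follows, assuming absolute gwit URIs as input. It only applies case normalization: the scheme, the site part and all-hexadecimal versions are lowercased, while named versions, paths and fragments are compared as written. Lowercasing the site part is an assumption based on site IDs being matched case-insensitively elsewhere in this document, and the percent-encoding and dot-segment normalization steps of Section 6 of RFC3986 are deliberately left out. The names `comparison_key` and `same_gwit_uri` are hypothetical.

```python
import re
from urllib.parse import urlsplit

HEX = re.compile(r"[0-9a-fA-F]+")


def comparison_key(uri: str) -> tuple:
    """Build a key so that equivalent gwit URIs compare equal (partial)."""
    parts = urlsplit(uri)
    version, _, site = parts.netloc.rpartition("@")
    if HEX.fullmatch(version):
        # Commit hashes, shortened or not, are case-insensitive; a shortened
        # hash is otherwise kept as is, so it only ever equals itself.
        version = version.lower()
    # Assumption: the site part is compared case-insensitively.
    return (parts.scheme.lower(), version, site.lower(),
            parts.path, parts.fragment)


def same_gwit_uri(a: str, b: str) -> bool:
    return comparison_key(a) == comparison_key(b)
```

Because the shortened hash is never expanded, `abcdef01` and `abcdef012345` yield different keys even if they happen to name the same commit in a given repository, which matches the second example above.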

A gwit client MAY make a gwit URI more readable to humans by showing a petname associated with the URI's site ID. If possible, such a client SHOULD show the chosen petname in some kind of transient UI widget (like a tooltip or status bar message when hovering over a link), or when rendering the containing document (e.g. alongside its title or URI in a link list). If that is not possible, it MAY alter the produced document to add petnames to link titles. It MUST NOT alter the links themselves in the document, as that may break its parsing.

A client MAY display a petname-decorated view of a gwit URI. Such representation MUST NOT be exchanged or exported outside of an application, as it may not make sense to other people (e.g. copying the URI to the clipboard should still provide the original URI). Moreover, the client SHOULD make such representation visually distinct from a plain URI to avoid confusion (e.g. by emphasizing petnames in some manner).

As mentioned further above, a gwit client may learn the self-proposed name of a site from its configuration file, as well as the edge names of introduced sites. In that case, it should also allow the user to set a different petname for any such site.

For instance, Alice retrieves Bob's site (with ID <BOB-ID>) for the first time using her gwit client. That site's .gwit/self.ini file sets Bob's site as the value of site.<BOB-ID>.name; the site also contains an introduction of Carol's site (with ID <CAROL-ID>) having This is Carol as the value of site.<CAROL-ID>.name.

Alice's gwit client follows the petname implementation hints described in the paper Implementation of a petnames system in an existing chat application. Thus, when the client finds a link to

  • gwit://abcdef@<BOB-ID>/foo/bar it shows it as
  • abcdef@⁣┊?⁣Bob's site⁣┊⁣/foo/bar

When it finds a link to

  • gwit://<CAROL-ID>/test/page it shows it as
  • ┊☞⁣Bob's site⁣⇒⁣This is Carol⁣┊⁣/test/page

Alice sets the shorter petname "Bob" for Bob's site, so that the previous links respectively show as

  • │⁣Bob⁣│⁣/foo/bar
  • │☞⁣Bob⁣⇒⁣This is Carol⁣│⁣/test/page

Alice eventually sets the petname "Carol's blog" for Carol's site, thus the latter link shows as

  • │⁣Carol's blog⁣│⁣/test/page

#URI retrieval

Let gwit://[<VERSION>@]<SITE><PATH> be the URI which identifies a particular file or directory in a gwit site. A gwit client that is to retrieve that resource MUST be able to parse site configuration files (see further above). It MUST first obtain the site identifier <SITE-ID> by removing 0x or 0X from the beginning of <SITE>. There MUST be a Git clone of the site with that identifier in persistent client storage; to that end, the client MUST follow the procedures for initial site retrieval and updates described further above; Git operations described below will operate on that clone.

The client MUST then establish which Git commit <COMMIT> to use, according to the <VERSION> in the URI (which has already been percent-decoded if necessary), by following the first of the steps below whose condition applies:

  1. If <VERSION> is missing or empty, get the commit hash of the head of the site branch <SITE-BRANCH> (derived from <SITE-ID> as described further above) as <COMMIT> (e.g. git show-ref --verify --hash refs/heads/<SITE-BRANCH>).

  2. Else, if <VERSION> matches the format of hashes used by the repository (either 40 or 64 hexadecimal digits for SHA-1 or SHA-256, lower or upper case), use it as <COMMIT>. This is the case for a permanent link.

  3. Else, if <VERSION> consists only of hexadecimal digits (lower or upper case), check that it is the prefix of a single commit object, and use its complete name as <COMMIT> (e.g. git rev-parse --disambiguate=<VERSION> only reports a single <COMMIT>).

    Note: Before resolving the commit name, the client MAY check for Git tags or branches named after <VERSION> (e.g. git show-ref --tags --heads <VERSION>), and warn about any such reference. This may hint at a potential attacker trying to use such a named reference in their public clone to confuse other gwit clients which try to access a URI with that abbreviated commit name as <VERSION>, and tricking them into accessing a different commit. If a legitimate tag or branch needs to be used whose name consists of hexadecimal characters (e.g. the cafe tag), one may use a Git ref namespace prefix (e.g. tags%2fcafe).

  4. Else, if <VERSION> refers to a signed Git tag (e.g. git cat-file -t <VERSION> reports tag) which has a valid signature by the key that matches <SITE-ID> (case-insensitively), or by a subkey of it (e.g. git verify-tag --raw <VERSION> 2>&1 | sed -nE 's/^\[GNUPG:\] VALIDSIG .*\b(\S+)$/\1/p' reports <SITE-ID>), then check that the name in the tag object matches the tag name in <VERSION> and that it does refer to a commit, then get the name of that commit as <COMMIT> (e.g. git rev-parse --abbrev-ref <VERSION> reports <TAG>, and git tag -l --format='%(objecttype) %(tag) %(*objecttype) %(*objectname)' <TAG> reports tag <TAG> commit <COMMIT>).

    Note: The check for the name in the Git tag prevents an attacker from using a reference in their public clone with a name that tricks another gwit client into believing that it is accessing that signed tag, when in fact a different one (though still existing and valid) is being accessed (e.g. by making v1.0 refer to a valid signed tag object containing the name v0.9). Should the check fail, the client SHOULD report the situation as a potential attack (e.g. to help neutralize the problematic references or remotes).

  5. Else, the client MAY interpret <VERSION> as an arbitrary Git revision pointing to a commit. It SHOULD show a warning about the URI being potentially unsafe, check that the revision refers to a commit object, and use its name as <COMMIT> (e.g. git rev-parse --verify --end-of-options '<VERSION>^{commit}').

  6. Else fail.
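
The first three steps above depend only on the syntax of <VERSION>, so a client can decide which of them applies before invoking Git at all. The following sketch shows that dispatch. The function name, the returned labels and the `hash_len` parameter (40 for a SHA-1 repository, 64 for SHA-256) are hypothetical, and steps 4 and 5 are lumped together since telling them apart requires inspecting the repository.

```python
import re

HEX = re.compile(r"[0-9a-fA-F]+")


def classify_version(version, hash_len=40):
    """Name the first resolution step whose syntactic condition applies.

    `version` is the percent-decoded <VERSION>, or None when it is missing.
    `hash_len` is the hash length used by the repository (assumed known).
    """
    if not version:
        return "site-branch-head"   # step 1: missing or empty
    if HEX.fullmatch(version):
        if len(version) == hash_len:
            return "full-hash"      # step 2: permanent link
        return "hex-prefix"         # step 3: must match a single commit
    return "named-revision"         # steps 4 and 5: needs Git to decide
```

For instance, `classify_version("cafe")` yields `hex-prefix`, while `classify_version("tags/cafe")` yields `named-revision`. This is why the ref namespace prefix mentioned in the note to step 3 lets a tag with a hexadecimal name be addressed.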

Once the client has established the value of <COMMIT>, it MUST check that <COMMIT> is an ancestor of the head of the site branch (e.g. git merge-base --is-ancestor <COMMIT> <SITE-BRANCH>). Any error or failed check in the previous steps would cause the process to stop at the current step, discard any temporary data, and report an error.

The client MUST then resolve the path <PATH> in the URI (which has already been percent-decoded if necessary) to a file or directory in the Git tree associated with the commit <COMMIT>, by following the steps below, so as to produce some output:

  1. If .gwit/self.ini exists as a file (blob) in the desired commit <COMMIT> (e.g. git ls-tree --format='%(objecttype) %(objectname)' <COMMIT> .gwit/self.ini succeeds and reports blob <CONF-FILE-HASH>), then parse it (e.g. git cat-file blob <CONF-FILE-HASH> | git config -f- …). If it does not exist, treat site configuration as empty for the next steps.

  2. Compute <RELPATH> by replacing repetitions of the forward slash (/) in <PATH> by a single slash, then removing leading and trailing slashes, then removing dot segments according to the remove_dot_segments algorithm described in Section 5.2.4 of RFC3986 (e.g. /foo//../bar/ becomes bar).

    The resulting <RELPATH> is relative to the site's root directory <ROOT> (as per site configuration) and either empty (meaning <ROOT> itself), or it consists of one or more non-empty path components separated by a single slash (for other files or directories).

  3. Check that <ROOT>/<RELPATH> exists in the commit tree, that it resolves (via any symbolic links) to a <TARGET> path also within the tree, and get its type (e.g. echo '<COMMIT>:<ROOT>/<RELPATH>' | git cat-file --batch-check='%(objecttype) %(objectname)' --follow-symlinks reports <TARGET-TYPE> <TARGET-HASH>).

  4. If <TARGET> refers to a file (e.g. <TARGET-TYPE> is blob), then produce its contents (e.g. git cat-file blob <TARGET-HASH>).

    Else, if <TARGET> refers to a directory (e.g. <TARGET-TYPE> is tree), the client SHOULD test if the site configuration defines an index file <INDEX>; if it does, and <TARGET>/<INDEX> resolves to a file (blob) in the commit tree (e.g. echo '<TARGET-HASH>:<INDEX>' | git cat-file --batch-check='%(objecttype) %(objectname)' --follow-symlinks reports blob <INDEX-HASH>), then produce its contents (e.g. git cat-file blob <INDEX-HASH>); if the client does not allow index files, or the index file is undefined, missing or unreadable, then the client SHOULD produce some form of directory listing for the entries in <TARGET> (e.g. from git ls-tree <TARGET-HASH>).

    Else fail.

Any error or failed check in the previous steps would cause the process to stop at the current step, discard any temporary data, and report an error.
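
The computation of <RELPATH> in step 2 can be sketched as below, operating on the percent-decoded bytes of <PATH>. This is an approximation under stated assumptions: dot segments are removed with a simple segment stack instead of a literal transcription of the remove_dot_segments algorithm, which gives the result shown in the example of step 2. The name `compute_relpath` is hypothetical.

```python
import re


def compute_relpath(path: bytes) -> bytes:
    """Normalize a percent-decoded <PATH> into <RELPATH> (sketch)."""
    collapsed = re.sub(rb"/+", b"/", path)  # repeated slashes become one
    trimmed = collapsed.strip(b"/")         # no leading or trailing slash
    kept = []
    for segment in trimmed.split(b"/"):
        if segment in (b"", b"."):
            continue                        # empty or "." segment: skip
        if segment == b"..":
            if kept:
                kept.pop()                  # ".." drops the last segment,
            continue                        # but never climbs above the root
        kept.append(segment)
    return b"/".join(kept)


print(compute_relpath(b"/foo//../bar/"))    # b'bar'
```

An empty result stands for the root directory itself. Since `..` segments are resolved before any lookup in the Git tree, the resulting path cannot point outside of the site's root directory by itself (symbolic links are handled separately in step 3).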

When producing or displaying contents on URI retrieval, the gwit client MAY make use of any site configuration value which applies to the chosen version. For instance, it may show the site title (from site.<SITE-ID>.title) or replace site URI prefixes for other publication systems in links (as per site.<SITE-ID>.alt).

Note: Since Git commits are immutable and ancestry checking (e.g. the invocation of git merge-base --is-ancestor) may be an expensive operation, a gwit client MAY keep a cache of commits for which it has already verified that they are ancestors of the current head of the site branch (until it is updated).
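
One possible shape for such a cache is sketched below. It remembers the commits already verified against a given head and forgets them all as soon as a different head is seen. The class `AncestryCache` and the helper `git_is_ancestor` are hypothetical names, the cache expects full commit hashes in a consistent case, and it keeps its state in memory only.

```python
import subprocess
from typing import Callable, Optional, Set


def git_is_ancestor(commit: str, head: str, repo: str = ".") -> bool:
    """Run the ancestry check in the clone at `repo` (path is an assumption)."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, head])
    if result.returncode in (0, 1):  # 0: ancestor, 1: not an ancestor
        return result.returncode == 0
    raise RuntimeError("git merge-base failed")


class AncestryCache:
    """Remember commits already verified as ancestors of the current head."""

    def __init__(self, is_ancestor: Callable[[str, str], bool]):
        self._is_ancestor = is_ancestor
        self._head: Optional[str] = None
        self._verified: Set[str] = set()

    def check(self, commit: str, head: str) -> bool:
        if head != self._head:       # site branch was updated: start over
            self._head = head
            self._verified.clear()
        if commit in self._verified:
            return True
        if self._is_ancestor(commit, head):
            self._verified.add(commit)
            return True
        return False                 # negative results are not cached
```

A client could create one cache per site with `AncestryCache(git_is_ancestor)`, or pass a function bound to the right clone. Taking the checker as a parameter also allows the cache logic to be exercised without a Git repository.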

#Appendix: Enabling discovery of combined sites via Well-Known URIs

One of gwit's goals is to make existing Web or Gemini static sites easy to publish in parallel as gwit sites. This may be as simple as distributing site files in a Git repository, along with .gwit/self.key and .gwit/self.ini files, and using the key in .gwit/self.key to sign commits.

For a more seamless integration, it should be possible to use the other protocols supported by such a combined site to both identify it as such and get the information needed to then access it over gwit. This information may be found in the files in the .gwit directory. However, since that directory is always found in the Git repository's top directory, if the site is configured in the other protocol to use some subdirectory <SITE-ROOT> as a root, then those files may not be available via the other protocol's URIs.

A Well-Known URI (RFC8615) MAY be used to provide such site metadata, accessible via the other protocol's /.well-known/gwit.ini URI path, mapping to the repository file <SITE-ROOT>/.well-known/gwit.ini. The format and features of this file are those of a site introduction file (see further above), where the site introduces itself. The file MUST contain exactly one [site "<ID>"] subsection. As with any introduction, the only truly relevant pieces of information are the site ID and the value(s) of site.<ID>.remote (e.g. git config -f … --get-regexp '^site\.0x[0-9a-f]+\.remote$').

An example of such a file follows:

[site "0xfedcba98765432100123456789abcdef76543210"]
remote = https://git.example.net/foo/bar-site.git
remote = https://lab.example.org/foo-mirror/bar-site.git

Since the same values of site.<ID>.remote may also appear in a site's configuration file .gwit/self.ini, a site author may make <SITE-ROOT>/.well-known/gwit.ini a relative symbolic link to the former to avoid duplicating information across both files.
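
As an illustration of how a client might read such a file once it has been fetched over the other protocol, the sketch below extracts the site ID and its remotes and enforces the single-subsection rule. It is a deliberately limited stand-in for a real Git configuration parser: it only handles the subset shown in the example above (no quoting, escapes or line continuations), and the name `parse_self_introduction` is hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

SECTION = re.compile(r'\[site\s+"([^"]+)"\]')
ENTRY = re.compile(r"([A-Za-z][A-Za-z0-9-]*)\s*=\s*(.*)")


def parse_self_introduction(text: str) -> Tuple[str, List[str]]:
    """Return (site ID, remotes) from the text of a gwit.ini file (sketch)."""
    sites: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#;":
            continue                      # blank line or comment
        section = SECTION.fullmatch(line)
        if section:
            current = section.group(1)
            sites.setdefault(current, [])
            continue
        if line.startswith("["):
            current = None                # some other section: ignored
            continue
        entry = ENTRY.fullmatch(line)
        if entry and current is not None and entry.group(1).lower() == "remote":
            sites[current].append(entry.group(2))
    if len(sites) != 1:
        raise ValueError("expected exactly one [site] subsection")
    (site_id, remotes), = sites.items()
    return site_id, remotes
```

Applied to the example file, this returns the ID `0xfedcba98765432100123456789abcdef76543210` together with both remote URLs in file order. A file with no subsection, or with more than one, is rejected.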