~ivilata/gwit-spec

f75a37d607230fb5e82e1a52b7040e3a85c53e54 — Ivan Vilata-i-Balaguer 1 year, 4 months ago 2993f5d
Use a fixed site branch name format, derived from the site key.

This breaking change is justified by the increased complexity and potential
fragility introduced by having to cope with arbitrary site branch names, a
pseudo-feature which has dubious value to the system.

This was a *very* tough change since it may make gwit's use relatively more
inconvenient for authors, but it is nonetheless aligned with gwit's
architectural goals.  Please read the associated issue for more details.

Close the associated issue `avoid-attack-on-bad-branch-name`.
2 files changed, 32 insertions(+), 25 deletions(-)

M README.md
M issues/avoid-attack-on-bad-branch-name.gmi
M README.md => README.md +22 -25
@@ 72,6 72,10 @@ Since gwit is based on Git, a gwit site is made up of *static files and director

## Site requirements

+The site branch associated with a gwit site MUST be named `gwit-0x........`, where the dots represent the lower case short ID of the PGP site key, i.e. the last eight hexadecimal digits of the site identifier in lower case. Example: `gwit-0xfedcba98`.
+
+This means that the same Git repository may hold different but related gwit sites, each one in its own branch and with its own key. For instance, while the `master` or `main` branch may contain common sources for a static site generator, generated Gemini and Web files may go to separate gwit site branches.
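
A minimal sketch of how a client might derive the site branch name from the site identifier, following the naming rule above. The function name `site_branch_name` is hypothetical, and tolerating an optional `0x` prefix and upper case digits on input is an assumption made for convenience, not something the specification requires.

```python
import string


def site_branch_name(site_id: str) -> str:
    """Return the fixed site branch name for a site identifier (key fingerprint)."""
    digits = site_id.strip()
    # Assumption: accept an optional "0x" prefix on the identifier.
    if digits[:2].lower() == "0x":
        digits = digits[2:]
    if len(digits) < 8 or any(c not in string.hexdigits for c in digits):
        raise ValueError(f"not a usable site identifier: {site_id!r}")
    # The PGP short ID is the last eight hexadecimal digits, in lower case.
    return "gwit-0x" + digits[-8:].lower()


print(site_branch_name("0x0123456789ABCDEF0123456789ABCDEFFEDCBA98"))
```

Since the name is a pure function of the identifier, a client no longer needs to store or transmit a branch name alongside the site identifier and location.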

The different versions of a gwit site that constitute its history are Git commits in the site branch. To use one such commit as a valid site version, its top directory MUST include a `_gwit` directory (underscore `gwit`), which in turn:

- MUST include a `self.key` file containing the site key (and any signing subkeys) in OpenPGP public key format, ASCII-armored or not (e.g. the output of `gpg --export [--armor] <SITE-KEY>`). Although the primary key itself SHOULD NOT change, subsequent updates to `_gwit/self.key` MAY add new subkeys, identities, signatures, revocations and other metadata.


@@ 105,7 109,6 @@ The `site.<ID>` section of `_gwit/self.ini` contains some basic information abou
- `root` (single, optional): A directory to be used as the site's **root directory** instead of, and relative to, the commit's top directory. If missing, it defaults to that top directory. It MUST consist of one or more non-empty path components separated by a single forward slash (`/`). It MUST NOT contain `.` or `..` path components. Convenient when using a static site generator that writes its output to a directory. Example (for a site containing Gemini files): `output`.
- `index` (single, optional): The name of the **index file**. It MUST NOT be empty, `.` or `..`, or contain slash characters (`/`). When a gwit client is told to retrieve a directory, and it contains a file named as the index file, the contents of the file SHOULD be produced instead of a directory listing. Example (for a site containing Gemini files): `index.gmi`.
- `remote` (multiple, recommended): A location recommended by the author for retrieving the site, the URL of a Git remote. Multiple such locations may be given (for increased availability), each as a different `remote` value, which a client MAY consider in order of appearance. Example: `https://git.example.net/foo/bar-site.git`.
-- `branch` (single, recommended): The name of the Git repository branch to be used as the site branch. This may allow the branch to contain files output by a static site generator, while `remote`'s default branch (usually `master` or `main`) only contains sources. It takes precedence over the values from introductions to this site (see further below). Example (for a site converted to Gemini files): `gemini-output`.
- `alt` (multiple, optional): If given, the prefix for this site's URIs in a publication system other than gwit. The gwit client MAY interpret links in this site using those prefixes as if they began with a single slash (`/`) instead of the prefix and subsequent slashes. This enables reusing site contents in gwit without needing to adapt local absolute links. Multiple such prefixes may be given, each as a different `alt` value. Example: `https://foo.example.net/bar/` enables rewriting `https://foo.example.net/bar//page.html` to `/page.html`.

The scope of the different site configuration values is described below:


@@ 113,7 116,7 @@ The scope of the different site configuration values is described below:
- `site.<ID>.name`: The value in the latest site version, if defined, SHOULD be read on initial site retrieval, then applied to all versions of the site (past and future, until manually overridden).
- `site.<ID>.title(-*)`, `site.<ID>.desc(-*)`: The value in the latest site version, if defined, MAY be applied to previous versions as well, though the value in a specific version, if defined, SHOULD be applied to that version.
- `site.<ID>.license`, `site.<ID>.root`, `site.<ID>.index`, `site.<ID>.alt`: The value in a specific version, if defined, SHOULD be applied only to that version.
-- `site.<ID>.remote`, `site.<ID>.branch`: The value in the latest site version, if defined, MAY be applied on initial site retrieval and updates. The handling of old values is at the discretion of the gwit client.
+- `site.<ID>.remote`: The value in the latest site version, if defined, MAY be applied on initial site retrieval and updates. The handling of old values is at the discretion of the gwit client.

This is a sample `_gwit/self.ini` file using all sections and values:



@@ 130,7 133,6 @@ index = index.gmi
remote = https://git.example.net/foo/bar-site.git
remote = https://lab.example.org/foo-mirror/bar-site.git
remote = https://hut.example.org/foo-mirror/bar-site
-branch = gemini-output
alt = https://example.net/~foo/bar/
alt = https://foo.example.net/bar/
alt = gemini://foo.example.net/bar-site/


@@ 142,9 144,9 @@ A site's `_gwit` directory may also contain **site introductions**, which allow 

An introduction for a given site MUST be contained in the file `_gwit/<ID>.ini`, where `<ID>` is the identifier of the introduced site, encoded as `0x` plus the lower case hexadecimal digits of the full fingerprint of the PGP site key. Example: `_gwit/0x0123456789abcdef0123456789abcdeffedcba98.ini`.

-The format of a site introduction is that of a site configuration file (see further above), with the exception that `site.<ID>.remote` and `site.<ID>.branch` become mandatory values. For any given introduction to a site, its identifier MUST match with the file name `_gwit/<ID>.ini` and its section `[site "<ID>"]`.
+The format of a site introduction is that of a site configuration file (see further above), with the exception that `site.<ID>.remote` becomes a mandatory value. For any given introduction to a site, its identifier MUST match with the file name `_gwit/<ID>.ini` and its section `[site "<ID>"]`.

-While the values of `site.<ID>.remote` and `site.<ID>.branch` may be used for retrieving the introduced site, the rest of values may be considered as mere hints (since there is no guarantee that they come from that site's author), and they SHOULD be overridden by the client with the equivalent values of the actual site configuration file, once available locally.
+While the value of `site.<ID>.remote` may be used for retrieving the introduced site, the rest of values may be considered as mere hints (since there is no guarantee that they come from that site's author), and they SHOULD be overridden by the client with the equivalent values of the actual site configuration file, once available locally.
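
As an illustration of the introduction rules after this change, the sketch below checks that an introduction's file name and `[site "<ID>"]` section agree and that at least one `remote` is present. The function name `check_introduction` is hypothetical, and the line-based parsing (Git-style comment lines, whitespace tolerance, case-sensitive identifier comparison) is an assumption based on the samples in this document rather than a full parser of the configuration format.

```python
import re

SECTION_RE = re.compile(r'^\[\s*site\s+"([^"]+)"\s*\]$')


def check_introduction(file_name: str, text: str) -> list[str]:
    """Return the `remote` values of an introduction, or raise ValueError."""
    # The file name (inside `_gwit/`) carries the introduced site's identifier.
    match = re.fullmatch(r"(0x[0-9a-f]+)\.ini", file_name)
    if match is None:
        raise ValueError("file name must be 0x<lower case hex digits>.ini")
    site_id = match.group(1)

    remotes: list[str] = []
    in_site_section = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#;":  # assumption: Git-style comment lines
            continue
        if line.startswith("["):
            section = SECTION_RE.match(line)
            in_site_section = section is not None
            if in_site_section and section.group(1) != site_id:
                raise ValueError("section identifier does not match the file name")
            continue
        key, sep, value = line.partition("=")
        if in_site_section and sep and key.strip() == "remote":
            remotes.append(value.strip())
    if not remotes:
        raise ValueError("an introduction needs at least one `remote` value")
    return remotes


sample = """[site "0x0123456789abcdef0123456789abcdeffedcba98"]
name = Someone's site
remote = https://hub.example.com/someone/my-gwit-site.git
remote = https://lab.example.org/s.one/gwit-site.git
"""
print(check_introduction("0x0123456789abcdef0123456789abcdeffedcba98.ini", sample))
```

Note that no `branch` value is read: with the fixed naming rule, the identifier and a remote are enough to locate the site branch.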

Also note that a gwit client MAY regard an introduction's `site.<ID>.name` as this site author's proposed name for that site (its edge name); as such, the client SHOULD allow configuring a petname value that overrides it along with other proposed names for the site.



@@ 156,7 158,6 @@ name = Someone's site
desc = The site that Someone published while studying at the University.
remote = https://hub.example.com/someone/my-gwit-site.git
remote = https://lab.example.org/s.one/gwit-site.git
-branch = published
```

## Site retrieval and content authentication


@@ 167,11 168,10 @@ If someone wants to use a client program to retrieve a gwit site for the first t

- The site identifier, i.e. the site key fingerprint. This MUST be a string of hexadecimal digits.
- The location of an existing copy of the site, accessible to it (either locally or remotely). This MUST be a local file system path or other URL format supported by Git for a remote.
-- The name of the site branch. This MUST be a non-empty string allowed by Git as a branch name.

-These ID/location/branch sets may be conveyed to the client via different methods (like person-to-person, search engines, or site directories), however this specification only covers a discovery mechanism (described further above) where each site can provide a number of introductions for other sites with their respective ID, location and branch. At any rate, the choice among a variety of available locations for the initial retrieval of a particular site is up to the implementation.
+These ID/location pairs may be conveyed to the client via different methods (like person-to-person, search engines, or site directories), however this specification only covers a discovery mechanism (described further above) where each site can provide a number of introductions for other sites with their respective ID and location. At any rate, the choice among a variety of available locations for the initial retrieval of a particular site is up to the implementation.

-To retrieve a site for the first time, given `<SITE-ID>` as its identifier, `<SITE-LOCATION>` as its location, and `<SITE-BRANCH>` as its branch, the gwit client MUST clone the Git repository at `<SITE-LOCATION>` and verify that the head of the site branch is signed by the key matching `<SITE-ID>`. An implementation may follow the steps below, or some others with equivalent results:
+To retrieve a site for the first time, given `<SITE-ID>` as its identifier, `<SITE-LOCATION>` as its location, and `<SITE-BRANCH>` as its branch (derived from `<SITE-ID>` as described further above), the gwit client MUST clone the Git repository at `<SITE-LOCATION>` and verify that the head of the site branch is signed by the key matching `<SITE-ID>`. An implementation may follow the steps below, or some others with equivalent results:

1. Clone the Git repository from the given location into temporary storage (e.g. `git clone --bare --branch <SITE-BRANCH> <SITE-LOCATION> <TEMP-REPO> && cd <TEMP-REPO>`).
2. Get the commit hash of the head of the site branch as `<HEAD-COMMIT>` (e.g. `git show-ref --verify --hash refs/heads/<SITE-BRANCH>`).


@@ 191,21 191,20 @@ After the previous steps, the client MAY access the `_gwit/self.ini` file in the

### Site updates

-If someone wants to retrieve updates to a gwit site identified by `<SITE-ID>` for which they already have a Git clone in persistent client storage, the gwit client MUST choose one of its remotes `<REMOTE>`, fetch new items from it (including site key updates), verify that the new head of the site branch is signed by the key matching `<SITE-ID>` and a successor of its current head, and then point the site branch to its new head. An implementation may follow the steps below, or some others with equivalent results:
+If someone wants to retrieve updates to a gwit site identified by `<SITE-ID>` for which they already have a Git clone in persistent client storage, the gwit client MUST choose one of its remotes `<REMOTE>`, fetch new items from it (including site key updates), verify that the new head of the site branch `<SITE-BRANCH>` (derived from `<SITE-ID>` as described further above) is signed by the key matching `<SITE-ID>` and a successor of its current head, and then point the site branch to its new head. An implementation may follow the steps below, or some others with equivalent results:

-1. Get the name of the site branch in the clone (referenced by its `HEAD`) as `<BRANCH>` (e.g. `git symbolic-ref --short HEAD`).
-2. Get the commit hash of its current head as `<OLD-HEAD>` (e.g. `git show-ref --verify --hash refs/heads/<BRANCH>`).
-3. Try to fetch new objects from `<REMOTE>` (e.g. `git fetch --atomic --no-write-fetch-head <REMOTE> '+refs/heads/*:refs/remotes/<REMOTE>/*'`; this preserves all fetch heads for each remote).
-4. Get the commit hash of the new head as `<NEW-HEAD>` (e.g. `git show-ref --verify --hash refs/remotes/<REMOTE>/<BRANCH>`).
-5. Check that `<NEW-HEAD>` is not an ancestor of the current head (e.g. not `git merge-base --is-ancestor <NEW-HEAD> <OLD-HEAD>`). If it is, then `<REMOTE>` does not contain newer content.
-6. Update the site key (e.g. to allow new signing subkeys) by importing the `_gwit/self.key` file into the client's keyring (e.g. `git cat-file blob <NEW-HEAD>:_gwit/self.key | gpg --homedir <CLIENT-GPG-DIR> --import-options merge-only --import`).
-7. Check that `<NEW-HEAD>` has a valid signature by the key that matches `<SITE-ID>` (case-insensitively), or by a subkey of it (e.g. `git verify-commit --raw <NEW-HEAD> 2>&1 | sed -nE 's/^\[GNUPG:\] VALIDSIG .*\b(\S+)$/\1/p'` reports `<SITE-ID>`).
-8. If the current head is not an ancestor of `<NEW-HEAD>` (e.g. not `git merge-base --is-ancestor <OLD-HEAD> <NEW-HEAD>`), then `<REMOTE>` contains a **site history rewrite**. This scenario is supported by the specification, and this step may or may not succeed depending on different conditions (see further below).
-9. Update the head of `<BRANCH>` in the clone to `<NEW-HEAD>` (e.g. `git update-ref refs/heads/<BRANCH> <NEW-HEAD>`).
+1. Get the commit hash of the current head of `<SITE-BRANCH>` as `<OLD-HEAD>` (e.g. `git show-ref --verify --hash refs/heads/<SITE-BRANCH>`).
+2. Try to fetch new objects from `<REMOTE>` (e.g. `git fetch --atomic --no-write-fetch-head <REMOTE> '+refs/heads/*:refs/remotes/<REMOTE>/*'`; this preserves all fetch heads for each remote).
+3. Get the commit hash of the new head as `<NEW-HEAD>` (e.g. `git show-ref --verify --hash refs/remotes/<REMOTE>/<SITE-BRANCH>`).
+4. Check that `<NEW-HEAD>` is not an ancestor of the current head (e.g. not `git merge-base --is-ancestor <NEW-HEAD> <OLD-HEAD>`). If it is, then `<REMOTE>` does not contain newer content.
+5. Update the site key (e.g. to allow new signing subkeys) by importing the `_gwit/self.key` file into the client's keyring (e.g. `git cat-file blob <NEW-HEAD>:_gwit/self.key | gpg --homedir <CLIENT-GPG-DIR> --import-options merge-only --import`).
+6. Check that `<NEW-HEAD>` has a valid signature by the key that matches `<SITE-ID>` (case-insensitively), or by a subkey of it (e.g. `git verify-commit --raw <NEW-HEAD> 2>&1 | sed -nE 's/^\[GNUPG:\] VALIDSIG .*\b(\S+)$/\1/p'` reports `<SITE-ID>`).
+7. If the current head is not an ancestor of `<NEW-HEAD>` (e.g. not `git merge-base --is-ancestor <OLD-HEAD> <NEW-HEAD>`), then `<REMOTE>` contains a **site history rewrite**. This scenario is supported by the specification, and this step may or may not succeed depending on different conditions (see further below).
+8. Update the head of `<SITE-BRANCH>` in the clone to `<NEW-HEAD>` (e.g. `git update-ref refs/heads/<SITE-BRANCH> <NEW-HEAD>`).
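
The revised update steps can be sketched roughly as follows. This is only an illustration: the function names and the outcome labels (`no-newer-content`, `history-rewrite`, `fast-forward`) are hypothetical, the Git command lines are the examples quoted in steps 1 to 3, and the key import and signature check of steps 5 and 6 are deliberately left out. The sketch builds argument lists without running Git, so a caller would still have to execute them (e.g. with `subprocess.run`) and feed the two `git merge-base --is-ancestor` results into the classifier.

```python
def update_commands(remote: str, site_branch: str) -> dict[str, list[str]]:
    """Build the example Git command lines for steps 1 to 3 of an update."""
    return {
        # Step 1: current head of the site branch in the local clone.
        "old_head": ["git", "show-ref", "--verify", "--hash",
                     f"refs/heads/{site_branch}"],
        # Step 2: fetch all heads of the remote under its own namespace.
        "fetch": ["git", "fetch", "--atomic", "--no-write-fetch-head", remote,
                  f"+refs/heads/*:refs/remotes/{remote}/*"],
        # Step 3: new head of the site branch as seen on the remote.
        "new_head": ["git", "show-ref", "--verify", "--hash",
                     f"refs/remotes/{remote}/{site_branch}"],
    }


def classify_update(new_is_ancestor_of_old: bool,
                    old_is_ancestor_of_new: bool) -> str:
    """Combine the ancestry checks of steps 4 and 7 into one outcome label."""
    # Step 4: the remote has nothing newer than the current head.
    if new_is_ancestor_of_old:
        return "no-newer-content"
    # Step 7: the current head is not reachable from the new head.
    if not old_is_ancestor_of_new:
        return "history-rewrite"
    return "fast-forward"


for name, argv in update_commands("origin", "gwit-0xfedcba98").items():
    print(name, " ".join(argv))
print(classify_update(False, True))
```

Compared with the previous procedure, the branch name is an input derived from the site identifier, so the clone's `HEAD` is never consulted.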

Any error or failed check in the previous steps would cause the process to stop at the current step, discard any temporary data, and report an error. If the Git clone includes additional remotes, the client MAY choose to repeat the procedure with another one in case of error, or to look for newer content.

-After the previous steps, the client MAY access the `_gwit/self.ini` file in `<NEW-HEAD>` commit (e.g. `git cat-file blob <NEW-HEAD>:_gwit/self.ini`) and apply any relevant configuration values (see further above). In particular, a change in `site.<SITE-ID>.remote` or `site.<SITE-ID>.branch` MAY trigger another update with the new values (e.g. after `git remote set-url origin <NEW-REMOTE>`, or `git update-ref refs/heads/<NEW-BRANCH> refs/remotes/<REMOTE>/<NEW-BRANCH>` and `git symbolic-ref HEAD refs/heads/<NEW-BRANCH>`, respectively).
+After the previous steps, the client MAY access the `_gwit/self.ini` file in `<NEW-HEAD>` commit (e.g. `git cat-file blob <NEW-HEAD>:_gwit/self.ini`) and apply any relevant configuration values (see further above). In particular, a change in `site.<SITE-ID>.remote` MAY trigger another update with the new value (e.g. after `git remote set-url origin <NEW-REMOTE>`).

### Site history rewrites



@@ 243,7 242,7 @@ with parts in square brackets being optional, and where
  Links found inside of a gwit site may also use the string `self` (case-insensitive) for `<SITE>`, which allows the site to easily link to a particular version of itself (i.e. `gwit://<VERSION>@self<PATH>…`). When parsing such a URI, a gwit client MUST first replace `self` with the site identifier as described above. URIs using `self` MUST NOT be allowed outside of a site, and a gwit client SHOULD replace `self` with the site identifier when exporting them (e.g. when copying the URI to the clipboard).
- `<VERSION>`, when present and not empty, specifies a particular version of the target site. It is the object name (hash) of a Git commit in the site's history, encoded as a string of hexadecimal digits (case-insensitive). The name may be shortened by removing characters from its end, but this may cause content retrieval to fail if the client's Git clone of the site contains several commits with that same shortened name.

-  When `<VERSION>` is missing or empty, the URI refers to whatever site version is most recent to a client when it accesses the site (i.e. the head of the site branch in the client's Git clone of the site, referenced by its `HEAD`).
+  When `<VERSION>` is missing or empty, the URI refers to whatever site version is most recent to a client when it accesses the site (i.e. the head of the site branch in the client's Git clone of the site).

  `<VERSION>` may also be the name of a Git tag or branch in the client's Git clone of the site, percent-encoded if it contains reserved characters (as per Sections 2.1 and 2.2 of RFC3986). A tag signed by the site key may be used to succinctly convey a relevant point in the history of a site (like a release name). In contrast, branches and unsigned tags cannot be authenticated and their names may vary between clients, thus URIs using them SHOULD NOT be published in the general case, though gwit clients MAY support them as they can be useful for local debugging or internal site authoring.
- `<PATH>` is the absolute path of a file or directory in the site version referenced by the previous parts of the URI. The root of the path is the site's root directory (as per site configuration) in the Git commit corresponding to the desired site version, so the path maps to a Git blob object or tree object reachable from it.


@@ 316,7 315,7 @@ Let `gwit://[<VERSION>@]<SITE><PATH>` be the URI which identifies a particular f

The client MUST then establish which Git commit `<COMMIT>` to use, according to the `<VERSION>` in the URI (which has already been percent-decoded if necessary), by following the first of the steps below whose condition applies:

-1. If `<VERSION>` is missing or empty, get the commit hash of the head of the site branch (referenced by `HEAD`) as `<COMMIT>` (e.g. `git show-ref --verify --hash HEAD`).
+1. If `<VERSION>` is missing or empty, get the commit hash of the head of the site branch `<SITE-BRANCH>` (derived from `<SITE-ID>` as described further above) as `<COMMIT>` (e.g. `git show-ref --verify --hash refs/heads/<SITE-BRANCH>`).
2. Else, if `<VERSION>` matches the format of a SHA-1 hash (40 hexadecimal digits, lower or upper case), use it as `<COMMIT>`. This is the case for a permanent link.
3. Else, if `<VERSION>` consists only of hexadecimal digits (lower or upper case), check that there is neither tag nor branch with that name (e.g. `git show-ref --tags --heads <VERSION>` reports nothing), then check that it is the prefix of a single commit object, and use its complete name as `<COMMIT>` (e.g. `git rev-parse --verify <VERSION>^{commit}` only reports the `<COMMIT>`).
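
The purely syntactic part of the three steps shown above might be sketched as below. The function name `classify_version` and its return labels are hypothetical. The sketch does not touch a repository, so the extra checks of step 3 (no tag or branch with that name, a unique commit prefix) still have to be done with Git, and any later steps not shown in this excerpt are lumped together as `other`.

```python
import re


def classify_version(version: str) -> str:
    """Tell which of the first three resolution steps a `<VERSION>` falls under."""
    # Step 1: missing or empty, so use the head of the site branch.
    if not version:
        return "site-branch-head"
    # Step 2: a full SHA-1 hash (40 hex digits, any case) is a permanent link.
    if re.fullmatch(r"[0-9A-Fa-f]{40}", version):
        return "full-commit-hash"
    # Step 3: only hex digits, possibly a shortened commit name.
    if re.fullmatch(r"[0-9A-Fa-f]+", version):
        return "possible-short-hash"
    # Anything else is handled by steps outside this excerpt.
    return "other"


for candidate in ["", "f75a37d607230fb5e82e1a52b7040e3a85c53e54", "2993f5d", "v1.0"]:
    print(repr(candidate), classify_version(candidate))
```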



@@ 358,7 357,6 @@ A Well-Known URI ([RFC8615][]) MAY be used to provide such site metadata, access
    "Well-Known Uniform Resource Identifiers (URIs) (RFC 8615)"

- `remote` (multiple, mandatory): The URL of a Git remote from where the site can be retrieved; equivalent to `site.<ID>.remote` in the site configuration file.
-- `branch` (single, mandatory): The branch of the Git repository to be used as the default branch for the site; equivalent to `site.<ID>.branch` in the site configuration file.

Which is effectively a site's introduction to itself. Other sections and values SHOULD be ignored. An example of such file follows:



@@ 366,7 364,6 @@ Which is effectively a site's introduction to itself. Other sections and values 
[site "0xfedcba98765432100123456789abcdef76543210"]
remote = https://git.example.net/foo/bar-site.git
remote = https://lab.example.org/foo-mirror/bar-site.git
-branch = gemini-output
```

-Since the same values of `site.<ID>.remote` and `site.<ID>.branch` may also appear in a site's configuration file, and unknown values in `<SITE-ROOT>/.well-known/gwit.ini` are ignored, a site author may make the latter a symbolic link to the former to avoid duplicating information among both files.
+Since the same values of `site.<ID>.remote` may also appear in a site's configuration file, and unknown values in `<SITE-ROOT>/.well-known/gwit.ini` are ignored, a site author may make the latter a symbolic link to the former to avoid duplicating information among both files.

M issues/avoid-attack-on-bad-branch-name.gmi => issues/avoid-attack-on-bad-branch-name.gmi +10 -0
@@ 26,3 26,13 @@ For instance, with a fixed "gwit" branch, the author of a new site may just rena
To support multiple gwit sites in a single Git repository (e.g. one Web version and one Gemini version generated from common sources, which is indeed desirable for compatibility reasons), "gwit" would be insufficient, and the branch name should derive from the site key (so it is deterministic, not arbitrary). This may also be unfriendly to the author, and prone to branch misnaming errors; however these would not break client operation, and fixing them would just be a matter of renaming the branch and pushing at any later moment.

For instance, possible candidates for the branch name format would be "gwit-0xfedcba98765432100123456789abcdef76543210" (full ID), "gwit-0x89abcdef76543210" (PGP long ID), "gwit-0x76543210" (PGP short ID), or without the "0x", or just with enough digits that a collision between site IDs in the same repository is sufficiently unlikely; however, using a proper PGP ID in the name may be more useful.

+## Resolution
+
+Because of the design, specification and implementation complexity introduced by allowing arbitrary site branch names, whether already known (see above) or yet unforeseen (as the "feature" was not properly specified), and in spite of the relative inconveniences that it may introduce for site authors, the spec has been updated to require a fixed site branch name derived from the site ID. The chosen format is like "gwit-0x76543210", i.e. the lower case short ID of the PGP site key, as the short ID is easy to identify visually (because of the presence of "0x", even if it matches a word), it may be copied & pasted whenever a PGP key ID is required, and it is short enough but more than sufficient to differentiate site keys in the same Git repository.
+
+Regarding the inconveniences, the change goes in line with the "Simplicity and human scale" and "Feature minimalism" architectural goals, which have higher priorities than "Ease of adoption" (at least as of 2023-08-02). That is, allowing arbitrary site branch names may ease adoption, but at the cost of introducing unneeded features and complicating the system.
+
+In case the author sets a wrong site branch name, the problem is easy to spot with support from the community ("Humanism above techno-solutionism" architectural goal), and they only have to rename it (e.g. "git branch -m gwit-0x...") and push for other clients to be able to update their clones and access the correct branch successfully (in fact, since branches are not signed, anyone would be able to fix their clones locally). Please note that the update makes the remotes' default branches (i.e. their "HEAD") totally irrelevant, so the author does not even need to fix it in the remotes they push to; this also allows someone hosting a public repository to use it with work trees (checkouts) and alternative branches freely, as long as they do not update the site branch.
+
+* closed