~trs-80/ostrta-spec

  1. Specifications
    1. Controlled Vocabulary
      1. CV File Format
    2. Data File
    3. Filename
      1. Minimum
      2. Full Filename Specification
    4. Filesystem
    5. Timestamp-ID
      1. ostrta-id-N

#Specifications

Here follow (in alphabetical order) some more detailed notes on implementing some of the general concepts.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

#Controlled Vocabulary

For a conceptual overview, see the Controlled Vocabulary section in General Concepts.

  1. A CV item is a contiguous word or term used as an additional axis of metadata. It is commonly called a "tag", but that is only one usage, so here we use the more general term.
    1. By contiguous, we mean that spaces MUST NOT be used.

    2. Underscores, camelCase, PascalCase, etc. MAY be used instead within CV items.

#CV File Format

The CV file format implements the idea of keeping additional disambiguation notes in the same place you choose the CV item from, using a simple plain text file.

Using the common example of selecting tag(s), the plain text CV file we propose looks like:

tag1
tag2    <- tag3
tag3    use tag2 instead
  1. Where:

    1. The CV item (i.e., "tag") MUST appear at the beginning of each line.

    2. CV items MUST be separated by newlines.

    3. CV item MAY be followed by OPTIONAL disambiguation notes. If notes follow, they MUST be separated from CV item by at least one space character.

      1. This makes discarding the disambiguation notes from the desired tag (after selection) trivial in many different programming languages.
    4. Redirection from one CV item to another MAY be indicated with a simple arrow glyph made of "less than" and hyphen (<-).

    5. Other than above extremely simple requirements, you are not only free but actually encouraged to use whatever terms, glyphs, etc. make sense to you personally.

  2. In addition to the above:

    1. Implementations SHOULD provide a user selectable option whether to limit selections strictly to the choices in CV file, or allow adding new items "on the fly."
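
The rules above can be sketched in a few lines of Python. This is only an illustration, not part of any existing implementation: the names `parse_cv_line`, `load_cv`, and `add_item` are hypothetical, and the sketch assumes that any run of whitespace (not only spaces) separates the CV item from its notes.

```python
def parse_cv_line(line):
    """Split one CV file line into (item, notes); return None for blank lines."""
    # The CV item is everything up to the first whitespace; the rest is notes.
    parts = line.split(None, 1)
    if not parts:
        return None
    item = parts[0]
    notes = parts[1].strip() if len(parts) > 1 else ""
    return item, notes


def load_cv(text):
    """Return a dict mapping each CV item to its (possibly empty) notes."""
    entries = {}
    for line in text.split("\n"):
        parsed = parse_cv_line(line)
        if parsed is not None:
            item, notes = parsed
            entries[item] = notes
    return entries


def add_item(entries, item, strict=True):
    """Accept an item; in non-strict mode, new items are added on the fly."""
    if item in entries:
        return True
    # Strict mode limits choices to the file; items MUST NOT contain spaces.
    if strict or not item or any(ch.isspace() for ch in item):
        return False
    entries[item] = ""
    return True


SAMPLE = "tag1\ntag2    <- tag3\ntag3    use tag2 instead\n"
cv = load_cv(SAMPLE)
```

Because the item is always the first token on the line, discarding the notes after selection is a single split, whatever glyphs the notes themselves contain.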

#Data File

For tabular data meeting the following criteria:

  1. multiple fields / columns
  2. more complicated than what is possible with a CV file
  3. not nested (more than a few levels)
  4. nor otherwise complicated enough to require JSON

…we propose a simple yet dramatic improvement on the common CSV file: a return to the basic ASCII control codes, which were expressly designed for this purpose and have none of the (mostly quote-related) parsing and escaping issues of CSV files.1

| Seq | Dec | Hex | Abbrev | Name |
|-----|-----|-----|--------|------|
| `^\` | 28 | 1C | FS | File Separator |
| `^]` | 29 | 1D | GS | Group Separator |
| `^^` | 30 | 1E | RS | Record Separator |
| `^_` | 31 | 1F | US | Unit Separator |

The table above and the quote below are from the Wikipedia article "C0 and C1 control codes".2

Can be used as delimiters to mark fields of data structures. If used for hierarchical levels, US is the lowest level (dividing plain-text data items), while RS, GS, and FS are of increasing level to divide groups made up of items of the level beneath it.

Therefore we propose:

  1. For many types of tabular data, it is enough to simply use US (^_) instead of the comma delimiter of CSV, and therefore that is what you SHOULD do.
    • In which case, quoting is not required, nor escaping of quotes, eliminating all related parsing issues.
  2. Newline MAY be used as the record (row) separator (not to be confused with the ASCII RS character above); in fact, it SHOULD be used in the common case of simple, flat tabular data.
  3. "Higher" levels (according to the Wikipedia quote above) of control character delimiters (e.g., RS, GS, FS) SHOULD only be used in cases where additional levels of depth / grouping are required.
  4. When depth / complexity (or other requirements) exceed what this can provide, other common, free and open, and widely supported data formats (e.g., JSON, etc.) SHOULD be used instead.
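
As a minimal sketch of the simple, flat case (US between fields, newline between records), the following Python shows reading and writing such a file. The function names `dump_rows` and `load_rows` are hypothetical, and rejecting fields that contain a delimiter is my assumption; the spec does not say what to do in that case.

```python
US = "\x1f"  # ASCII Unit Separator (^_), decimal 31


def dump_rows(rows):
    """Serialize rows (lists of strings) as US-delimited fields, one per line."""
    lines = []
    for row in rows:
        for field in row:
            # Assumption: a field containing a delimiter is an error,
            # since there is no quoting or escaping mechanism.
            if US in field or "\n" in field:
                raise ValueError("field contains a delimiter: %r" % field)
        lines.append(US.join(row))
    return "\n".join(lines) + "\n"


def load_rows(text):
    """Parse US-delimited text back into a list of rows (blank lines skipped)."""
    # split("\n") is used instead of splitlines(), because splitlines()
    # also treats FS, GS, and RS as line boundaries.
    return [line.split(US) for line in text.split("\n") if line]


rows = [["name", "quote"], ["Ada", 'She said "hi, there"']]
encoded = dump_rows(rows)
decoded = load_rows(encoded)
```

Note that the comma and the double quotes in the second record pass through untouched, which is exactly the class of problem CSV needs quoting rules for.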

#Filename

The filename spec is based upon (and closely related to) the timestamp-ID spec.

A simple example (in this case, a photo filename):

YYYY-MM-DD-HHMM_description_text_here--tag1-tag2-tag3_with_spaces.jpg

#Minimum

The minimum file name considered to be following the spec would be a simple ostrta-id-4 with no extension:

YYYY-MM-DD-HHMM

In the Elisp implementation, this simple check is performed by the function ostrta-filename-p, which in turn uses the variable ostrta-id-4-regexp.
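
For readers who do not use Emacs, a comparable check might look like the Python sketch below. The pattern is my own guess at the shape of an ostrta-id-4 and is not copied from `ostrta-id-4-regexp`; the function name is likewise hypothetical. It only checks the digit layout, not whether the month, day, hour, or minute are in range.

```python
import re

# Hypothetical pattern for illustration; not taken from the Elisp variable.
ID4_RE = re.compile(r"\d{4}-\d{2}-\d{2}-\d{4}")


def looks_like_ostrta_filename(name):
    """Return True if name begins with a YYYY-MM-DD-HHMM shaped timestamp-ID."""
    return ID4_RE.match(name) is not None


checks = [
    looks_like_ostrta_filename("2021-01-01-2029"),
    looks_like_ostrta_filename("2021-01-01-2029_photo.jpg"),
    looks_like_ostrta_filename("photo.jpg"),
    looks_like_ostrta_filename("2021-1-1-2029"),
]
```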

#Full Filename Specification

A much more detailed definition:

timestamp-id [_description...] [--[tag...]-another_tag...] [.ext]
  1. timestamp-id is the only strictly required part and therefore MUST follow ostrta-id-4 (at minimum) but MAY achieve higher resolution by following ostrta-id-6, ostrta-id-8, etc. See the timestamp-ID specification for further detail.

  2. description is OPTIONAL but if present MUST start with an underscore (_) delimiter to clearly mark its separation from the timestamp.

    1. The initial delimiter (_) is not considered a part of the description. It is a delimiter.

    2. Illegal characters throughout the file name depend on the file system. Having said that, I think the project SHOULD endeavour to develop a short list which any implementation SHOULD check against when implementing any sort of (re-)naming function(s).

      1. exFAT (common on larger SD cards), for example, does not allow any of: / \ : * ? " < > |
    3. Besides the above, I think we SHOULD NOT use spaces (personally I use underscores instead) but I guess that does not have to be part of spec.

    4. Note that periods (.) MAY be present in description. N.B. how we define filename extension (.ext) below!

  3. tags are also OPTIONAL but if present MUST start with a double hyphen (--) delimiter to clearly separate them from the description.

    1. The initial delimiter (--) SHALL NOT be considered a part of any tags. It is a delimiter.

    2. Within tags, there MAY be spaces, but again, underscores SHOULD be used instead.

    3. Different tags MUST be separated by a hyphen (-) as delimiter.

      1. Corollary to this, individual tags MUST NOT contain hyphens (-).
    4. Note that periods (.) MAY be present in tags. N.B. how we define filename extension (.ext) below!

  4. We define filename extension (.ext) as the last group of legal characters (including letters, numbers, symbols) at the end of the file name after the last period (.).

    1. This means that extensions MAY be arbitrary length. I get a headache just thinking about the potential implications here, so I would welcome feedback from anyone who has more experience dealing with something like this. In particular I wonder if we should limit it to some number of characters.

    2. At the moment nothing really relies on this anyway, but some day it might, hence me trying to come up with a good definition here.

  5. Editing filename after initial creation or processing:

    1. The optional parts of file name (description, tags, etc.) MAY (and should) change!

    2. The timestamp-id portion MUST NOT change (after initial assignment / processing).

    3. The intention of this rule is to ensure the timestamp-id portion of the filename remains a reliable identifier.

Alternatively, you MAY leave the base timestamp-id there by itself (perhaps only along with the extension) and implement your metadata in another index file or even a database (although plain text files are always preferred).3
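
To make the full filename specification concrete, here is a rough Python sketch that splits a name into its parts and rebuilds it. Everything named here (`ParsedName`, `parse_filename`, `build_filename`, `TIMESTAMP_RE`) is hypothetical. It assumes only ostrta-id-4 and ostrta-id-6 timestamps, and it assumes the first double hyphen after the timestamp starts the tags, which the spec does not state explicitly.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical pattern covering ostrta-id-4 and ostrta-id-6 only.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}-\d{4}(?:\d{2})?")


class ParsedName(NamedTuple):
    timestamp_id: str
    description: str
    tags: List[str]
    ext: str


def parse_filename(name: str) -> Optional[ParsedName]:
    """Split a filename into its parts, or return None if it does not fit."""
    match = TIMESTAMP_RE.match(name)
    if match is None:
        return None
    rest = name[match.end():]

    # The extension is whatever follows the last period, if there is one.
    stem, dot, ext = rest.rpartition(".")
    if not dot:
        stem, ext = rest, ""

    # Assumption: the first "--" starts the tags; single hyphens separate them.
    desc_part, sep, tag_part = stem.partition("--")
    tags = [t for t in tag_part.split("-") if t] if sep else []

    # A description, if present, must begin with the "_" delimiter.
    if desc_part and not desc_part.startswith("_"):
        return None
    return ParsedName(match.group(0), desc_part[1:], tags, ext)


def build_filename(parsed: ParsedName) -> str:
    """Reassemble a filename; the delimiters are added back here."""
    name = parsed.timestamp_id
    if parsed.description:
        name += "_" + parsed.description
    if parsed.tags:
        name += "--" + "-".join(parsed.tags)
    if parsed.ext:
        name += "." + parsed.ext
    return name


EXAMPLE = "2021-01-01-2029_description_text_here--tag1-tag2-tag3_with_spaces.jpg"
parsed = parse_filename(EXAMPLE)
retagged = build_filename(parsed._replace(tags=["cats"]))
```

Renaming then becomes "parse, change the optional parts, rebuild", which leaves the timestamp-id untouched. One consequence of the extension definition is visible here: a name with a period in its description but no real extension will be read as having one.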

#Filesystem

I have a lot of ideas about how to organize my home dir. I am sure other people do, too, and therefore I am not sure how many of these ideas are appropriate for this project.

Having said that, at a minimum I think we need to have one or more of the all-important timeline structures defined therein. Consider the following an example to spur discussion rather than any sort of "standard", certainly for the time being.

One thing in particular I noticed so far is that having the intermediate month folders seemed to be more trouble than it was worth in the ~/tmp directory. So I did away with them there. However in ~/timeline, items are much more numerous, so it's useful to have folders for months because each of those could contain hundreds (or more) of files and additional directories.

~
├── timeline
│   ├── 2016
│   │   ├── 01-Jan
│   │   ├── 02-Feb
│   │   ├── 03-Mar
│   │   ├── 04-Apr
│   │   ├── 05-May
│   │   ├── 06-Jun
│   │   ├── 07-Jul
│   │   ├── 08-Aug
│   │   ├── 09-Sep
│   │   ├── 10-Oct
│   │   ├── 11-Nov
│   │   └── 12-Dec
│   ├── 2017
│   │   └── [...]
│   └── 2018
│       └── [...]
└── tmp
    ├── 2019
    │   ├── 2019-06-08_software_download
    │   └── 2019-12-31_experimental_project
    └── 2020
        ├── 2020-04-04_another_temp_dir
        └── 2020-12-18_you_get_the_idea
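
Purely as a discussion aid, the layout above could be computed from a date like this. The helper names and the fixed English month abbreviations are my assumptions, chosen so the result does not depend on the system locale; the `~` is left unexpanded on purpose.

```python
from datetime import date
from pathlib import PurePosixPath

# Fixed English abbreviations, so folder names do not vary with locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def timeline_dir(day, root="~/timeline"):
    """Return the year/month folder for a date, e.g. ~/timeline/2016/03-Mar."""
    month = "%02d-%s" % (day.month, MONTHS[day.month - 1])
    return PurePosixPath(root) / ("%04d" % day.year) / month


def tmp_dir(day, label, root="~/tmp"):
    """Return a dated temp folder with no intermediate month level."""
    return PurePosixPath(root) / ("%04d" % day.year) / (day.isoformat() + "_" + label)


timeline_path = str(timeline_dir(date(2016, 3, 5)))
tmp_path = str(tmp_dir(date(2019, 6, 8), "software_download"))
```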

#Timestamp-ID

The timestamp-ID spec is closely related to the base filename spec, and vice versa.

The Timestamp-ID specification is a very simple "ISO-like" timestamp:

YYYY-MM-DD-HHMMSS

| Token | Value | Format | Required? |
|-------|-------|--------|-----------|
| **YYYY** | the year | 4 digit | MUST |
| **MM** | the month | zero padded | MUST |
| **DD** | the day | zero padded | MUST |
| **HH** | the hour | 24 hour | MUST |
| **MM** | the minute | zero padded | MUST |
| **SS** | the second | zero padded | OPTIONAL |

Time resolution smaller than one second MAY be defined, but so far there has been no need and thus no discussion what that might look like.

#ostrta-id-N

The notion of -4 and -6 comes from the size of the last group of digits in the timestamp:

| Spec name | Format | Example | Resolution |
|-----------|--------|---------|------------|
| ostrta-id-4 | YYYY-MM-DD-HHMM | 2021-01-01-2029 | minute |
| ostrta-id-6 | YYYY-MM-DD-HHMMSS | 2021-01-01-202938 | second |

Therefore it is an expression of the level of time resolution (minute and second, respectively).

  • Historical note: At one point early on, I was using an underscore between day and time. But then I realized we are still just talking about degrees of time. And since they are all similar (time), I think we should simply stick with hyphens throughout.
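
A small Python sketch of generating and strictly validating both resolutions follows. The names `SPECS`, `make_id`, and `parse_id` are hypothetical. The digit layout is checked with a regular expression first because `datetime.strptime` on its own accepts one-digit fields and could misread a four-digit time group as hour, minute, and second.

```python
import re
from datetime import datetime

# N -> (layout pattern, strftime/strptime format)
SPECS = {
    4: (re.compile(r"\d{4}-\d{2}-\d{2}-\d{4}"), "%Y-%m-%d-%H%M"),
    6: (re.compile(r"\d{4}-\d{2}-\d{2}-\d{6}"), "%Y-%m-%d-%H%M%S"),
}


def make_id(moment, n=4):
    """Format a datetime as an ostrta-id-N string (N is 4 or 6)."""
    return moment.strftime(SPECS[n][1])


def parse_id(text):
    """Return (datetime, n) for a valid ID; raise ValueError otherwise."""
    for n, (pattern, fmt) in SPECS.items():
        if pattern.fullmatch(text):
            # strptime rejects out-of-range values such as month 13.
            return datetime.strptime(text, fmt), n
    raise ValueError("not an ostrta-id-4 or ostrta-id-6: %r" % text)


moment = datetime(2021, 1, 1, 20, 29, 38)
id4 = make_id(moment, 4)
id6 = make_id(moment, 6)
```

Since the fields are zero padded and ordered from largest to smallest unit, IDs of the same resolution sort chronologically as plain strings.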

#Footnotes

1 Credit for this idea goes to denizens of #emacs, who turned me on to this excellent article.

2 See also the ASCII article, where there is more discussion in Control characters section, and an even better chart featuring additional helpful data.

3 In fact this is the approach I took in the (as yet unreleased) Meme Manager as some memes have far too much metadata to comfortably store in the filename.