* Specifications
  :PROPERTIES:
  :CUSTOM_ID:            specifications
  :END:

Here follow (in alphabetical order) some more detailed notes on implementing some of the [[file:README.org::#general-concepts][general concepts]].

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [[https://tools.ietf.org/html/rfc2119][RFC 2119]].

** Controlled Vocabulary
   :PROPERTIES:
   :CUSTOM_ID:            controlled-vocabulary
   :END:

For a conceptual overview, see the [[file:README.org::#controlled-vocabulary][Controlled Vocabulary]] section in General Concepts.

1. *CV item* is defined as a contiguous word or term used as an additional axis of metadata.  It is commonly referred to as a "tag", but that is only one usage, so here we use the more general term.

   1. By contiguous, we mean that spaces MUST NOT be used.

   2. Under_scores, camelCase, PascalCase, etc. MAY be used instead within CV items.

*** CV File Format
    :PROPERTIES:
    :CUSTOM_ID:            cv-file-format
    :END:

This is an implementation, in a simple plain text file format, of the [[file:README.org::#disambiguation-notes-live-directly-with-cv-items][concept]] of including additional disambiguation notes directly in the same place you are choosing the CV item from.

Using the common example of selecting tag(s), the plain text CV file implementation we propose looks like:

#+begin_example
  tag1
  tag2    <- tag3
  tag3    use tag2 instead
#+end_example

1. Where:

   1. The CV item (i.e., "tag") MUST appear at the beginning of each line.

   2. CV items MUST be separated by newlines.

   3. A CV item MAY be followed by OPTIONAL disambiguation notes.  If notes follow, they MUST be separated from the CV item by at least one space character.

      1. This makes discarding the disambiguation notes from the desired tag (after selection) trivial in many different programming languages.

   4. Redirection from one CV item to another MAY be accomplished by way of a simple arrow glyph made of a less-than sign and a hyphen (=<-=).

   5. Other than above extremely simple requirements, you are not only free but actually encouraged to use whatever terms, glyphs, etc. make sense /to you personally/.

2. In addition to the above:

   1. Implementations SHOULD provide a user-selectable option to either limit selections strictly to the choices in the CV file or allow adding new items "on the fly."
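
As a rough illustration of why the "at least one space" rule makes the notes easy to discard, here is a minimal Python sketch of reading a CV file.  The function names (=parse_cv_line=, =load_cv=, =is_allowed=) and the =strict= flag are hypothetical and not part of any existing implementation.  The sketch also assumes that leading indentation may be ignored and that notes are kept as opaque text, so the =<-= glyph is not interpreted.

#+begin_src python
def parse_cv_line(line):
    """Split one CV file line into (item, notes); return None for a blank line."""
    # split(None, 1) skips leading whitespace and splits on the first run
    # of whitespace after the CV item, so the item itself has no spaces.
    parts = line.split(None, 1)
    if not parts:
        return None
    item = parts[0]
    notes = parts[1].strip() if len(parts) > 1 else ""
    return item, notes


def load_cv(text):
    """Return a dict mapping each CV item to its (possibly empty) notes."""
    cv = {}
    for line in text.split("\n"):
        parsed = parse_cv_line(line)
        if parsed is not None:
            cv[parsed[0]] = parsed[1]
    return cv


def is_allowed(item, cv, strict=True):
    """Check a selection; strict=False permits new items "on the fly"."""
    if item in cv:
        return True
    # New items must still be contiguous (no whitespace), per the CV item rule.
    return (not strict) and item != "" and not any(c.isspace() for c in item)


example = "  tag1\n  tag2    <- tag3\n  tag3    use tag2 instead\n"
cv = load_cv(example)
print(sorted(cv))
#+end_src

Keeping the notes as free text matches the intent that users may write whatever terms or glyphs make sense to them; an implementation that wants to follow =<-= redirections would need to agree on a stricter convention first.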

** Data File

For tabular data meeting the following criteria:

1. multiple fields / columns
2. more complicated than what is possible with a [[#cv-file-format][CV file]]
3. not nested (more than a few levels)
4. nor otherwise complicated enough to require JSON

...we propose a simple yet dramatic improvement to the common CSV file: a return to the basic ASCII control codes, which were expressly designed for the purpose and have none of the (mostly quote-related) parsing and escaping issues of CSV files.[fn:1]

|-----+-----+-----+--------+------------------|
| Seq | Dec | Hex | Abbrev | Name             |
|-----+-----+-----+--------+------------------|
| =^\=  |  28 | 1C  | FS     | File Separator   |
| =^]=  |  29 | 1D  | GS     | Group Separator  |
| =^^=  |  30 | 1E  | RS     | Record Separator |
| =^_=  |  31 | 1F  | US     | Unit Separator   |
|-----+-----+-----+--------+------------------|

The table above and the quote below are from the Wikipedia article [[https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_separators][C0 and C1 control codes]].[fn:2]

#+begin_quote
Can be used as delimiters to mark fields of data structures. If used for hierarchical levels, US is the lowest level (dividing plain-text data items), while RS, GS, and FS are of increasing level to divide groups made up of items of the level beneath it.
#+end_quote

Therefore we propose:

1. For many types of tabular data, it is enough to simply use US (=^_=) instead of the comma delimiter of CSV, and therefore that is what you SHOULD do.
   - In which case, quoting is not required, nor escaping of quotes, eliminating all related parsing issues.
2. Newline MAY be used as the record (row) separator (not to be confused with the above ASCII RS character).  In fact, it SHOULD be used in the common case of simple, flat tabular data.
3. "Higher" levels (according to the above Wikipedia quote) of control character delimiters (e.g., RS, GS, FS) SHOULD only be used in cases where additional levels of depth / grouping are required.
4. When depth / complexity (or other requirements) exceed what this can provide, other common, free and open, and widely supported data formats (e.g., JSON, etc.) SHOULD be used instead.
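
The simple, flat case described above (US between fields, newline between rows) can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the names =dump_rows= and =load_rows= are hypothetical, all fields are assumed to be strings, and fields containing US or a newline are simply rejected rather than escaped.

#+begin_src python
US = "\x1f"  # ASCII Unit Separator (decimal 31, hex 1F)


def dump_rows(rows):
    """Serialize rows of string fields: US between fields, newline after each row."""
    for row in rows:
        for field in row:
            if US in field or "\n" in field:
                raise ValueError("field contains a delimiter: %r" % field)
    return "".join(US.join(row) + "\n" for row in rows)


def load_rows(text):
    """Parse text written by dump_rows back into a list of rows."""
    # Split on "\n" explicitly: str.splitlines() also treats FS, GS and RS
    # (\x1c, \x1d, \x1e) as line boundaries, which would break deeper levels.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    return [line.split(US) for line in lines]


rows = [["name", "remark"], ["Ada", 'said "hi", then left']]
assert load_rows(dump_rows(rows)) == rows
#+end_src

Note that the second row contains both a comma and double quotes, and neither needs quoting or escaping.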

** Filename
   :PROPERTIES:
   :CUSTOM_ID:            filename
   :END:

The filename spec is based upon (and closely related to) the [[#timestamp-id][timestamp-ID]] spec.

A simple example (in this case, a photo filename):

#+begin_example
  YYYY-MM-DD-HHMM_description_text_here--tag1-tag2-tag3_with_spaces.jpg
#+end_example

*** Minimum
    :PROPERTIES:
    :CUSTOM_ID:            minimum
    :END:

The minimum file name considered to be following the spec would be a simple [[#ostrta-id-n][ostrta-id-4]] with no extension:

#+begin_example
  YYYY-MM-DD-HHMM
#+end_example

In the Elisp implementation, this simple check is performed by the function =ostrta-filename-p=, which in turn uses the variable =ostrta-id-4-regexp=.
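
For readers who do not use Emacs, a comparable shape check might look like the following Python sketch.  It is not a port of =ostrta-filename-p= or =ostrta-id-4-regexp=; the names =ID4_PREFIX= and =looks_like_ostrta_filename= are hypothetical, and the check only looks at the shape of the leading digits, not at whether they form a real date.

#+begin_src python
import re

# Shape of an ostrta-id-4 (YYYY-MM-DD-HHMM) at the start of a name.
ID4_PREFIX = re.compile(r"\d{4}-\d{2}-\d{2}-\d{4}")


def looks_like_ostrta_filename(name):
    """Return True if NAME begins with something shaped like YYYY-MM-DD-HHMM."""
    return ID4_PREFIX.match(name) is not None


print(looks_like_ostrta_filename("2021-01-01-2029"))
print(looks_like_ostrta_filename("holiday_photo.jpg"))
#+end_src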

*** Full Filename Specification
    :PROPERTIES:
    :CUSTOM_ID:            full-filename-specification
    :END:

A much more detailed definition:

#+begin_example
  timestamp-id [_description...] [--[tag...]-another_tag...] [.ext]
#+end_example

1. *timestamp-id* is the only strictly required part and therefore MUST follow ostrta-id-4 (at minimum) but MAY achieve higher resolution by following ostrta-id-6, ostrta-id-8, etc.  See the [[#timestamp-id][timestamp-ID]] specification for further detail.

2. *description* is OPTIONAL but if present MUST start with an underscore (=_=) delimiter to clearly mark its separation from the timestamp.

   1. The initial delimiter (=_=) is not considered a part of the description.  It is a delimiter.

   2. Illegal characters throughout the file name depend on the file system.  Having said that, I think the project SHOULD endeavour to develop a short list which any implementation SHOULD check against when implementing any sort of (re-)naming function(s).

      1. exFAT (common on larger SD cards), for example, does not allow {=/\:*?"<>|=}

   3. Besides the above, I think we SHOULD NOT use spaces (personally I use underscores instead) but I guess that does not have to be part of spec.

   4. Note that periods (=.=) MAY be present in description.  N.B. how we define filename extension (.ext) below!

3. *tags* are also OPTIONAL but if present MUST start with a double hyphen (=--=) delimiter to clearly separate them from the description.

   1. The initial delimiter (=--=) SHALL NOT be considered a part of any tags.  It is a delimiter.

   2. Within tags, there MAY be spaces, but again, underscores SHOULD be used instead.

   3. Different tags MUST be separated by a hyphen (=-=) as delimiter.

      1. Corollary to this, individual tags MUST NOT contain hyphens (=-=).

   4. Note that periods (=.=) MAY be present in tags.  N.B. how we define filename extension (.ext) below!

4. We define filename extension (*.ext*) as the last group of legal characters (including letters, numbers, symbols) at the end of the file name after the last period (=.=).

   1. This means that extensions MAY be arbitrary length.  I get a headache just thinking about the potential implications here, so I would welcome feedback from anyone who has more experience dealing with something like this.  In particular I wonder if we should limit it to some number of characters.

   2. At the moment nothing really relies on this anyway, but some day it might, hence me trying to come up with a good definition here.

5. Editing filename after initial creation or processing:

   1. The optional parts of the file name (description, tags, etc.) MAY (and /should/) change!

   2. The timestamp-id portion MUST never change (after initial assignment / processing).

   3. The intention of this rule is to ensure the timestamp-id portion of the filename remains a reliable identifier.

Alternatively, you MAY leave the base timestamp-id there by itself (perhaps only along with the extension) and implement your metadata in another index file or even a database (although plain text files are always [[file:README.org::#relying-strictly-on-floss-and-lowest-common-denominator-formats][preferred]]).[fn:3]
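
To make the full filename layout concrete, here is a hedged Python sketch that splits a name into its four parts.  Everything named here (=ParsedName=, =parse_filename=, =ID_RE=) is hypothetical.  The sketch assumes only ostrta-id-4 and ostrta-id-6 timestamps, treats everything after the last period as the extension (as defined above), and assumes the first =--= after the timestamp-id begins the tags, which the spec does not state explicitly.

#+begin_src python
import re
from typing import NamedTuple, Optional

# ostrta-id-4 with optional seconds (ostrta-id-6); higher resolutions are not handled.
ID_RE = re.compile(r"\d{4}-\d{2}-\d{2}-\d{4}(?:\d{2})?")


class ParsedName(NamedTuple):
    timestamp_id: str
    description: Optional[str]
    tags: list
    ext: Optional[str]


def parse_filename(name):
    """Split NAME into timestamp-id, description, tags and extension."""
    match = ID_RE.match(name)
    if match is None:
        raise ValueError("name does not start with a timestamp-id: %r" % name)
    rest = name[match.end():]

    # Extension: whatever follows the last period, if there is one.
    ext = None
    if "." in rest:
        rest, _, ext = rest.rpartition(".")

    # Assumption: the first "--" separates the description from the tags.
    head, sep, tail = rest.partition("--")
    if head and not head.startswith("_"):
        raise ValueError("description must start with '_': %r" % name)
    description = head[1:] if head else None  # the "_" delimiter is not kept
    tags = [tag for tag in tail.split("-") if tag] if sep else []
    return ParsedName(match.group(0), description, tags, ext)


print(parse_filename("2021-01-01-2029_description_text_here--tag1-tag2.jpg"))
#+end_src

Because the timestamp-id is matched first and returned unchanged, a renaming tool built on a parser like this can rewrite the description and tags while leaving the identifier alone.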

** Filesystem
   :PROPERTIES:
   :CUSTOM_ID:            filesystem
   :END:

I have a lot of ideas about how to organize my home dir.  I am sure other people do, too, and therefore I am not sure how many of these ideas are appropriate for this project.

Having said that, at a minimum I think we need to have one or more of the all-important timeline structures defined therein.  Consider the following as an example to spur discussion, rather than any sort of "standard", certainly for the time being.

One thing in particular I noticed so far is that having the intermediate month folders seemed to be more trouble than it was worth in the =~/tmp= directory.  So I did away with them there.  However in =~/timeline=, items are much more numerous, so it's useful to have folders for months because each of those could contain hundreds (or more) of files and additional directories.

#+begin_example
  ~
  ├── timeline
  │   ├── 2016
  │   │   ├── 01-Jan
  │   │   ├── 02-Feb
  │   │   ├── 03-Mar
  │   │   ├── 04-Apr
  │   │   ├── 05-May
  │   │   ├── 06-Jun
  │   │   ├── 07-Jul
  │   │   ├── 08-Aug
  │   │   ├── 09-Sep
  │   │   ├── 10-Oct
  │   │   ├── 11-Nov
  │   │   └── 12-Dec
  │   ├── 2017
  │   │   └── [...]
  │   └── 2018
  │       └── [...]
  └── tmp
      ├── 2019
      │   ├── 2019-06-08_software_download
      │   └── 2019-12-31_experimental_project
      └── 2020
          ├── 2020-04-04_another_temp_dir
          └── 2020-12-18_you_get_the_idea
#+end_example
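
If the layout above were adopted, computing the target directory for a given date could look like this Python sketch.  The function names (=timeline_dir=, =tmp_dir=), the =root= defaults, and the fixed English month abbreviations are all assumptions taken from the example tree, not part of any standard.

#+begin_src python
from datetime import date
from pathlib import PurePosixPath

# Fixed abbreviations so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def timeline_dir(day, root="~/timeline"):
    """Year folder, then a month folder such as 03-Mar."""
    month = "%02d-%s" % (day.month, MONTHS[day.month - 1])
    return PurePosixPath(root) / ("%04d" % day.year) / month


def tmp_dir(day, label, root="~/tmp"):
    """Year folder, then a dated directory with no month level in between."""
    return PurePosixPath(root) / ("%04d" % day.year) / ("%s_%s" % (day.isoformat(), label))


print(timeline_dir(date(2016, 3, 5)))
print(tmp_dir(date(2019, 6, 8), "software_download"))
#+end_src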

** Timestamp-ID
   :PROPERTIES:
   :CUSTOM_ID:            timestamp-id
   :END:

Related closely to the base [[#filename][filename]] spec, and vice-versa.

The Timestamp-ID specification is a very simple "ISO-like" timestamp:

#+begin_example
  YYYY-MM-DD-HHMMSS
#+end_example

|-------+------------+-------------+-----------|
| Token | Value      | Format      | Required? |
|-------+------------+-------------+-----------|
| *YYYY*  | the year   | 4 digit     | MUST      |
| *MM*    | the month  | zero padded | MUST      |
| *DD*    | the day    | zero padded | MUST      |
| *HH*    | the hour   | 24 hour     | MUST      |
| *MM*    | the minute | zero padded | MUST      |
| *SS*    | the second | zero padded | OPTIONAL  |
|-------+------------+-------------+-----------|

Time resolution smaller than one second MAY be defined, but so far there has been no need and thus no discussion what that might look like.
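
As a small illustration, a timestamp in this format can be produced with standard =strftime= codes.  The function name =timestamp_id= and its =seconds= flag are hypothetical, and the sketch assumes a four-digit year.

#+begin_src python
from datetime import datetime


def timestamp_id(moment, seconds=False):
    """Format MOMENT as YYYY-MM-DD-HHMM, or YYYY-MM-DD-HHMMSS if seconds=True."""
    fmt = "%Y-%m-%d-%H%M%S" if seconds else "%Y-%m-%d-%H%M"
    return moment.strftime(fmt)


moment = datetime(2021, 1, 1, 20, 29, 38)
print(timestamp_id(moment))
print(timestamp_id(moment, seconds=True))
#+end_src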

*** ostrta-id-N
    :PROPERTIES:
    :CUSTOM_ID:            ostrta-id-n
    :END:

The notion of =-4= and =-6= comes from the size of the last group of digits in the timestamp:

|-------------+-------------------+-------------------+------------|
| Spec name   | Format            | Example           | Resolution |
|-------------+-------------------+-------------------+------------|
|             |                   | <l>               |            |
| ostrta-id-4 | YYYY-MM-DD-HHMM   | 2021-01-01-2029   | minute     |
| ostrta-id-6 | YYYY-MM-DD-HHMMSS | 2021-01-01-202938 | second     |
|-------------+-------------------+-------------------+------------|

Therefore it is an expression of the level of time resolution (minute and second, respectively).

- Historical note: At one point early on, I was using an underscore between day and time.  But then I realized we are still just talking about degrees of time.  And since they are all similar (time), I think we should simply stick with hyphens throughout.
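
Since N is simply the length of the last group of digits, the resolution can be read straight from an ID.  The following Python sketch does that and also checks that the digits form a real date and time.  The names =FORMATS= and =parse_id= are hypothetical, and only =-4= and =-6= are handled.

#+begin_src python
from datetime import datetime

# N (size of the last digit group) mapped to the matching strptime format.
FORMATS = {4: "%Y-%m-%d-%H%M", 6: "%Y-%m-%d-%H%M%S"}


def parse_id(text):
    """Return (N, datetime) for an ostrta-id-N string, or raise ValueError."""
    n = len(text.rsplit("-", 1)[-1])
    # "YYYY-MM-DD-" is 11 characters; the length check enforces zero padding,
    # which strptime on its own does not.
    if n not in FORMATS or len(text) != 11 + n:
        raise ValueError("not an ostrta-id-4 or ostrta-id-6: %r" % text)
    return n, datetime.strptime(text, FORMATS[n])


print(parse_id("2021-01-01-2029"))
print(parse_id("2021-01-01-202938"))
#+end_src

A value such as =2021-01-01-202983= has the right shape but is rejected by this sketch, because 83 is not a valid number of seconds.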

* Footnotes
   :PROPERTIES:
   :CUSTOM_ID:            footnotes
   :END:

[fn:1] Credit for this idea goes to denizens of =#emacs=, who turned me on to [[https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/][this]] excellent article.

[fn:2] See also the [[https://en.wikipedia.org/wiki/ASCII][ASCII]] article, where there is more discussion in [[https://en.wikipedia.org/wiki/ASCII#Control_characters][Control characters]] section, and an even better [[https://en.wikipedia.org/wiki/ASCII#Control_code_chart][chart]] featuring additional helpful data.

[fn:3] In fact this is the approach I took in the (as yet unreleased) Meme Manager as some memes have far too much metadata to comfortably store in the filename.