~jojo/Carth

ref: db87ea9bdce9470b546bb3098f5d79a9b1023af1 Carth/TODO.org -rw-r--r-- 18.3 KiB
db87ea9bJoJo Disclaim WIP status in readme 1 year, 8 months ago
                                                                                
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
#+TITLE: The Carth programming language

Features and other stuff to do/implement in/around Carth.

*IMPORTANT*: When done implementing a TODO, do one or a combination of
the following three things, depending on complexity of the TODO and
the fix etc:

1. Delete the todo. Especially appropriate if the todo is a simple
   bugfix or such. Also, link to the commit that fixes the bug or
   whatever in the message of the commit that removes the todo.

2. Document the changes in the [[https://gitlab.com/JoJoZ/carth-website/tree/master/pages/reference.org][reference]]. Should be done when any kind
   of user-facing feature is implemented, or some other "major" change
   has been done.

3. Keep the todo, and mark it as *DONE* with a description of what was
   done and how it went. This should only be done seldomly I
   feel. Also, link to the commit that fixes the TODO would be nice.

* INACTIVE Package system

* NEXT Module system
  Postfix syntax for module paths? A bit like web-domains -
  "sub.main.top". E.g. "vector.collections.std".  Most relevant
  information floats to the left. Maybe a good idea, maybe
  not. Consider it.
** INACTIVE Allow conflicting imports if unambiguous?
   I'm thinking something that would allow the following. It would be
   less annoying than having to qualify everything. Also, gotta think
   about how this relates to overloading à la C++.

   #+BEGIN_SRC carth
   (module Foo
           (data FooThing First Second)
           (define: isFirst
               (Fun FooThing Bool)
             (fun-match
               [First True]
               [Second False])))

   (module Bar
           (data BarThing First Second)
           (define: isFirst
               (Fun BarThing Bool)
             (fun-match
               [First True]
               [Second False])))

   ;; First, there should be no error for just importing modules with conflicting
   ;; defs. This is ok in Haskell, unless one of the conflicting defs is used.
   (import Foo)
   (import Bar)

   ;; Second, it should be allowed to use one of a set of conflicting defs if the
   ;; type makes it unambiguous....

   ;; either explicitly
   (define: x FooThing First)
   (define: y BarThing First)

   ;; or implicitly
   (define t (isFirst x))
   (define u (isFirst y))
   #+END_SRC

* TODO Algebraic datatypes
  Construction, pattern matching, the whole shebang.

** DONE Basic definition, construction and pattern matching
*** Done
    We can now define algebraic datatypes with the ~(type ...)~
    special form, and construct and destruct / pattern match over
    expressions of such a type. They may be polymorphic, and the
    constructor-expressions are treated as curried functions by the
    type-checker.

    Some details were not implemented in this step. Exhaustive pattern
    checking, and handling of recursive datatypes.

** DONE Implement pattern matching w decision trees
   Slower to compile, faster to execute
** DONE Check patterns for redundancy and exhaustiveness
   For a datatype ~(type Foo Bar Baz)~, following are some examples of
   pattern matches and expected results:

   - ~(match x [Bar y] [Baz z])~ :: is exhaustive and non-reduntant, and should pass.
   - ~(match x [Bar y] [xx z])~ :: is exhaustive and non-redundant, and should pass.
   - ~(match x [Bar y])~ :: is inexhaustive and should produce an error.
   - ~(match x [xx z] [Bar y])~ :: is exhaustive but redundant, and should produce an error or warning.

   Read: /Compiling Pattern Matching to Good Decision Trees/.

*** Complete
    Made an almost 1:1 implementation of Sestoft's algorithms. It's
    good.

** DONE Handle recursive datatypes
   Consider the linked list ~(type (List a) Nil (Cons a (List a)))~,
   how should this type be represented in memory? If we just naively
   represent it as a flat structure without any indirection, it will
   be dynamically sized. This is of course no good, so we need to put
   the recursion behind some kind of pointer-indirection. The two main
   ways of doing this are (a): automatically box recursive type calls,
   like I assume Haskell does, or (b): require the user to explicitly
   rewrite the type to use pointer-indirection, like in Rust.

   Which method we use will depend on whether we want to focus in on
   high-level, garbage-collected approach of Haskell, or the
   lower-level, borrow-checked approach of Rust. I'm leaning towards
   the latter.

   Example of Rust error message:
   #+BEGIN_EXAMPLE
     |
   1 | struct Foo(Foo);
     | ^^^^^^^^^^^---^^
     | |          |
     | |          recursive without indirection
     | recursive type has infinite size
     |
     = help: insert indirection (e.g., a `Box`, `Rc`, or `&`) at some
             point to make `Foo` representable
   #+END_EXAMPLE

*** Completed
    Decided to go with explicit ~Box~ type and ~box~/~deref~ syntax, at
    least for now. ~Box~ is a heap-allocated pointer which currently
    leaks, but should probably be GCd. A datatype definition recursive
    without ~Box~ indirection will now throw an error.

** TODO Document in reference
** NEXT Uninhabited types
   Definition and pattern-matching of.
* NEXT Typeclasses
** Agda style classes w implicit args
   In Haskell, you can only have a single instance of a specific
   typeclass for a specific type. This doesn't always make
   sense. Consider Semigroup for Int. Both + and * make sense, but we
   can only have one unless we goof around with newtypes etc, and that
   kinda sucks.

   Consider an approach more like agda. That model is more lika basic
   Hindley-Milner + dictionsry passing, except the "typeclass"
   argument can be passed implicitly with the {} syntax! That seems
   really cool.

   I'm not sure how implicit arguments work though. Does the compiler
   just look at all available bindings and pick the first/only
   available variable of that type?

   https://agda.readthedocs.io/en/v2.5.2/language/implicit-arguments.html

   https://agda.readthedocs.io/en/v2.5.2/language/instance-arguments.html

   Or just do it kind of Haskell style, but give the instances names
   and allow multiple, overlapping instances, raisi g an error if the
   instance is ambiguous somehow.

   Problem with instances as implicit arguments:
   https://youtu.be/2EdQFCP5mZ8?t=1259.  We'd have to know exactly
   which instances exist for the same type, and from where they're
   imported and what scoping they'll have. That sucks. Another
   horrible thing: imagine creating a sorted list with one instance, and doing
   a sorted lookup with another (accidentally or not), you could an incorrect
   result with no error from the compiler!

   Maybe an alternative could be to have both ~primary~ and
   ~secondary~ instances, where the primary instances may not overlap
   or be orphaned, like Rust, but may be passed implicitly, while
   secondary instances may overlap and be orphaned, but must be
   "overriden"/passed explicitly.

   But that may also not work. For the following code,

   #+BEGIN_SRC haskell
   foo :: Foo a => a -> a
   foo = bar

   bar :: Foo a => a -> a
   bar = ...
   #+END_SRC

   consider that we call ~foo~ with an explicit secondary
   instance. What instance will ~bar~ be given? If we must pass
   secondary instances explicitly, it seems ~bar~ would get the
   primary instance, and ~foo~ and ~bar~ would be called with
   different instances. BAD!

   Probably last update for this section: [[https://old.reddit.com/r/haskell/comments/765ogm/multiple_type_class_instances_for_the_same_type/][this thread]] has convinced me
   that Haskell-/Rust-style typeclasses is the best idea.

* INACTIVE Linear types
  Linear types would allow predictable performance and behaviour of
  e.g. IO tasks. Force a single manual file-close or
  buffer-flush. Force a single free for malloc.  Affine types would
  allow better performance.  E.g. pure, in-place modification of
  array.  If noone else points to it, value can be consumed and
  modified rather than cloned. Something like: ~fn push(mut v:
  Vec<i32>, x: i32) -> Vec<i32> { v.push(x); v }~ Implemented as maybe
  a wrapper, or an interface?  Maybe like in haskell with lolly
  operator?  [[http://docs.idris-lang.org/en/latest/reference/uniqueness-types.html][Check out idris Uniqueness Types]]

* NEXT Higher kinded types

* INACTIVE Type families / functional dependencies and multi-param classes / Dependent types
  I'm on the fence here, but the consensus seems to be that type
  families are better than fundeps. Also, it might be possible to
  avoid needing to implement Multi-parameter typeclasses if type
  families are available to compensate. Seems that would reduce
  ambiguities and mental overhead a bit.

  Neither type families or fundeps are necessary if we have dependent
  types, but that would likely bring difficulties of it's own.

  Type families in Haskell vs Dependent types in a pseudo-Haskell vs
  Dependent types in Agda:

** Type families, Haskell
   #+BEGIN_SRC haskell
   class Iter c where
       type Item c
       next :: c -> Maybe (Item c, c)

   nextList :: [a] -> Maybe (a, [a])
   nextList = \case
       [] -> Nothing
       a : as -> Just (a, as)

   instance Iter [a] where
       type Item [a] = a
       next = nextList
   #+END_SRC

** Dependent types, pseudo-Haskell
   #+BEGIN_SRC haskell
   class Iter c where
       item :: Type
       next :: c -> Maybe (item, c)

   nextList :: [a] -> Maybe (a, [a])
   nextList = \case
       [] -> Nothing
       a : as -> Just (a, as)

   instance Iter [a] where
       item = a
       next = nextList
   #+END_SRC

** Dependent types, Agda
   #+BEGIN_SRC agda2
   record Iter (C : Set) : Set1 where
     field
       item : Set
       next : C -> Maybe (item × C)

   nextList : {A : Set} -> List A -> Maybe (A × List A)
   nextList [] = nothing
   nextList (x ∷ xs) = just (x , xs)

   listIter : {A : Set} -> Iter (List A)
   listIter {a} = record
     { item = a
     ; next = nextList
     }
   #+END_SRC

* Memory model
  How do we handle the heap? Garbage collection like Haskell?
  Ownership and borrowing like Rust? Something in between?

  Should heap allocations be explicit or implicit? Even if we go with
  a Haskell-like model, should there be an explicit ~Box a~ type?
** NEXT Consider something Rust-like
  I.e. affine/linear types, lifetimes, little/no GC by default.
  Would allow writing real-time applications like games.

  E.g. GHC seems to prefer throughput over latency, so very long
  pauses are possible when you're working with a nontrial amount of
  data. "You're actually doing pretty well to have a 51ms pause time
  with over 200Mb of live data.".

  Lifetimes could fit in with Higher Kinded Types quite
  naturally. Instead of just having the kind ~*~ (aka. ~type~), you'd
  have two kinds: ~type~ and ~lifetime~. You could then have a type
  like ~Ref 'a Int~ where ~Ref~ is a type operator with kind ~lifetime
  -> type -> type~.

  Another option could be to add ways of controlling when GC happens
  so you can reduce spikes of latency. Haskell has ~performGC :: IO
  ()~ that does this. [[https://old.reddit.com/r/haskell/comments/6d891n/has_anyone_noticed_gc_pause_lag_in_haskell/di0vqb0/][Here is a gameboy]] who eliminates spikes at the
  cost of overall performance by calling ~performGC~ every frame.

  [[https://github.com/rust-lang/rfcs/blob/master/text/1598-generic_associated_types.md][Some inspiration here]].

** Garbage collector
   Until we get linear types, and probably even then, we'll need some
   form of GC.

   There are many problems with refcounting: Generated llvm ir/asm gets
   polluted; While performance is more predictable, it's typically
   worse overall; Cycle breaking would either require using weak refs
   where appropriate, which would in turn require user input or an
   advanced implementation, or a periodic cycle breaker, which would be
   costly performance wise. So tracing GC is probably a good idea.

*** NEXT Boehms GC
    Simplest way to get rudimentary, but decently performant, GC.

*** INACTIVE DIY Garbage collector
    A tracing GC would be quite separate from the rest of the
    program. The only pollution would be calls to the allocator (not
    much different from the current sitch w malloc) and
    (de)registrations of local variables in Let forms (a total of two
    function calls per heap allocated variable).

    Implementing a tracing GC would also be a fun challenge, and I'm
    sure it could be fun to try different algorithms etc.

**** How it would work
     Basically, instead of calling =malloc=, the alloc function of the
     GC is called. This function keeps track of either the number of
     calls, the time, or the current sum of allocated space, and
     periodically performs a mark-and-sweep, walking through the object
     graph and marking objects not directly or indirectly referenced by
     a "root" node for sweeping.

     Root nodes are global variables and all local variables visible in
     the current scope. Global variables can be registered in the main
     wrapper, while local variables could be registered right after
     they've been created (in a Let, Match, ...). They would then be
     unregistered right before the function returns (or in the case of
     tail calls, right before the tail call). Registering could happen
     directly in the GC alloc routine.

** Merging affine/linear types and GC
   Best of both worlds? Maybe.

   I don't think I want memory management to be quite as explicit and
   cumbersome as in Rust, especially wrt lifetimes. An alternative
   could be to just add linear types to allow for structures that
   require mutability, like HashMap, but not borrowing. This would not
   enable us to write *the most* performant code, but we'd be able to
   do a lot better than with just GC--games may be quite possible.
* INACTIVE Effect system

* INACTIVE Macros?

* INACTIVE Property system
  I'm thinking of a system where you annotate functions in a source
  file with pre- and postconditions, which can then be checked in
  different modes depending on how much time you've got etc.

  - Proof-mode. Exchaustive checking of conditions. All possible
     inputs are generated, and the system checks that the precondition
     always implies the postcondition.
  - Test-mode. Statistical, random testing. Generate enough inputs
    such that the precondition is fulfilled for a statistically
    significant subset of the complete set of possible inputs.
  - Debug-mode. Functions are not tested ahead of time, instead
     assertions are inserted and checked at runtime.
  - Release-mode. Conditions are completely ignored.

* NEXT Consider using lib for pretty printing
  https://hackage.haskell.org/package/pretty-1.1.1.1

* INACTIVE Hoogle equivalent
  https://wiki.haskell.org/Hoogle

* INACTIVE Web playground
  Like play.rustlang.org

* INACTIVE Language server protocol
  [[https://github.com/Microsoft/language-server-protocol]]
  [[https://internals.rust-lang.org/t/introducing-rust-language-server-source-release/4209]]

* NEXT Reference
  Rust has a [[https://doc.rust-lang.org/reference/][good reference]]. Look at that for inspiration.

** INACTIVE Document syntax

** INACTIVE Document type system

** INACTIVE Document memory model

* NEXT Continuous deployment of webpage
  at [[https://carth.jo.zone/]] or some other place.

* INACTIVE HTML documentation generation
  Like [[https://www.haskell.org/haddock/][haddock]] and [[https://www.haskell.org/haddock/][rustdoc]].

* INACTIVE Documentation checker
  Like a typechecker-pass but for generated documentation. Verify that
  all links are alive, that examples compile and produce the expected
  output, etc.
* NEXT Debug information in LLVM-IR
  You should be able to run a Carth program in GDB and actually be
  able to do stuff, so we need to emit metadata about source-locations
  and stuff in the LLVM-IR. Something like the following, from the rust playground:

  #+BEGIN_EXAMPLE
  ...
  ; playground::foo
  ; Function Attrs: nonlazybind uwtable
  define internal i32 @_ZN10playground3foo17hc5d9d5678570880bE() unnamed_addr #1 !dbg !1258 {
  start:
  ; call std::panicking::begin_panic
    call void @_ZN3std9panicking11begin_panic17hb85d687efeb64e5dE([0 x i8]* noalias nonnull readonly align 1 bitcast (<{ [4 x i8] }>* @13 to [0 x i8]*), i64 4, { [0 x i64], { [0 x i8]*, i64 }, [0 x i32], i32, [0 x i32], i32, [0 x i32] }* noalias readonly align 8 dereferenceable(24) bitcast (<{ i8*, [16 x i8] }>* @12 to { [0 x i64], { [0 x i8]*, i64 }, [0 x i32], i32, [0 x i32], i32, [0 x i32] }*)), !dbg !1264
    unreachable, !dbg !1264
  }
  ; playground::main
  ; Function Attrs: nonlazybind uwtable
  define internal void @_ZN10playground4main17h7395a4a007d16efeE() unnamed_addr #1 !dbg !1265 {
  start:
  ; call playground::foo
    %0 = call i32 @_ZN10playground3foo17hc5d9d5678570880bE(), !dbg !1266
    br label %bb1, !dbg !1266

    bb1:                                              ; preds = %start
    ret void, !dbg !1267
  }
  ...
  !1266 = !DILocation(line: 6, column: 4, scope: !1265)
  #+END_EXAMPLE
* INACTIVE Guarantee no stack overflow for tail recursion
  We should guarantee that directly (and indirectly?) recursive
  function call should not cause the stack usage to grow
  indefinitely. Tail call elimination or trampolining should take
  place. Will need to look into what LLVM can do, and what's possible
  on different platforms. Hopefully we won't have to resort to
  trampolining--that seems slow.
* INACTIVE Prefer somewhat big / wide stdlib
  Small / bad standard library + good package manager => npm / cargo
  situation, where everything has sooo many dependencies. Having a dep
  is not bad per say, but when the numbers completely blow up, like in
  rust- and javascript-land, things can get messy. The best way to
  avoid this, I think, is having a standard library that has you
  covered for most common things.

  Examples of libraries in other ecosystems that should be part of the
  stdlib: `is-even` in JavaScript, `composition` in Haskell, `rand` in
  Rust.

  Go seems to have done this relatively well. Their stdlib has
  everything from JPEG codec, to a webserver. The stdlib shouldn't
  have everything though, as that will add a bunch of legacy cruft
  over time, like in Java. Would not be as much of a problem if we're
  not afraid of releasing new major versions removing deprecated
  stuff.
* INACTIVE Boxing to allow for dynamic linking
  Boxing vs monomorphization. Boxing results in smaller binary and
  dynamically-linkable interface,bot results in slower code.

  Maybe monomorphize all package-internal code, and box all
  public-facing functions?