~jojo/Carth

ref: 12cf2a697999acc0ccc7419e498f96349a0a3267 Carth/src/Gen.hs -rw-r--r-- 45.2 KiB
Remove DWARF debugging metadata in LLVM output & make funcs External

Multiple reasons for removing the debugging symbols. 1) The
implementation was only partial and buggy. Line-by-line stepping
didn't work well and was quite jumpy. 2) I'm not sure the
line-by-line stepping even *could* have worked well, due to the
expression-oriented nature of the lang. 3) Having to remember source
positions this long is cumbersome. 4) Less code for any reason =
good! 5) Even without the metadata, as long as the functions are
External (is this really required? I think so, but it may be possible
to work around) the function names will be visible in gdb, and that's
really all that I need / can use that works well.
Don't generate simply numbered locals in LLVM

I.e., no more UnName 8 => "%8". Instead Name "tmp_8" => "%tmp_8". This
fixes the issue that LLVM expects all simply numbered locals to be
defined in order and without gaps. Before this change, that meant our
.dbg.gen.ll wouldn't compile, and being able to compile it could have
been useful.
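
A minimal sketch of the renaming described above. The `Name` type here is a
hypothetical stand-in for the name type of the LLVM bindings, and `tmpName`
and `renderLocal` are illustrative helpers, not the actual code in Gen.hs.

```haskell
-- Hypothetical stand-in for the two kinds of LLVM local names:
-- explicitly named ("%foo") and simply numbered ("%8").
data Name
    = Name String
    | UnName Word
    deriving (Show, Eq)

-- Turn a fresh counter value into an explicitly named temporary.
-- Named locals are not subject to the "in order, no gaps" rule
-- that applies to simply numbered ones.
tmpName :: Word -> Name
tmpName n = Name ("tmp_" ++ show n)

-- Render a name the way it appears as a local in textual LLVM IR.
renderLocal :: Name -> String
renderLocal (Name s) = "%" ++ s
renderLocal (UnName n) = "%" ++ show n
```

With this, `renderLocal (tmpName 8)` gives "%tmp_8" where `renderLocal (UnName 8)`
would have given "%8".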
Fix LLVM function visibility issue

Tried to compile the generated .ll. Clang complained when a function
had both Private linkage and Hidden visibility. Just setting visibility
to Default fixes it. Don't think this is a problem, but I guess we will
learn.
Duplicate AST defs from Monomorphic to Low

Also had to add a bunch of basically no-op translations in Lower.hs
just to fix the types. Atm it's pointless, but it will serve as a base
to work on as the Low AST drifts from the Monomorphic AST.
rename Optimize{,d} -> Low{er,}
Update stackage release & use default-extensions in cabal file

Also, fix some minor breakages caused by ghc update, fix the
literate.org example, fix some new warnings, and get rid of the need
for a bunch of Data implementations by using basic parsing functions
in SystemSpec.hs.
Codegen: Don't put every var on stack. Keep small things in regs

Some things, like integers, seldom need to be on the stack, so
`alloca`ing space for them and needing to load them when it's time to
e.g. add them together is wasteful. Instead, use `passByRef` as a
heuristic to not needlessly put values on the stack that will probably
never need to be used as such. On the other hand, it's still good to
keep larger structs on the stack. As the LLVM docs say, load/store on
individual struct members is to be preferred over
insertvalue/extractvalue for performance reasons.
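
A rough sketch of the heuristic. The `Type` representation, the byte sizes,
and the two-machine-word threshold are all assumptions made for illustration;
the real `passByRef` in the compiler may be defined differently.

```haskell
-- Hypothetical, simplified type representation.
data Type
    = TInt            -- 64-bit integer
    | TBool
    | TPtr
    | TStruct [Type]
    deriving (Show, Eq)

-- Rough size in bytes, ignoring padding and alignment (an assumption).
sizeOf :: Type -> Int
sizeOf TInt = 8
sizeOf TBool = 1
sizeOf TPtr = 8
sizeOf (TStruct ts) = sum (map sizeOf ts)

-- Assumed threshold: anything larger than two 8-byte words counts as
-- "large" and would be passed by reference.
passByRef :: Type -> Bool
passByRef t = sizeOf t > 2 * 8

data Storage = InRegister | OnStack
    deriving (Show, Eq)

-- Reuse the pass-by-reference heuristic to decide where a variable lives.
storageFor :: Type -> Storage
storageFor t = if passByRef t then OnStack else InRegister
```

Under these assumptions an integer stays in a register, while a struct of
three integers gets an `alloca`'d slot on the stack.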
Codegen: Do more at Val & ptr level everywhere. getelementptr etc.

The LLVM frontend performance tips (linked below) say to "avoid loads
and stores of large aggregate type" to improve performance. This seems
like solid advice, so this commit changes a bunch of different stuff in
the codegen to keep structs behind pointers for as long as possible and
use getelementptr instead of extractvalue/insertvalue. Particularly in
the pattern matching stuff, change the selection stuff to operate on
Val instead of Operand, using new convenience functions like
genIndexStruct which does extractvalue if the Val is a local, and
getelementptr if it's a stack var.
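
The dispatch in genIndexStruct could look roughly like this. `Operand`, `Val`,
and `Instr` are simplified, hypothetical stand-ins, and this version returns
the chosen instruction as data rather than emitting it in a codegen monad.

```haskell
-- Hypothetical stand-in: an operand is just its textual LLVM name here.
type Operand = String

-- A value is either held directly in a local, or lives behind a
-- pointer to a stack slot.
data Val
    = VLocal Operand
    | VVar Operand
    deriving (Show, Eq)

-- Simplified instruction type covering only the two cases of interest.
data Instr
    = ExtractValue Operand Int
    | GetElementPtr Operand [Int]
    deriving (Show, Eq)

-- Pick the member access that fits how the struct is stored.
-- For the pointer case, the leading 0 index steps through the pointer
-- itself and the second index selects the struct member.
genIndexStruct :: Int -> Val -> Instr
genIndexStruct i (VLocal x) = ExtractValue x i
genIndexStruct i (VVar p) = GetElementPtr p [0, i]
```

The point of the split is that a struct on the stack is never loaded as a
whole just to read one member.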

Performance improvement is not obviously noticeable for my small
programs, but it seems less code is generated overall. The .dbg.ll of
Fizzbuzz decreased from roughly 1700 to 1500 lines. That's at least a
10% reduction.

https://llvm.org/docs/Frontend/PerformanceTips.html#avoid-loads-and-stores-of-large-aggregate-type
Use private linkage for internal items

Haven't noticed much difference, but should in theory allow the
compiler to optimize more, like inlining and stuff.
Explicitly mark extern calls with notail

Not sure if it actually makes a difference, but it probably doesn't
hurt to be explicit.
Include macro expansion trace in SrcPos. Better err msgs!

    POS1: Error:
      CODE
    MESSAGE

    POS2: Note:
      CODE
    In expansion of macro.
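
A small sketch of how such a message could be rendered. The `SrcPos` layout
(file, line, column, optional position of the enclosing macro invocation) and
the `lookupCode` parameter are assumptions for illustration, not the actual
definitions in the compiler.

```haskell
-- Hypothetical source position: file, line, column, and optionally the
-- position of the macro invocation this code was expanded from.
data SrcPos = SrcPos FilePath Int Int (Maybe SrcPos)

showPos :: SrcPos -> String
showPos (SrcPos f l c _) = f ++ ":" ++ show l ++ ":" ++ show c

-- Render an error at a position, followed by one note per level of
-- macro expansion. lookupCode is a caller-supplied function that
-- returns the source line at a given position.
renderError :: (SrcPos -> String) -> SrcPos -> String -> String
renderError lookupCode pos msg =
    unlines
        ([showPos pos ++ ": Error:", "  " ++ lookupCode pos, msg] ++ notes pos)
  where
    notes (SrcPos _ _ _ Nothing) = []
    notes (SrcPos _ _ _ (Just outer)) =
        [ ""
        , showPos outer ++ ": Note:"
        , "  " ++ lookupCode outer
        , "In expansion of macro."
        ]
            ++ notes outer
```

Nested macro expansions simply produce one Note block per level, from the
innermost invocation outwards.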
Fix workaround for JIT not detect. glob vars by calling GC_add_roots

Boehm GC is smart, and normally detects global variables as roots by
scanning the static segment of the ELF executable. However, when
JITting with LLVM, there is no such segment, so global variables are
not detected as roots. The previous workaround was to treat global
non-function
vars basically like locals, and capture them in closures etc. This
guaranteed that they were always available on the stack, in a
register, or on the heap via another root. This was wasteful though,
and many closures that could've had empty envs had them filled with
crud.

New method simply calls GC_add_roots after initializing each global
var in init.
Make main IO Unit, i.e. (Fun RealWorld [Unit RealWorld])
Update TODO
Repr Unit as [0xi8]. Alt to dceded5 for fixing TCO for Unit ret funs

This method feels less edge-case-y. Effect is the same.

As a refresher, the problem was that for tail recursive functions with
return type Unit, LLVM would optimize
  %x = tail call {} foo()
  ret {} %x
to
  %x = tail call {} foo()
  ret {} zeroinitializer
However, this "optimization" would prevent a later tail-call
optimization from happening, as TCO requires us to return either void,
or the local holding the return value of the last call. zeroinitializer
may have the same value as %x, but the optimizer doesn't recognize it as
fulfilling the conditions, so TCO didn't happen. By representing Unit
as an empty array instead, this problem doesn't happen, because %x is
simply not replaced with zeroinitializer when the type is an array
type. Don't know why this is the case, but it works out.
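
For illustration, a tiny sketch that renders the call/return pair with the
new representation. The `Type` constructors and helper names are hypothetical;
only the `[0 x i8]` spelling of the empty array type is standard LLVM syntax.

```haskell
-- Hypothetical, minimal set of source-level types.
data Type = TUnit | TInt
    deriving (Show, Eq)

-- Unit is rendered as an empty array of bytes instead of the empty
-- struct {}.
renderType :: Type -> String
renderType TUnit = "[0 x i8]"
renderType TInt = "i64"

-- Render a tail call to a nullary function followed by returning its
-- result, as lines of textual LLVM IR.
renderTailCallRet :: Type -> String -> [String]
renderTailCallRet t f =
    [ "%x = tail call " ++ ty ++ " @" ++ f ++ "()"
    , "ret " ++ ty ++ " %x"
    ]
  where
    ty = renderType t
```

For a Unit-returning `foo` this yields a `ret [0 x i8] %x` that still refers
to the call result, which is the shape TCO needs.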
Elim need for mods LLCompunit, LLSubprog with DuplicateRecordFields

They only existed to reexport a subset of LLOp in order to avoid name
collisions in record fields. The language extension
DuplicateRecordFields eliminates the need for this, by letting GHC
accept and disambiguate duplicate field names in different records in
the same module.
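
A minimal example of the extension at work. The two records and their fields
are made up for illustration and are not the actual LLOp metadata records.

```haskell
{-# LANGUAGE DuplicateRecordFields #-}

-- Two hypothetical records in one module that share the field "name".
-- Without DuplicateRecordFields, GHC rejects this as a duplicate
-- declaration.
data CompileUnit = CompileUnit { name :: String, file :: String }
data Subprogram = Subprogram { name :: String, line :: Int }

-- In patterns and in record construction the constructor determines
-- which "name" is meant, so no qualification is needed.
describeUnit :: CompileUnit -> String
describeUnit CompileUnit { name = n, file = f } = n ++ " (" ++ f ++ ")"

describeSubprogram :: Subprogram -> String
describeSubprogram Subprogram { name = n, line = l } = n ++ ":" ++ show l
```

The sketch sticks to construction and pattern matching on purpose: using a
duplicated field as a plain selector function can be ambiguous, depending on
the GHC version.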
Fix builtin virtual calls being lambda wrapped even when saturated

So e.g. `+` was applied like ((+-wrapper a) b) instead of
like (+-operator [a b]), kind of.

This fix together with the previous commit reduces the amount of
closure-juggling logic going on, and improves performance quite a bit
in many cases. As an example, the runtime for (ackermann 4 1) went
from something like 260s to around 50s. Still a lot of overhead
though, compared to the C version which took like 2s...
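
The idea of the fix, sketched on a toy AST. `Expr`, `Call`, the arity table,
and `lowerCall` are all hypothetical names; the real codegen works on the
compiler's own AST and emits LLVM rather than a `Call` value.

```haskell
-- Hypothetical mini-AST with curried application.
data Expr
    = Var String
    | App Expr Expr
    deriving (Show, Eq)

data Call
    = Direct String [Expr]     -- call the builtin with all args at once
    | ViaWrapper Expr [Expr]   -- go through the curried lambda wrapper
    deriving (Show, Eq)

-- Assumed arity table for builtins.
builtinArity :: String -> Maybe Int
builtinArity "+" = Just 2
builtinArity "-" = Just 2
builtinArity _ = Nothing

-- Flatten nested applications ((f a) b) into (f, [a, b]).
unapply :: Expr -> (Expr, [Expr])
unapply = go []
  where
    go args (App f a) = go (a : args) f
    go args f = (f, args)

-- A builtin applied to exactly as many args as its arity is saturated
-- and can be called directly. Everything else still goes via a wrapper.
lowerCall :: Expr -> Call
lowerCall e = case unapply e of
    (Var f, args)
        | builtinArity f == Just (length args) -> Direct f args
    (f, args) -> ViaWrapper f args
```

So `((+ a) b)` lowers to `Direct "+" [a, b]`, while the partial application
`(+ a)` keeps using the wrapper.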
Fix typo in GenErr variant (NoBulitin... -> NoBuiltin...)
Add skeleton module Optimize between Monomorphize & Compile
Make (i8*) the generic ptr type instead of ({}*)

I just think it will play nicer with LLVM. Everything gets messy when
zero-sized types are involved.