~quf/computers-are-fast-2020

they are.
Draw the rest of the fine owl.

https://git.sr.ht/~quf/computers-are-fast-2020

#Computers are fast.

Or: C is my favourite scripting language

The goal of this project was to solve all parts of Advent of Code 2020:

  • With no dependencies other than a C99 compiler and libc,^(1,2)
  • where every solution compiles and runs in 1 s or less,^3
  • where the total compile and run time does not exceed 10 s,^3
  • while running single-threaded, and
  • never exceeding 256 MiB memory use.^(3,4)

^1 and, optionally, make.

^2 Some programs contain commonly held assumptions, like "uint32_t is available", or "exit code 1 signals failure", or "the stack doesn't overflow". The programs may cause undefined behaviour if the input is not an unmodified one from AoC 2020. If undefined behaviour (such as signed integer overflow) does occur for an unmodified AoC 2020 input, that's a bug and I'd like to know about it.

^3 on the author's ~8-year-old laptop with an i5-3320M CPU.

^4 virtual memory.

#Results

The project goal was almost, but not quite, achieved.

Every day runs in less than one second, and the total runtime is less than two seconds. All but three of the programs compile and run in less time than it takes Python just to start up and do literally nothing (about 35 ms). The memory limit was no problem either: only day 5 comes somewhat close to the limit, needing 120 MiB.

I was only able to get the runtime for day 15 below one second consistently by implementing the hot loop in inline assembly. The program is still portable: the inline assembly is only used by TCC on amd64, and all other compilers fall back to plain C. The plain C version, however, is only fast enough when compiled with heavy optimization options, which take too long at compile time.

The final compile/runtimes (on my laptop, best of ten) are as follows:

Day 01: 2.73 ms
Day 02: 2.88 ms
Day 03: 3.49 ms
Day 04: 4.69 ms
Day 05: 2.84 ms
Day 06: 2.98 ms
Day 07: 5.03 ms
Day 08: 4.43 ms
Day 09: 3.02 ms
Day 10: 3.06 ms
Day 11: 31.9 ms
Day 12: 3.2 ms
Day 13: 3.18 ms
Day 14: 19.9 ms
Day 15: 906 ms
Day 16: 4.99 ms
Day 17: 36.7 ms
Day 18: 3.54 ms
Day 19: 4.68 ms
Day 20: 20 ms
Day 21: 4.86 ms
Day 22: 11.4 ms
Day 23: 590 ms
Day 24: 38.8 ms
Day 25: 13.3 ms
Total: 1.91 s

Day 15 is responsible for 47% of the runtime, and day 23 is responsible for another 31%. The remaining 23 days only take 240 ms to run.

Note that the total runtime is the smallest of ten complete runs, not the sum of individual best times. The latter is about 200 ms shorter than the former.

#How to run

First, copy or link the input files into the subfolder inputs/, with names 01, 02, …, 25. Then, if you're on a POSIX-y system and have tcc installed, the following will compile and run all programs and measure the completion time:

$ tcc -run run-all.c

Alternatively, if you have a C compiler and make installed, run:

$ time make

This will be much slower. Some compilers (e.g. clang on FreeBSD) may refuse to compile some programs because the makefile invokes the compiler with very strict error conditions. These errors are probably harmless suggestions and invoking make with the environment variable CFLAGS=-Wno-error should produce working executables.

To run a program individually, run:

$ tcc -run src/01.c < inputs/01

Or:

$ make run-01

To run each program repeatedly and print the best time, run:

$ tcc -Drepeat=10 -run time.c

If you only have make, you will need to make do with this (again, it will be much slower):

$ time make clean run-01

#Track/limit allocations

Track allocations with:

$ make bin/01 && valgrind bin/01 < inputs/01

Enforce the memory limit with (bash/fish shell on Linux):

$ ulimit -v 262144

#Day 1

Read every input number; create a lookup table of numbers (entry 0 if not present, 1 if present). For every input number x (part 2: pair of numbers), check if the table at 2020-x is set.

#Day 2

Who needs regular expressions when you have scanf?

#Day 3

Modular arithmetic.

#Day 4

Read and validate input.

#Day 5

Seat specs are binary expansions of seat IDs.

The input is implicitly sorted by setting indices in a bool table.

#Day 6

Set union and intersection. Sets are represented by 26 bits of a uint32_t.

#Day 7

The problem itself is quite simple. Reading the input is a massive pain:

  • Read the input line by line.

  • Tokenize the input (split into words) and parse using a big switch that implements the following state machine:

finite state machine flowchart, see parsing-07.dot

To simplify memory management and speed up comparisons, each two-word color combination is assigned a number. All colors have the form "modifier base", with 17 different modifiers and 33 different base colors. The modifier can be mapped to a 5 bit number, and the base can be mapped to a 6 bit number. The bitwise concatenation of these numbers is the color id.

All rules can be stored in a statically allocated array at the index of the outer bag id.

Part 1:

  • While reading, generate a graph with flipped connections (transpose graph): For each inner bag, connect to the outer bags which need to contain that inner bag.

  • Starting at the ID of "shiny gold", walk through the graph with inverted connections and count the nodes encountered.

Part 2:

  • While reading, generate a graph with the straightforward connections: For each outer bag, connect to the required bags inside it, and tag with the number.

  • Starting at the ID of "shiny gold", recursively compute the number of contained bags by walking through graph.

In neither case are partial results memoized. Doing this may or may not reduce the runtime.

#Day 8

Part 1: Implement the VM, stop once the value of the instruction pointer repeats.

Part 2 is done with brute force: Change instructions one by one, check if the program terminates.

#Day 9

Part 1: Brute force.

Part 2: Sliding window sum.

#Day 10

Read the adapter joltages into a bool table; this implicitly sorts them.

Part 1: Count the jolt differences.

Part 2: Dynamic programming, starting with the largest joltage, for which we know there is exactly one combination.

#Day 11

There doesn't seem to be a way to shortcut the calculation, so just do it. Optimizations:

  • Use linear indices instead of Cartesian indices.

  • Precompute the neighbours of each seat instead of computing them on every update.

  • Avoid branches by computing all occupied neighbours before checking the update condition, and unrolling loops.

#Day 12

Basic analytic geometry.

#Day 13

In part 1, the time until the bus with period t arrives after timestamp t_0 is t-(t_0%t).

In part 2, the solution time t satisfies a system of linear congruences like this:

(t + 0) ≡ 0 (mod bus_1_period)
(t + 1) ≡ 0 (mod bus_2_period)

It is solved with the Chinese Remainder Theorem. When computed naively, interim values can become very large (~80 bits), so some care is taken to avoid overflow.

#Day 14

Memory is saved in a hashmap.

A previous version used a more elegant sparse binary tree, but this was slower by a factor of > 2.

#Day 15

The sequence is a variant of the Van Eck sequence, about which apparently not much is known. There doesn't appear to be any way to get the answer other than the obvious one.

Caveat: I was unable to optimize the program to run in less than one second with just plain C99. After rewriting the hot loop in amd64 asm, it does run in less than one second on my laptop. Only TCC under amd64 makes use of this; all other compilers will use the slightly slower generic C implementation.

#Day 16

Parsing is the most difficult part again.

Part 2 is solved with a greedy algorithm that looks for a rule that can only fit one field, removes that field from the possibilities of all other rules, and repeats. This works because the assignment is unique.

#Day 17

These are the rules of Conway's Game of Life.

Cell status is saved in an array of appropriate size. For performance, the sum over all neighbours is computed branchlessly, with the loop unrolled. A progressively larger size of the grid is looped over on every iteration as the size in each dimension can grow at most by one per step.

This problem seems to benefit very much from compiler optimizations. gcc -O1 produces an executable that is about 4 to 5 times as fast as one produced by gcc -O0 (or tcc). Perhaps this is a cue that my approach can be improved, but I don't see how.

Alternatively, a hashtable might be beneficial for performance since the number of active cells is a very small proportion of the grid. I can't be bothered to try that though.

#Day 18

Expressions are transformed and evaluated using a variant of the shunting yard algorithm with an inlined evaluator.

#Day 19

The given rules describe a context-free grammar. In part one, it is even finite, as the rules are acyclic.

A general approach for part two would require a general CFG recognizer, such as the Earley or CYK algorithms. These are too slow for our purposes, so we have to take advantage of the structure of the input. It appears that all inputs have the following properties:

  • Rule 0, the starting production, is 0: 8 11; rule 8 is 8: 42; rule 11 is 11: 42 31

  • Rules 42 and 31, which are the only ones referred to by 8 and 11 (other than themselves) directly, produce a finite (and rather small) language.

The latter property allows for the generation of a finite list of all strings produced by rules 42 and 31; let S_42 and S_31 denote these two sets.

In part 1, rule 0 produces the language S_42^2 S_31. In part 2, rule 0 produces the language ∪_{n,m} S_42^(n+m) S_31^n.

Computing S_31 and S_42, then checking this property is reasonably straightforward.

For further optimization, note that the alphabet is binary, and all strings in S_31 and S_42 have length 8. Moreover, S_31 and S_42 both contain 128 strings and are disjoint. As a result, each string in S_42 and S_31 can be represented by one byte; valid strings can be compressed cleanly into bytestrings by replacing a with a zero bit and b with a one bit. Indeed, all messages in the input can be, since they are all a multiple of eight characters long.

#Day 20

This is the absolute worst.

  • Read the tiles into a list. Convert image data into binary. Remove the edge of the tile data. For faster matching, turn each edge into a 10 bit number (identifying '#' with 1 and '.' with 0), in both clockwise and counterclockwise direction.

  • Find a tile for which exactly two edges (in counterclockwise direction only) match at least one other tile. (Luckily, such a tile exists.) This is a corner tile; put it in a grid at position (0, 0). Orient it such that the two edges which have a match point down and right (towards growing array indices). Save the orientation.

  • Fill the first row: Find a match for the right edge of the previous tile and reorient it to match.

  • Fill all other rows: For each position, find a match for the bottom edge of the tile above.

This procedure ultimately matches all tiles because each match is unique (modulo flipping or rotating the whole grid of tiles together).

Now we can solve part one by multiplying the values of the corner tiles.

  • For part 2, fix the orientation of the data of every tile; then join all tile data together in a big array.

  • Then convolve the complete picture with the monster image and find points where the result is fifteen (the number of set bits in the "monster kernel"). The convolution is done naively, because I will not implement FFT for a puzzle that's already way too finicky to be fun just to potentially make it slightly faster. Any such point is a monster; remove the corresponding bits.

  • Repeat the above with different monster orientations until at least one match is found.

  • Count the remaining set image bits.

There's probably some untapped optimization potential and the code is not as general as one might hope, but I can hardly express how done I am with it, especially since it already only takes a rather small proportion of the total runtime.

#Day 21

This one is conceptually the same as day 16. It's more difficult because the numbers are bigger.

#Day 22

Sub-games can be skipped if player 1 holds the highest card (in the decks of the sub-game) for a significant speedup. Otherwise, just do the thing. Previous deck arrangements are saved in a hashset.

#Day 23

The cups are stored in an array. The array index is the cup value (array[0] is the cup labelled 1, array[1] is the cup labelled 2, etc.); the array stores the index of the cup clockwise from it. This allows for constant-time removal and insertion of cups, tracking the next cup, and locating the cup labelled 1.

#Day 24

Tiles are identified by their coefficients with respect to a Bravais basis:

a hexagonal lattice with few points and coordinates marked: the center coordinate is (0, 0); east is (1, 0); north-east is (0, 1); north-west is (1, -1), etc.

Points are saved in a dense array.

Possible optimizations include: Using a hashmap; precomputing neighbours.

#Day 25

The problem is to compute the discrete logarithm in a prime number field. It is solved with the baby step giant step algorithm.