This is a port of Andrej Karpathy's llama2.c project, rewritten in Hare. It has basic feature parity with the original project; advanced features such as parallelization and quantization are not implemented.
Use the Makefile to build the llama2ha binary:

```
cd llama2.ha
make
```
Download the stories15M.bin or stories110M.bin model checkpoint:

```
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
```
Run llama2ha:

```
./llama2ha stories15M.bin -i "The travelers reached a fork in the road"
```
Command line flags remain virtually unchanged from llama2.c. Example:

```
llama2ha model.bin -n 256 -i "There once was a magical hare"
```
```
Usage: ./llama2ha [-h] [-t <temperature>] [-p <top-p>] [-s <seed>]
                  [-n <steps>] [-i <prompt>] [-z <tokenizer>] <checkpoint_file>

  -h: print this help text
  -t <temperature>: temperature in [0,inf] (default 1.0) : float
  -p <top-p>: p value for top-p (nucleus) sampling in [0,1] (default 0.9) : float
  -s <seed>: random seed (defaults to current unix time) : int
  -n <steps>: number of steps to run for (default 256, 0 for max_seq_len) : int
  -i <prompt>: input prompt : string
  -z <tokenizer>: path to custom tokenizer (optional) : string
```
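For example, to sample from the larger checkpoint with a lower temperature, nucleus sampling, and a fixed seed (the flag values below are illustrative, not recommendations):

```
./llama2ha stories110M.bin -t 0.8 -p 0.9 -s 42 -n 256 -i "There once was a magical hare"
```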
Test System: Ryzen 5950X, 64GB RAM

Test Flags: `-i "There once was a magical hare" -s 1 -t 0`
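For reference, the benchmark runs correspond to invocations along these lines (assuming the checkpoints were downloaded as above):

```
./llama2ha stories15M.bin -i "There once was a magical hare" -s 1 -t 0
./llama2ha stories110M.bin -i "There once was a magical hare" -s 1 -t 0
```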
| Checkpoint | llama2.ha | llama2.ha (unsafe pointers) | llama2.c |
|---|---|---|---|
| stories15M.bin | 79.1 tok/s | 94.7 tok/s | 117.2 tok/s |
| stories110M.bin | 11.0 tok/s | 13.4 tok/s | 14.9 tok/s |
Notes:

- The default version of llama2.ha uses idiomatic Hare slices for accessing things like checkpoint weights and run state in memory.
- The unsafe pointers version of llama2.ha uses C-style pointers, skipping the language-provided bounds checking that slices perform. The code for this version can be found on the unsafe-pointers branch.
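As a rough sketch of the difference (hypothetical code, not taken from either branch of this repository; the dot-product functions and names are made up for illustration), slice indexing in Hare is bounds-checked at runtime, while indexing through a pointer to an unbounded array is not:

```hare
use fmt;

// Idiomatic version: slice indexing is bounds-checked by the language.
fn dot_slices(a: []f32, b: []f32) f32 = {
	let sum: f32 = 0.0;
	for (let i = 0z; i < len(a); i += 1) {
		sum += a[i] * b[i];
	};
	return sum;
};

// "Unsafe pointers" style: cast the slices to pointers to unbounded
// arrays, so indexing skips the bounds checks; the caller is
// responsible for passing a valid length.
fn dot_unsafe(a: []f32, b: []f32, n: size) f32 = {
	let pa = a: *[*]f32;
	let pb = b: *[*]f32;
	let sum: f32 = 0.0;
	for (let i = 0z; i < n; i += 1) {
		sum += pa[i] * pb[i];
	};
	return sum;
};

export fn main() void = {
	let a: [_]f32 = [1.0, 2.0, 3.0];
	let b: [_]f32 = [4.0, 5.0, 6.0];
	fmt::printfln("checked:   {}", dot_slices(a[..], b[..]))!;
	fmt::printfln("unchecked: {}", dot_unsafe(a[..], b[..], len(a)))!;
};
```

Skipping the per-access bounds check is where the speedup in the table above comes from, at the cost of memory safety.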
License: AGPL