#llama2.ha

This is a port of Andrej Karpathy's llama2.c project to the Hare programming language. It has basic feature parity with the original project; advanced features like parallelization and quantization are not implemented.

#Build

  1. Install Hare: https://harelang.org/installation/
  2. Use the included Makefile to build the llama2ha binary:
     cd llama2.ha
     make
  3. Download a sample checkpoint like stories15M.bin or stories110M.bin:
     wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
     wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
  4. Run llama2ha:
     ./llama2ha stories15M.bin -i "The travelers reached a fork in the road"

#Usage

Command-line flags are virtually unchanged from llama2.c.

Example: llama2ha model.bin -n 256 -i "There once was a magical hare"

Usage: ./llama2ha [-h]
         [-t <temperature>]
         [-p <top-p>]
         [-s <seed>]
         [-n <steps>]
         [-i <prompt>]
         [-z <tokenizer>]
         <checkpoint_file>

-h: print this help text
-t <temperature>: temperature in [0,inf] (default 1.0) : float
-p <top-p>: p value for top-p (nucleus) sampling in [0,1] (default 0.9) : float
-s <seed>: random seed (defaults to current unix time) : int
-n <steps>: number of steps to run for (default 256, 0 for max_seq_len) : int
-i <prompt>: input prompt : string
-z <tokenizer>: path to custom tokenizer (optional) : string
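
As in llama2.c, -t divides the logits by the temperature before the softmax (values below 1.0 sharpen the distribution, values above 1.0 flatten it, and -t 0 falls back to greedy argmax), while -p truncates sampling to the smallest set of tokens whose cumulative probability exceeds p. The Hare sketch below is illustrative only, not code from this repository: the function name, f64 type, and example logits are made up, and the top-p truncation step is omitted.

    use fmt;
    use math;

    // Softmax with temperature, in place. Dividing the logits by the
    // temperature before exponentiating sharpens (temp < 1.0) or
    // flattens (temp > 1.0) the distribution. Assumes temp > 0; with
    // -t 0, llama2.c-style samplers take the argmax instead.
    fn softmax_temp(logits: []f64, temp: f64) void = {
        // subtract the max logit for numerical stability
        let max = logits[0];
        for (let i = 1z; i < len(logits); i += 1) {
            if (logits[i] > max) {
                max = logits[i];
            };
        };
        let sum = 0.0;
        for (let i = 0z; i < len(logits); i += 1) {
            logits[i] = math::expf64((logits[i] - max) / temp);
            sum += logits[i];
        };
        for (let i = 0z; i < len(logits); i += 1) {
            logits[i] /= sum;
        };
    };

    export fn main() void = {
        // hypothetical logits for a 4-token vocabulary
        let logits: [4]f64 = [2.0, 1.0, 0.5, 0.1];
        softmax_temp(logits[..], 0.8);
        for (let i = 0z; i < len(logits); i += 1) {
            fmt::printfln("token {}: p = {}", i, logits[i])!;
        };
    };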

#Performance

Test System: Ryzen 5950X, 64GB RAM

Test Flags: -i "There once was a magical hare" -s 1 -t 0

| Checkpoint      | llama2.ha  | llama2.ha (unsafe pointers) | llama2.c    |
| --------------- | ---------- | --------------------------- | ----------- |
| stories15M.bin  | 79.1 tok/s | 94.7 tok/s                  | 117.2 tok/s |
| stories110M.bin | 11.0 tok/s | 13.4 tok/s                  | 14.9 tok/s  |

Notes:

  • The default version of llama2.ha uses idiomatic Hare slices to access checkpoint weights, run state, and other buffers in memory.

  • The unsafe-pointers version of llama2.ha uses C-style pointers, which skip the bounds checking the language performs on slice accesses (see the sketch below). The code for this version can be found on the unsafe-pointers branch.
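
The sketch below is illustrative only, not code from either branch: it sums the same array twice, once through a bounds-checked slice and once through a raw pointer whose address is computed by hand, which is roughly the trade the unsafe-pointers branch makes for weight accesses.

    use fmt;

    export fn main() void = {
        // a stand-in for a row of checkpoint weights
        let w: [4]f32 = [1.0, 2.0, 3.0, 4.0];

        // idiomatic slice access: every s[i] is bounds-checked
        let s: []f32 = w[..];
        let sum: f32 = 0.0;
        for (let i = 0z; i < len(s); i += 1) {
            sum += s[i];
        };

        // C-style pointer access: compute the address manually and
        // dereference it, skipping the bounds check entirely
        let base = &w[0]: uintptr;
        let sum2: f32 = 0.0;
        for (let i = 0z; i < len(w); i += 1) {
            let addr = base + (i: uintptr) * (size(f32): uintptr);
            sum2 += *(addr: *f32);
        };

        fmt::printfln("slice sum = {}, pointer sum = {}", sum, sum2)!;
    };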

#Screenshot

(Screenshot: llama2.ha running in the terminal)

#License

AGPL