#An Exhaustive Test Program for the Legendary Höhrmann UTF-8 Decoder

#Intro

It is 2023 and I need a UTF-8 / JSON transcoder for vmod_j, a JSON formatter for varnish-cache.

For a transcoder, I need a decoder.

A naive implementation is easily written from the (excellent!) UTF-8 Wikipedia page, but both the article and the implementation experience make one thing clear: a UTF-8 decoder is not a trivial endeavor. Some 30-plus years after Ken Thompson's proposal for the basic encoding, lots of pitfalls are lurking and many questions still seem to be open.

The standard may say one thing, but for practical reasons, violating it may be the better option. Take the example of lossless / PEP-383 "surrogateescape" error reporting.

I had no idea...

So after reading the source code of various implementations (e.g. pcre2, glib, FreeBSD), I came across one implementation which is, apparently, pretty legendary - and seeing the code makes it immediately apparent why: the Höhrmann decoder, implemented as a deterministic finite state machine, needs only a handful of lines of C code and 364 bytes for a combined character-class and state-transition table.

How impressive! And what a difference from other implementations!

On top of that, Björn goes into a lot of detail about the decoder, gives usage examples and a performance evaluation. It's no surprise that this decoder is super fast.
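For readers who have not seen it, the core of the decoder is roughly the following (paraphrased from memory - the authoritative version, including the actual values of the 364-byte utf8d table, lives on Björn's page; the table contents are elided here):

```c
#include <stdint.h>

#define UTF8_ACCEPT 0	/* values as in the published source */
#define UTF8_REJECT 12

/* 256 character-class entries followed by a 108-entry state-transition
 * table, 364 bytes in total. The real values are elided here; take them
 * from the published source. */
static const uint8_t utf8d[364] = { 0 /* ... */ };

/* One DFA step: classify the byte, accumulate code point bits into *codep,
 * follow the state transition and return the new state. */
static uint32_t
decode(uint32_t *state, uint32_t *codep, uint32_t byte)
{
	uint32_t type = utf8d[byte];

	*codep = (*state != UTF8_ACCEPT) ?
	    (byte & 0x3fu) | (*codep << 6) :
	    (0xffu >> type) & byte;
	*state = utf8d[256 + *state + type];
	return (*state);
}
```

A caller feeds bytes one at a time and watches the returned state: UTF8_ACCEPT means a complete code point sits in *codep, UTF8_REJECT means the input is not valid UTF-8, and anything else means the sequence is not complete yet.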

#Some Questions

But some details are unclear. For example, Björn writes:

> As written, the decoder disallows encoding of surrogate code points, overlong 2, 3, and 4 byte sequences, and 4 byte sequences outside the Unicode range.

But what about behavior regarding illegal continuations and short strings? Are overlong sequences detected as soon as possible or at a later point? What about 5 and 6 byte sequences?

In general, as trustworthy as the web page looks, how can we know how exactly the decoder behaves, and gain confidence that there are no surprises lurking in the dark?

I set out to find a test to answer these questions. A lot of projects on UTF-8 testing exist, but I did not find one that does what I thought should be done:

#Exhaustively Test all Possible Inputs

That's this project.

hoehrmann-utf8-test.c exhaustively tests all possible inputs to the decoder: that makes 269492416 different byte sequences[^1], out of which 1112063[^2] are accepted.
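How exactly hoehrmann-utf8-test.c enumerates these inputs is best read in the source; the counts above come from the real test, not from the following toy sketch, which only illustrates the idea under the assumption of the decode() interface sketched earlier: extend every prefix with all 255 non-NUL byte values and stop extending once the decoder has accepted or rejected.

```c
/* Toy sketch, NOT the actual test program: starting from the initial
 * state, try every non-NUL byte, recurse while the decoder has neither
 * accepted nor rejected, and tally the outcomes. Assumes decode(),
 * UTF8_ACCEPT and UTF8_REJECT from the sketch above. */
static unsigned long n_tested, n_accepted;

static void
enumerate(uint32_t state, uint32_t codep, int depth, int maxdepth)
{
	uint32_t b, s, c;

	if (depth == maxdepth)
		return;
	for (b = 1; b < 256; b++) {
		s = state;
		c = codep;
		n_tested++;
		switch (decode(&s, &c, b)) {
		case UTF8_ACCEPT:
			n_accepted++;
			break;
		case UTF8_REJECT:
			break;
		default:
			/* incomplete sequence: keep extending it */
			enumerate(s, c, depth + 1, maxdepth);
			break;
		}
	}
}
```

Called as enumerate(UTF8_ACCEPT, 0, 0, max_len) for a suitable maximum length, this walks every byte sequence whose proper prefixes are still undecided; the real test also generates sequences longer than any the decoder accepts.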

So I now have confidence that the decoder works correctly (by one of the many interpretations of "correct" in the context of UTF-8). From the test results we learn the following:

* The decoder does not implement "Java U+0000" (MUTF-8) - the sequence \xc0\x80 gets rejected.
* Overlong sequences are rejected as soon as possible.
* Obviously, if the sequence ends with NUL before its correctness can be validated, the decoder is left in an internal state other than UTF8_REJECT or UTF8_ACCEPT (see the sketch below).
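In code, the last point means the caller has to check the final state after the last byte. A minimal sketch (again assuming the decode() interface from above, not code taken from the test program):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: return 1 if buf[0..len-1] is well-formed UTF-8, 0 otherwise.
 * The final state check catches multi-byte sequences that were cut off
 * before they could be validated. */
static int
is_valid_utf8(const uint8_t *buf, size_t len)
{
	uint32_t state = UTF8_ACCEPT, codepoint = 0;
	size_t i;

	for (i = 0; i < len; i++)
		if (decode(&state, &codepoint, buf[i]) == UTF8_REJECT)
			return (0);
	return (state == UTF8_ACCEPT);
}
```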

We also gain confidence that Björn's other claims are correct:

* UTF-16 surrogates (code points 0xd800 to 0xdfff) are rejected
* code points above 0x10ffff and sequences longer than four bytes are rejected.

The latter point is a relevant difference from other decoders (e.g. the one in pcre2), which (still) accept sequences of up to six bytes.

#Running the test

The code in this repository is supported by GNU Autotools. A common build environment (autoconf, automake, a C compiler, ...) is required. The default way to compile and run the test is to call, from the top-level directory:

```
./bootstrap
make check
```

The output of the test can be found in src/hoehrmann-utf8-test.log.

Alternatively, you can just compile the test manually.

#Please get in touch!

If any of what I wrote is inadequate or incomplete, or if you know of good resources that I have missed, please do get in touch:

-- Berlin, 2023-09-06

[^1]: The test generates sequences longer than the decoder accepts.

[^2]: One less than the number of Unicode scalar values (1114112 code points minus the 2048 surrogates), because the decoder does not accept NUL / \0 (the null byte).