It is 2023 and I need a UTF-8 / JSON transcoder for vmod_j, a JSON formatter for varnish-cache.
For a transcoder, I need a decoder.
A naive implementation is easily written off the (excellent!) UTF-8 Wikipedia page, but both the article and implementation experience make one thing clear: a UTF-8 decoder is not a trivial endeavor. Some 30-plus years after Ken Thompson's proposal for the basic encoding, plenty of pitfalls are lurking and many questions still seem to be open.
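To make concrete what "naive" means here, a minimal sketch (my illustration, not code from this repository): classify the lead byte, collect the payload bits from the continuation bytes, and deliberately ignore exactly the pitfalls discussed here, such as overlong encodings, surrogates and out-of-range values.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Naive UTF-8 decoding as one might write it straight off the Wikipedia
 * page: returns the number of bytes consumed (0 on error) and stores the
 * code point in *cp. It accepts overlong encodings, surrogates and values
 * above U+10FFFF, i.e. the very pitfalls a serious decoder has to handle.
 */
static size_t
naive_decode(const uint8_t *s, size_t len, uint32_t *cp)
{
    size_t n, i;

    if (len == 0)
        return (0);
    if (s[0] < 0x80) {                      /* ASCII */
        *cp = s[0];
        return (1);
    } else if ((s[0] & 0xe0) == 0xc0) {     /* 2-byte lead */
        *cp = s[0] & 0x1f;
        n = 2;
    } else if ((s[0] & 0xf0) == 0xe0) {     /* 3-byte lead */
        *cp = s[0] & 0x0f;
        n = 3;
    } else if ((s[0] & 0xf8) == 0xf0) {     /* 4-byte lead */
        *cp = s[0] & 0x07;
        n = 4;
    } else
        return (0);                         /* invalid lead byte */
    if (len < n)
        return (0);                         /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return (0);                     /* invalid continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3f);
    }
    return (n);
}
```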
The standard may say one thing, but for practical reasons, violating it may be the better option. Take the example of lossless / PEP-383 "surrogateescape" error reporting.
I had no idea...
So, after reading the source code of various implementations (e.g. pcre2, glib, freebsd), I came across one implementation which, apparently, is pretty legendary, and seeing the code makes it immediately apparent why: the Höhrmann decoder, implemented as a deterministic finite state machine, needs only a handful of lines of C code and 364 bytes for a combined character class and state transition table.
How impressive! And what a difference to other implementations!
On top of that, Björn goes into a lot of detail about the decoder, gives usage examples and a performance evaluation. It's no surprise that this decoder is super fast.
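For reference, the interface published on Björn's page consists of the two states `UTF8_ACCEPT` and `UTF8_REJECT` and a single `decode()` function (its body and the 364-byte table live on his page and are not repeated here). A minimal validation loop in the style of his usage examples might look like this:

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

/* Björn Höhrmann's decoder; implementation and table not reproduced here. */
uint32_t decode(uint32_t *state, uint32_t *codep, uint32_t byte);

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int
is_valid_utf8(const uint8_t *buf, size_t len)
{
    uint32_t state = UTF8_ACCEPT, codep = 0;
    size_t i;

    for (i = 0; i < len; i++)
        if (decode(&state, &codep, buf[i]) == UTF8_REJECT)
            return (0);
    /* a trailing incomplete sequence leaves the state mid-character */
    return (state == UTF8_ACCEPT);
}
```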
But some details are unclear. For example, Björn writes:
> As written, the decoder disallows encoding of surrogate code points, overlong 2, 3, and 4 byte sequences, and 4 byte sequences outside the Unicode range.
But what about behavior regarding illegal continuations and short strings? Are overlong sequences detected as soon as possible or at a later point? What about 5 and 6 byte sequences?
In general, as trustworthy as the web page looks, how can we know how exactly the decoder behaves, and gain confidence that there are no surprises lurking in the dark?
I set out to find a test to answer these questions. A lot of projects on UTF-8 testing exist, but I did not find one doing what I thought should be done: exhaustively test every possible input.

That's this project.
`hoehrmann-utf8-test.c` exhaustively tests all possible inputs to the decoder. That makes 269492416 different byte sequences[^1], out of which 1112063[^2] are accepted.
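As a hypothetical sketch of how such an exhaustive enumeration can be organized (the actual `hoehrmann-utf8-test.c` may well proceed differently, and the counts quoted above come from its own enumeration): extend sequences byte by byte, prune a branch as soon as the decoder rejects, since the reject state is absorbing, and count every sequence that completes a code point.

```c
#include <stdint.h>
#include <stdio.h>

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

/* Björn Höhrmann's decoder; implementation and table not reproduced here. */
uint32_t decode(uint32_t *state, uint32_t *codep, uint32_t byte);

static uint64_t n_accepted;

/* Try every possible next byte from the given decoder state. */
static void
enumerate(uint32_t state, unsigned depth, unsigned maxlen)
{
    uint32_t b, s, codep;

    if (depth == maxlen)
        return;
    for (b = 0; b < 256; b++) {
        s = state;
        codep = 0;          /* the decoded value does not influence the state */
        if (decode(&s, &codep, b) == UTF8_REJECT)
            continue;       /* prune: the reject state is absorbing */
        if (s == UTF8_ACCEPT)
            n_accepted++;   /* one complete, accepted code point */
        else
            enumerate(s, depth + 1, maxlen);    /* sequence still open */
    }
}

int
main(void)
{
    /* length 5 is longer than any sequence the decoder can accept */
    enumerate(UTF8_ACCEPT, 0, 5);
    /* note: unlike the actual test, this sketch would also count \0 */
    printf("accepted sequences: %llu\n", (unsigned long long)n_accepted);
    return (0);
}
```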
So I now have confidence that the decoder works correctly (by one of the many interpretations of correct in the context of UTF-8). From the test results we learn, for example, that the overlong encoding `\xc0\x80` gets rejected: the decoder ends up in `UTF8_REJECT` rather than `UTF8_ACCEPT`. Also we gain confidence that Björn's other claims are correct:
The latter point is a relevant difference from other decoders (e.g. the one in pcre2), which (still) accept up to six-byte sequences.
The code in this repository is supported by GNU Autotools. A common build environment (`autoconf`, `automake`, a C compiler, ...) is required. The default way to compile and run the test is to call, from the top-level directory:
```
./bootstrap
make check
```
The output of the test can be found in `src/hoehrmann-utf8-test.log`.
Alternatively, you can just compile the test manually.
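For example (a guess, not taken from this repository's documentation: assuming the test is a single translation unit under `src/`; check the sources for what actually needs to be compiled and linked), something along these lines may do:

```
cc -O2 -o hoehrmann-utf8-test src/hoehrmann-utf8-test.c && ./hoehrmann-utf8-test
```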
If any of what I wrote is inadequate or incomplete, or if you know good resources which I have missed, please do get in touch:
-- Berlin, 2023-09-06
[^1]: the test generates longer sequences than the decoder accepts
[^2]: one less than the maximum number, because the decoder does not accept NUL / `\0` (the null byte).