~bptato/chagashi

Text encoders & decoders for Chawan


#Chagashi: a Nim implementation of the WHATWG encoding standard

Chagashi is a text encoding/decoding library for Nim that implements the WHATWG Encoding Standard. It was written for Chawan.

#Minimal example

First, include it in your nimble file:

requires "chagashi"

Note: the following code uses the (very) high-level interface, which is rather inefficient. The lower-level interfaces are normally faster.

# Makeshift iconv.
# Usage: nim r whatever.nim -f fromCharset -t toCharset <infile.txt >outfile.txt
import std/os, chagashi/[encoder, decoder, charset]

var fromCharset = CHARSET_UTF_8
var toCharset = CHARSET_UTF_8
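var i = 1
while i <= paramCount():
  # Every flag takes a value, so consume the arguments in pairs.
  doAssert i < paramCount(), "missing value for " & paramStr(i)
  case paramStr(i)
  of "-f": fromCharset = getCharset(paramStr(i + 1))
  of "-t": toCharset = getCharset(paramStr(i + 1))
  else: doAssert false, "wrong parameter"
  i += 2
# getCharset signals an unrecognized label with CHARSET_UNKNOWN.
doAssert fromCharset != CHARSET_UNKNOWN and toCharset != CHARSET_UNKNOWN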
let ins = stdin.readAll()
let insDecoded = ins.decodeAll(fromCharset)
if toCharset == CHARSET_UTF_8: # insDecoded is already UTF-8, nothing to do
  stdout.write(insDecoded)
else:
  stdout.write(insDecoded.encodeAll(toCharset))

#Q&A

Q: What encodings does Chagashi support?

A: All the ones you can find on https://encoding.spec.whatwg.org/, no more and no less.

Q: What is the intermediate format?

A: UTF-8, because it is the native encoding of Nim. In general, you can just take whatever non-UTF-8 string you want to decode, pass it to the decoder, and use the result immediately.
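As a small illustration of the answer above, the sketch below decodes a windows-1252 byte string and then treats the result as an ordinary Nim string. It assumes that decodeAll and getCharset behave as they do in the minimal example, and that getCharset accepts the WHATWG label "windows-1252"; take it as a sketch, not as reference usage.

```nim
import chagashi/[decoder, charset]

# "café" in windows-1252: the trailing byte 0xE9 is not valid UTF-8 by itself.
let raw = "caf\xE9"

# Assumption: getCharset accepts WHATWG labels such as "windows-1252".
let cs = getCharset("windows-1252")
doAssert cs != CHARSET_UNKNOWN

# decodeAll returns UTF-8, so the result works like any other Nim string.
let text = raw.decodeAll(cs)
echo text
```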

Q: What API should I use?

A: For decoding: the TextDecoderContext.decode() iterator provides a fairly high-level API that does no unnecessary copying, and I recommend using that where you can.

You may also use decodeAll when performance is less of a concern and/or you need the output to be in a string, or reach for decodercore directly if you really need the best performance. (In the latter case I recommend you study the decoder module first, because it's very easy to get it wrong.)

For encoding: sorry, at the moment you need to use encodercore or stick with the (non-optimal) encodeAll. I'll see if I can add an in-between API in the future.

Q: Is it correct?

A: To my knowledge, yes. However, testing is still somewhat inadequate: many single-byte encodings are not covered yet, and we do not have fuzzing either.

Q: Is it fast?

A: Not really. I have done very little optimization because it's not necessary for my use case.

If you need better performance, feel free to complain in the tickets with a specific input and I may look into it. Patches are welcome, too.

Q: How do I decode UTF-8?

A: Like any other character set. Obviously, it won't be "decoded", just validated, because the target charset is UTF-8 as well.

Previously, the API did not have a way to return views into the input data, so we had a separate UTF-8 validator API. This turned out to be very annoying to use, so the two APIs have been unified.

Q: How do I encode UTF-8?

A: You have to make sure that the UTF-8 you are passing to the encoder is at least valid WTF-8. The encoder will convert surrogate codepoints to replacement characters, but it does not validate the input byte stream.

To validate your input, you can run validateUtf8() from std/unicode, or the aforementioned TextValidatorUTF8.validate().
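A minimal sketch of the std/unicode route, using only the standard library. The helper isValidUtf8 is a hypothetical name introduced here for illustration and is not part of Chagashi; a string that passes the check could then be handed to encodeAll as in the minimal example.

```nim
import std/unicode

# Hypothetical helper (not part of Chagashi): true if `s` is well-formed UTF-8.
proc isValidUtf8(s: string): bool =
  # validateUtf8 returns the index of the first invalid byte, or -1 if there is none.
  validateUtf8(s) == -1

let good = "caf\xC3\xA9" # "café" encoded as UTF-8
let bad = "caf\xE9"      # a lone 0xE9 byte, which starts a sequence that never completes

doAssert isValidUtf8(good)
doAssert not isValidUtf8(bad)

# The return value also tells you where the input went wrong.
doAssert validateUtf8(bad) == 3
```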

Q: Why no UTF-16 encoder?

A: It's not specified in the encoding standard, and I don't need one. Maybe try std/encodings.

Q: Why replace your previous character decoding library?

A: Because it didn't work.

#Thanks

To the standard authors for writing a detailed, easy-to-implement specification.

Chagashi's multibyte test files (test/data.tar.xz) were borrowed from Henri Sivonen's excellent encoding_rs library. His writeup on compressing the encoding data was also very helpful, and Chagashi applies similar techniques.

#License

Chagashi is dedicated to the public domain. See the UNLICENSE file for details.
