This is a reworked fork of the English Porter2 stemmer from Dmitry Chestnykh's https://github.com/dchest/stemmer/ package, optimised for performance as much as possible while preserving the original algorithm's shape.
The Porter2 stemming algorithm is described here: http://snowball.tartarus.org/algorithms/english/stemmer.html
Install:
go get -u go.shabbyrobe.org/porter2
This fork operates on bytes rather than runes internally, so it may have issues with Unicode safety, though the tests with UTF-8 characters are passing. Most of the algorithm just matches on ASCII characters anyway, so this is highly unlikely to cause problems.
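As a rough illustration of why byte-wise ASCII matching tends to be safe on UTF-8 input: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII letter. The helpers below (hasSuffixASCII, isASCIIVowel, countASCIIVowels) are hypothetical names made up for this sketch and are not part of this package.

```go
package main

import "fmt"

// hasSuffixASCII reports whether word ends with the given ASCII suffix,
// comparing raw bytes only. Hypothetical helper, for illustration.
func hasSuffixASCII(word []byte, suffix string) bool {
	if len(word) < len(suffix) {
		return false
	}
	return string(word[len(word)-len(suffix):]) == suffix
}

// isASCIIVowel is a byte-level check. Bytes belonging to a multi-byte
// UTF-8 sequence are always >= 0x80, so they never match any case here.
func isASCIIVowel(b byte) bool {
	switch b {
	case 'a', 'e', 'i', 'o', 'u', 'y':
		return true
	}
	return false
}

// countASCIIVowels walks the word byte by byte, not rune by rune.
func countASCIIVowels(word []byte) int {
	n := 0
	for _, b := range word {
		if isASCIIVowel(b) {
			n++
		}
	}
	return n
}

func main() {
	// The "ï" occupies two bytes (0xC3 0xAF); neither is an ASCII letter.
	word := []byte("naïvely")
	fmt.Println(hasSuffixASCII(word, "ly"))
	fmt.Println(countASCIIVowels(word))
}
```

Note that the non-ASCII letter is simply skipped rather than classified, which is the kind of edge case the Unicode caveat above refers to.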
There are two functions provided by this version:

Stem(string) string
: The same as the upstream repo. Creates a copy of the incoming string. About 30% slower than StemBytes, and also creates garbage, but safer.

StemBytes([]byte) []byte
: Unsafe variant. Mutates the input word and returns the truncated result. Creates no garbage.

Also, I haven't exactly wrung every last drop out of this; I tapped out after getting it nearly 90% faster. If you decide you need even more speed, I'd love to hear about what crazy tricks you pull to drag more performance out of this! It's fast enough for me for now though, so I've stopped.
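The difference between the copying and the in-place variant can be sketched with a toy stand-in. The functions trimPlural and trimPluralBytes below are hypothetical and do not implement Porter2; they only mimic the calling convention described above (a string function that copies, and a byte-slice function that mutates its input and returns a truncated view of it).

```go
package main

import "fmt"

// trimPluralBytes is a hypothetical stand-in for an in-place stemmer.
// It lower-cases ASCII letters in place and drops one trailing "s",
// returning a subslice that shares memory with the input.
func trimPluralBytes(word []byte) []byte {
	for i, b := range word {
		if b >= 'A' && b <= 'Z' {
			word[i] = b + ('a' - 'A')
		}
	}
	if n := len(word); n > 1 && word[n-1] == 's' {
		return word[:n-1]
	}
	return word
}

// trimPlural is the safe wrapper: it copies the string into a fresh
// buffer first, so the caller's data is never modified, at the cost
// of allocations.
func trimPlural(word string) string {
	return string(trimPluralBytes([]byte(word)))
}

func main() {
	in := "Cats"
	fmt.Println(trimPlural(in), in) // the input string is untouched

	buf := []byte("Cats")
	out := trimPluralBytes(buf)
	fmt.Println(string(out), string(buf)) // buf itself was rewritten
}
```

With the in-place style, copy the input first if the original bytes are still needed, and do not hold on to the returned slice after reusing the buffer.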
I have prepared this fork to suit my own strange needs, and will continue to hack on it as required.
If you would like to take advantage of this stemmer's performance improvements, I strongly recommend either forking or vendoring as I will not guarantee any stability, and may even decide to trade some accuracy for more speed at some point (but will endeavour to hide this behind a flag if possible).
I endeavour to respond to issues as quickly as I can, but I make no promises. Pull requests are unlikely to be accepted without a conversation prior to commencement.
Here is the output of benchcmp after running on my i7-8550U @ 1.8GHz (note that this compares upstream's Stem with this repo's StemBytes):
benchmark           old ns/op      new ns/op      delta
BenchmarkStem-8     1721           230            -86.64%

benchmark           old allocs     new allocs     delta
BenchmarkStem-8     0              0              +0.00%

benchmark           old bytes      new bytes      delta
BenchmarkStem-8     12             0              -100.00%
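For reading the delta column: it is the percentage change from the old to the new measurement, so -86.64% in ns/op means the time per call dropped by that much. A small sketch of the arithmetic, using the ns/op figures from the table (the function names delta and speedup are made up for this example):

```go
package main

import "fmt"

// delta returns the percentage change from oldNs to newNs.
func delta(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

// speedup returns how many times faster the new measurement is.
func speedup(oldNs, newNs float64) float64 {
	return oldNs / newNs
}

func main() {
	fmt.Printf("%+.2f%%\n", delta(1721, 230)) // -86.64%
	fmt.Printf("%.1fx\n", speedup(1721, 230)) // 7.5x
}
```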
Included test_output.txt and test_voc.txt are from the referenced original implementations, used only when running tests with go test.
2-clause BSD-like (see LICENSE and AUTHORS files).