~shabbyrobe/go-porter2

Performance-oriented fork of the English porter2 stemmer github.com/dchest/stemmer

#Faster English Porter2 Stemmer for Go

This is a reworked fork of the English Porter2 stemmer from Dmitry Chestnykh's https://github.com/dchest/stemmer/ package, optimised for performance as much as possible while preserving the original algorithm's shape.

The porter2 stemming algorithm is described here: http://snowball.tartarus.org/algorithms/english/stemmer.html

Install:

go get -u go.shabbyrobe.org/porter2

This fork operates on bytes rather than runes internally, so it may not be fully Unicode-safe, though the tests that include UTF-8 characters pass. Most of the algorithm just matches ASCII characters anyway, so it's highly unlikely to cause problems.

There are two functions provided by this version:

  • Stem(string) string: The same as the upstream repo. Creates a copy of the incoming string. About 30% slower than StemBytes, and also creates garbage, but safer.
  • StemBytes([]byte) []byte: Unsafe variant. Mutates the input word and returns the truncated result. Creates no garbage.
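To make the difference between the two call styles concrete, here is a minimal, self-contained sketch. It does not import this package: toyStem and toyStemBytes are hypothetical stand-ins that only lower-case ASCII letters and strip one trailing "s", so they are not Porter2. They mirror the signatures listed above to show what "mutates the input and returns the truncated result" means for the caller.

```go
package main

import "fmt"

// toyStemBytes is a hypothetical stand-in for an in-place stemmer.
// It is NOT Porter2: it lower-cases ASCII letters and strips one trailing "s".
// Like the StemBytes described above, it overwrites the caller's buffer
// and returns a shorter slice that shares the same backing array.
func toyStemBytes(word []byte) []byte {
	for i, c := range word {
		if c >= 'A' && c <= 'Z' {
			word[i] = c + ('a' - 'A') // mutates the caller's bytes
		}
	}
	if n := len(word); n > 2 && word[n-1] == 's' {
		word = word[:n-1] // truncate without allocating
	}
	return word
}

// toyStem is the copying variant: the conversions to []byte and back
// to string leave the caller's string untouched.
func toyStem(word string) string {
	return string(toyStemBytes([]byte(word)))
}

func main() {
	s := "Cats"
	fmt.Println(toyStem(s), s) // s is unchanged

	buf := []byte("Cats")
	out := toyStemBytes(buf)
	fmt.Println(string(out), string(buf)) // buf has been overwritten
}
```

The practical consequence is that a buffer handed to the in-place variant should be treated as consumed: keep a copy first if the original word is still needed.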

Also, I haven't exactly wrung every last drop out of this; I tapped out after getting it roughly 87% faster (see the benchmark below). If you decide you need even more speed, I'd love to hear about what crazy tricks you pull to drag more performance out of this! It's fast enough for me for now though, so I've stopped.

#Expectation Management

I have prepared this fork to suit my own strange needs, and will continue to hack on it as required.

If you would like to take advantage of this stemmer's performance improvements, I strongly recommend either forking or vendoring as I will not guarantee any stability, and may even decide to trade some accuracy for more speed at some point (but will endeavour to hide this behind a flag if possible).

I endeavour to respond to issues as quickly as I can, but I make no promises. Pull requests are unlikely to be accepted without a conversation prior to commencement.

#Silly Benchmarks Game

Here is the output of benchcmp after running on my i7-8550U @ 1.8GHz (note that this compares upstream's Stem with this repo's StemBytes):

benchmark           old ns/op     new ns/op     delta
BenchmarkStem-8     1721          230           -86.64%

benchmark           old allocs     new allocs     delta
BenchmarkStem-8     0              0              +0.00%

benchmark           old bytes     new bytes     delta
BenchmarkStem-8     12            0             -100.00%

#Tests

The included test_output.txt and test_voc.txt files come from the referenced original implementations and are used only when running the tests with go test.

#License

2-clause BSD-like (see LICENSE and AUTHORS files).