9.1 KiB
Benchmarking
This program started out life as a fairly simple Go program that just runs everything in a single thread. I'm rewriting it in Rust for learning, but also because I think Rayon might let me make this problem parallelizable!
Here's an initial benchmark on a /usr/share/dict/words
of 235,886 lines:
$ hyperfine './result/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 307.4 ms ± 6.0 ms [User: 273.5 ms, System: 75.5 ms]
Range (min … max): 298.3 ms … 317.2 ms 10 runs
Let's see if we can get any faster than that!
First naive attempt
Just getting something working:
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 2.579 s ± 0.053 s [User: 2.535 s, System: 0.017 s]
Range (min … max): 2.537 s … 2.725 s 10 runs
So... much worse than my naive attempt in Go. However, this version is also doing proper argument parsing and has a little nicer error handling. Let's try and make it parallel!
Rayon
Rayon is nice!
Simply substituting sort_by_key
for par_sort_by_key
makes this a ton faster.
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 609.6 ms ± 9.1 ms [User: 3.748 s, System: 0.029 s]
Range (min … max): 598.9 ms … 629.1 ms 10 runs
It's still twice as slow as the Go version, though! Wow!
Without error handling
There are not very many things that can go wrong in this program. Let's just try unwrapping and panicking?
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 612.9 ms ± 15.0 ms [User: 3.768 s, System: 0.030 s]
Range (min … max): 595.7 ms … 640.2 ms 10 runs
Well, not that then.
Without strsim
Maybe strsim
is doing something inefficient?
What if we try, say, levenshtein
, which appears to operate on strings directly instead?
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 715.6 ms ± 6.9 ms [User: 4.483 s, System: 0.033 s]
Range (min … max): 705.7 ms … 725.5 ms 10 runs
So, no to that too!
Removing arg parsing overhead?
What if it's creating that big Clap struct that's causing problems? Let's give structopt a try (and then move to deriving from Clap once 3.0.0 is finally released.)
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 667.5 ms ± 7.1 ms [User: 4.158 s, System: 0.031 s]
Range (min … max): 658.3 ms … 678.2 ms 10 runs
Ok, seems fine!
Bump allocator
What if deallocation is the problem?
We don't do anything fancy in Drop
other than flushing the final output... let's try!
(Using bump_alloc)
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 1.351 s ± 0.030 s [User: 5.809 s, System: 2.811 s]
Range (min … max): 1.321 s … 1.406 s 10 runs
... well, no. Probably not a good idea.
Unstable Sort
Looking at the Go implementation again, it looks like I used an unstable sort instead of a stable one. OK, that's fine, we'll grab that speedup:
hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 502.2 ms ± 6.6 ms [User: 660.1 ms, System: 12.7 ms]
Range (min … max): 491.5 ms … 513.2 ms 10 runs
And quite a speedup it is! About 150ms over the previous improvement.
Precalculate sizes
This is so much of a bigger result that I wonder if we're doing more work in parallel than we really need to?
What if we compute the distances in a map
instead of doing it in the parallel code?
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 170.1 ms ± 4.5 ms [User: 158.5 ms, System: 12.6 ms]
Range (min … max): 163.5 ms … 181.6 ms 16 runs
Yay! That finally gave us the result we wanted this whole time! It's way faster!
A real comparison:
$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 287.8 ms ± 5.1 ms [User: 254.8 ms, System: 75.3 ms]
Range (min … max): 282.3 ms … 296.2 ms 10 runs
Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 165.8 ms ± 7.4 ms [User: 154.5 ms, System: 12.2 ms]
Range (min … max): 155.8 ms … 186.2 ms 15 runs
Summary
'./target/release/similar-sort benchmark < /usr/share/dict/words' ran
1.74 ± 0.08 times faster than './result/bin/similar-sort define < /usr/share/dict/words'
Calculating sizes in parallel
What if we calculated the size in parallel? Could we get it even faster?
hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 295.0 ms ± 5.6 ms [User: 259.3 ms, System: 76.5 ms]
Range (min … max): 287.4 ms … 305.2 ms 10 runs
Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 153.5 ms ± 3.4 ms [User: 143.0 ms, System: 11.0 ms]
Range (min … max): 147.5 ms … 163.0 ms 19 runs
Summary
'./target/release/similar-sort benchmark < /usr/share/dict/words' ran
1.92 ± 0.06 times faster than './result/bin/similar-sort define < /usr/share/dict/words'
Yep!
Bump Allocation (again)
Let's try the bump allocator again... it seemed like a fluke that it produced such a severe performance degradation.
$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 286.2 ms ± 5.0 ms [User: 251.4 ms, System: 72.5 ms]
Range (min … max): 280.8 ms … 298.4 ms 10 runs
Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 111.3 ms ± 3.4 ms [User: 94.7 ms, System: 16.0 ms]
Range (min … max): 104.1 ms … 118.5 ms 24 runs
Summary
'./target/release/similar-sort benchmark < /usr/share/dict/words' ran
2.57 ± 0.09 times faster than './result/bin/similar-sort define < /usr/share/dict/words'
That seems more like what I'd expect!
Coda
2.57 times faster for an hour and a half of work is pretty good! Next time I'm running a Linux machine, maybe I'll try some of the fancier Linux-only Rust performance tools; maybe there's more to be gained here!
Once I get a naersk
build with all the optimizations enabled, here's the final result:
$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './go-result/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 99.1 ms ± 2.3 ms [User: 83.7 ms, System: 16.0 ms]
Range (min … max): 95.9 ms … 104.6 ms 28 runs
Benchmark #2: ./go-result/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 304.4 ms ± 4.4 ms [User: 269.9 ms, System: 77.1 ms]
Range (min … max): 299.9 ms … 313.8 ms 10 runs
Summary
'./result/bin/similar-sort define < /usr/share/dict/words' ran
3.07 ± 0.08 times faster than './go-result/bin/similar-sort benchmark < /usr/share/dict/words'
Coda II
Well, turns out I forgot to have the program calculate the edit distance in parallel. It's even faster now!
$ hyperfine -L impl iter,par_iter './{impl}/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./iter/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 118.6 ms ± 2.2 ms [User: 102.4 ms, System: 15.7 ms]
Range (min … max): 113.9 ms … 123.3 ms 24 runs
Benchmark #2: ./par_iter/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 91.7 ms ± 2.1 ms [User: 247.9 ms, System: 113.5 ms]
Range (min … max): 88.3 ms … 95.6 ms 31 runs
Summary
'./par_iter/bin/similar-sort benchmark < /usr/share/dict/words' ran
1.29 ± 0.04 times faster than './iter/bin/similar-sort benchmark < /usr/share/dict/words'