This repository has been archived on 2024-04-17. You can view files and clone it, but cannot push or open issues/pull-requests.
similar-sort/BENCHMARKING.md

236 lines
9.1 KiB
Markdown
Raw Permalink Normal View History

2021-08-24 20:05:08 +00:00
# Benchmarking
This program started out life as a fairly simple Go program that just runs everything in a single thread.
I'm rewriting it in Rust for learning, but also because I think Rayon might let me make this problem parallelizable!
Here's an initial benchmark on a `/usr/share/dict/words` of 235,886 lines:
```
$ hyperfine './result/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 307.4 ms ± 6.0 ms [User: 273.5 ms, System: 75.5 ms]
Range (min … max): 298.3 ms … 317.2 ms 10 runs
```
Let's see if we can get any faster than that!
2021-08-24 20:55:58 +00:00
## First naive attempt
Just getting something working:
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 2.579 s ± 0.053 s [User: 2.535 s, System: 0.017 s]
Range (min … max): 2.537 s … 2.725 s 10 runs
```
So... much worse than my naive attempt in Go.
However, this version is also doing proper argument parsing and has a little nicer error handling.
Let's try and make it parallel!
## Rayon
Rayon is nice!
Simply substituting `sort_by_key` for `par_sort_by_key` makes this a ton faster.
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 609.6 ms ± 9.1 ms [User: 3.748 s, System: 0.029 s]
Range (min … max): 598.9 ms … 629.1 ms 10 runs
```
2021-08-24 21:17:52 +00:00
It's still twice as slow as the Go version, though!
2021-08-24 20:55:58 +00:00
Wow!
2021-08-24 21:17:52 +00:00
## Without error handling
2021-08-24 20:55:58 +00:00
2021-08-24 21:17:52 +00:00
There are not very many things that can go wrong in this program.
Let's just try unwrapping and panicking?
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 612.9 ms ± 15.0 ms [User: 3.768 s, System: 0.030 s]
Range (min … max): 595.7 ms … 640.2 ms 10 runs
```
2021-08-24 21:24:59 +00:00
Well, not that then.
## Without `strsim`
Maybe `strsim` is doing something inefficient?
What if we try, say, `levenshtein`, which appears to operate on strings directly instead?
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 715.6 ms ± 6.9 ms [User: 4.483 s, System: 0.033 s]
Range (min … max): 705.7 ms … 725.5 ms 10 runs
```
So, no to that too!
2021-08-24 21:50:08 +00:00
## Removing arg parsing overhead?
What if it's creating that big Clap struct that's causing problems?
Let's give structopt a try (and then move to deriving from Clap once 3.0.0 is finally released.)
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 667.5 ms ± 7.1 ms [User: 4.158 s, System: 0.031 s]
Range (min … max): 658.3 ms … 678.2 ms 10 runs
```
Ok, seems fine!
2021-08-24 21:57:11 +00:00
## Bump allocator
What if deallocation is the problem?
We don't do anything fancy in `Drop` other than flushing the final output... let's try!
(Using [bump_alloc](https://crates.io/crates/bump_alloc))
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 1.351 s ± 0.030 s [User: 5.809 s, System: 2.811 s]
Range (min … max): 1.321 s … 1.406 s 10 runs
```
... well, no. Probably not a good idea.
2021-08-24 22:00:28 +00:00
## Unstable Sort
Looking at the Go implementation again, it looks like I used an unstable sort instead of a stable one.
OK, that's fine, we'll grab that speedup:
```
hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 502.2 ms ± 6.6 ms [User: 660.1 ms, System: 12.7 ms]
Range (min … max): 491.5 ms … 513.2 ms 10 runs
```
And quite a speedup it is!
About 150ms over the previous improvement.
2021-08-24 22:08:09 +00:00
## Precalculate sizes
This is so much of a bigger result that I wonder if we're doing more work in parallel than we really need to?
What if we compute the distances in a `map` instead of doing it in the parallel code?
```
$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 170.1 ms ± 4.5 ms [User: 158.5 ms, System: 12.6 ms]
Range (min … max): 163.5 ms … 181.6 ms 16 runs
```
Yay!
That finally gave us the result we wanted this whole time!
It's way faster!
A real comparison:
```
$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 287.8 ms ± 5.1 ms [User: 254.8 ms, System: 75.3 ms]
Range (min … max): 282.3 ms … 296.2 ms 10 runs
Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 165.8 ms ± 7.4 ms [User: 154.5 ms, System: 12.2 ms]
Range (min … max): 155.8 ms … 186.2 ms 15 runs
Summary
'./target/release/similar-sort benchmark < /usr/share/dict/words' ran
1.74 ± 0.08 times faster than './result/bin/similar-sort define < /usr/share/dict/words'
```
2021-08-24 22:15:44 +00:00
## Calculating sizes in parallel
What if we calculated the size in parallel?
Could we get it even faster?
```
hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 295.0 ms ± 5.6 ms [User: 259.3 ms, System: 76.5 ms]
Range (min … max): 287.4 ms … 305.2 ms 10 runs
Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 153.5 ms ± 3.4 ms [User: 143.0 ms, System: 11.0 ms]
Range (min … max): 147.5 ms … 163.0 ms 19 runs
Summary
'./target/release/similar-sort benchmark < /usr/share/dict/words' ran
1.92 ± 0.06 times faster than './result/bin/similar-sort define < /usr/share/dict/words'
```
Yep!
2021-08-24 22:23:35 +00:00
## Bump Allocation (again)
Let's try the bump allocator again... it seemed like a fluke that it produced such a severe performance degradation.
```
$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 286.2 ms ± 5.0 ms [User: 251.4 ms, System: 72.5 ms]
Range (min … max): 280.8 ms … 298.4 ms 10 runs
Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 111.3 ms ± 3.4 ms [User: 94.7 ms, System: 16.0 ms]
Range (min … max): 104.1 ms … 118.5 ms 24 runs
Summary
'./target/release/similar-sort benchmark < /usr/share/dict/words' ran
2.57 ± 0.09 times faster than './result/bin/similar-sort define < /usr/share/dict/words'
```
That seems more like what I'd expect!
2021-08-24 22:24:30 +00:00
## Coda
2.57 times faster for an hour and a half of work is pretty good!
Next time I'm running a Linux machine, maybe I'll try some of the fancier Linux-only Rust performance tools; maybe there's more to be gained here!
2021-08-24 22:31:08 +00:00
Once I get a `naersk` build with all the optimizations enabled, here's the final result:
```
$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './go-result/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
Time (mean ± σ): 99.1 ms ± 2.3 ms [User: 83.7 ms, System: 16.0 ms]
Range (min … max): 95.9 ms … 104.6 ms 28 runs
Benchmark #2: ./go-result/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 304.4 ms ± 4.4 ms [User: 269.9 ms, System: 77.1 ms]
Range (min … max): 299.9 ms … 313.8 ms 10 runs
Summary
'./result/bin/similar-sort define < /usr/share/dict/words' ran
3.07 ± 0.08 times faster than './go-result/bin/similar-sort benchmark < /usr/share/dict/words'
```
2021-08-26 16:02:29 +00:00
## Coda II
Well, turns out I forgot to have the program calculate the edit distance in parallel.
It's even faster now!
```
$ hyperfine -L impl iter,par_iter './{impl}/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./iter/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 118.6 ms ± 2.2 ms [User: 102.4 ms, System: 15.7 ms]
Range (min … max): 113.9 ms … 123.3 ms 24 runs
Benchmark #2: ./par_iter/bin/similar-sort benchmark < /usr/share/dict/words
Time (mean ± σ): 91.7 ms ± 2.1 ms [User: 247.9 ms, System: 113.5 ms]
Range (min … max): 88.3 ms … 95.6 ms 31 runs
Summary
'./par_iter/bin/similar-sort benchmark < /usr/share/dict/words' ran
1.29 ± 0.04 times faster than './iter/bin/similar-sort benchmark < /usr/share/dict/words'
```