brian

Archived

This repository has been archived on 2024-04-17. You can view files and clone it, but cannot push or open issues/pull-requests.

9.1 KiB

Raw Blame History

Benchmarking

This program started out life as a fairly simple Go program that just runs everything in a single thread. I'm rewriting it in Rust for learning, but also because I think Rayon might let me make this problem parallelizable!

Here's an initial benchmark on a /usr/share/dict/words of 235,886 lines:

$ hyperfine './result/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     307.4 ms ±   6.0 ms    [User: 273.5 ms, System: 75.5 ms]
  Range (min … max):   298.3 ms … 317.2 ms    10 runs

Let's see if we can get any faster than that!

First naive attempt

Just getting something working:

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):      2.579 s ±  0.053 s    [User: 2.535 s, System: 0.017 s]
  Range (min … max):    2.537 s …  2.725 s    10 runs

So... much worse than my naive attempt in Go. However, this version is also doing proper argument parsing and has a little nicer error handling. Let's try and make it parallel!

Rayon

Rayon is nice! Simply substituting sort_by_key for par_sort_by_key makes this a ton faster.

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     609.6 ms ±   9.1 ms    [User: 3.748 s, System: 0.029 s]
  Range (min … max):   598.9 ms … 629.1 ms    10 runs

It's still twice as slow as the Go version, though! Wow!

Without error handling

There are not very many things that can go wrong in this program. Let's just try unwrapping and panicking?

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     612.9 ms ±  15.0 ms    [User: 3.768 s, System: 0.030 s]
  Range (min … max):   595.7 ms … 640.2 ms    10 runs

Well, not that then.

Without `strsim`

Maybe strsim is doing something inefficient? What if we try, say, levenshtein, which appears to operate on strings directly instead?

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     715.6 ms ±   6.9 ms    [User: 4.483 s, System: 0.033 s]
  Range (min … max):   705.7 ms … 725.5 ms    10 runs

So, no to that too!

Removing arg parsing overhead?

What if it's creating that big Clap struct that's causing problems? Let's give structopt a try (and then move to deriving from Clap once 3.0.0 is finally released.)

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     667.5 ms ±   7.1 ms    [User: 4.158 s, System: 0.031 s]
  Range (min … max):   658.3 ms … 678.2 ms    10 runs

Ok, seems fine!

Bump allocator

What if deallocation is the problem? We don't do anything fancy in Drop other than flushing the final output... let's try! (Using bump_alloc)

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):      1.351 s ±  0.030 s    [User: 5.809 s, System: 2.811 s]
  Range (min … max):    1.321 s …  1.406 s    10 runs

... well, no. Probably not a good idea.

Unstable Sort

Looking at the Go implementation again, it looks like I used an unstable sort instead of a stable one. OK, that's fine, we'll grab that speedup:

hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     502.2 ms ±   6.6 ms    [User: 660.1 ms, System: 12.7 ms]
  Range (min … max):   491.5 ms … 513.2 ms    10 runs

And quite a speedup it is! About 150ms over the previous improvement.

Precalculate sizes

This is so much of a bigger result that I wonder if we're doing more work in parallel than we really need to? What if we compute the distances in a map instead of doing it in the parallel code?

$ hyperfine './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     170.1 ms ±   4.5 ms    [User: 158.5 ms, System: 12.6 ms]
  Range (min … max):   163.5 ms … 181.6 ms    16 runs

Yay! That finally gave us the result we wanted this whole time! It's way faster!

A real comparison:

$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
  Time (mean ± σ):     287.8 ms ±   5.1 ms    [User: 254.8 ms, System: 75.3 ms]
  Range (min … max):   282.3 ms … 296.2 ms    10 runs

Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     165.8 ms ±   7.4 ms    [User: 154.5 ms, System: 12.2 ms]
  Range (min … max):   155.8 ms … 186.2 ms    15 runs

Summary
  './target/release/similar-sort benchmark < /usr/share/dict/words' ran
    1.74 ± 0.08 times faster than './result/bin/similar-sort define < /usr/share/dict/words'

Calculating sizes in parallel

What if we calculated the size in parallel? Could we get it even faster?

hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
  Time (mean ± σ):     295.0 ms ±   5.6 ms    [User: 259.3 ms, System: 76.5 ms]
  Range (min … max):   287.4 ms … 305.2 ms    10 runs

Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     153.5 ms ±   3.4 ms    [User: 143.0 ms, System: 11.0 ms]
  Range (min … max):   147.5 ms … 163.0 ms    19 runs

Summary
  './target/release/similar-sort benchmark < /usr/share/dict/words' ran
    1.92 ± 0.06 times faster than './result/bin/similar-sort define < /usr/share/dict/words'

Yep!

Bump Allocation (again)

Let's try the bump allocator again... it seemed like a fluke that it produced such a severe performance degradation.

$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './target/release/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
  Time (mean ± σ):     286.2 ms ±   5.0 ms    [User: 251.4 ms, System: 72.5 ms]
  Range (min … max):   280.8 ms … 298.4 ms    10 runs

Benchmark #2: ./target/release/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     111.3 ms ±   3.4 ms    [User: 94.7 ms, System: 16.0 ms]
  Range (min … max):   104.1 ms … 118.5 ms    24 runs

Summary
  './target/release/similar-sort benchmark < /usr/share/dict/words' ran
    2.57 ± 0.09 times faster than './result/bin/similar-sort define < /usr/share/dict/words'

That seems more like what I'd expect!

Coda

2.57 times faster for an hour and a half of work is pretty good! Next time I'm running a Linux machine, maybe I'll try some of the fancier Linux-only Rust performance tools; maybe there's more to be gained here!

Once I get a naersk build with all the optimizations enabled, here's the final result:

$ hyperfine './result/bin/similar-sort define < /usr/share/dict/words' './go-result/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./result/bin/similar-sort define < /usr/share/dict/words
  Time (mean ± σ):      99.1 ms ±   2.3 ms    [User: 83.7 ms, System: 16.0 ms]
  Range (min … max):    95.9 ms … 104.6 ms    28 runs

Benchmark #2: ./go-result/bin/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     304.4 ms ±   4.4 ms    [User: 269.9 ms, System: 77.1 ms]
  Range (min … max):   299.9 ms … 313.8 ms    10 runs

Summary
  './result/bin/similar-sort define < /usr/share/dict/words' ran
    3.07 ± 0.08 times faster than './go-result/bin/similar-sort benchmark < /usr/share/dict/words'

Coda II

Well, turns out I forgot to have the program calculate the edit distance in parallel. It's even faster now!

$ hyperfine -L impl iter,par_iter './{impl}/bin/similar-sort benchmark < /usr/share/dict/words'
Benchmark #1: ./iter/bin/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):     118.6 ms ±   2.2 ms    [User: 102.4 ms, System: 15.7 ms]
  Range (min … max):   113.9 ms … 123.3 ms    24 runs

Benchmark #2: ./par_iter/bin/similar-sort benchmark < /usr/share/dict/words
  Time (mean ± σ):      91.7 ms ±   2.1 ms    [User: 247.9 ms, System: 113.5 ms]
  Range (min … max):    88.3 ms …  95.6 ms    31 runs

Summary
  './par_iter/bin/similar-sort benchmark < /usr/share/dict/words' ran
    1.29 ± 0.04 times faster than './iter/bin/similar-sort benchmark < /usr/share/dict/words'

9.1 KiB Raw Blame History Unescape Escape