I don't see a compelling reason to use tombstones in linear probing except in a ...

JacksonAllan · on Nov 11, 2023

findPreferredBucket rehashes. The problem with this deletion algorithm is that every element that is a candidate for back-shifting must be rehashed to ensure that we don't shift it back beyond its home bucket (unlike in Robin Hood probing, where probe lengths already give us this information). Hence, it is pretty unsuitable when the hash function is expensive, unless we're storing hash codes or keeping the load factor very low. This image - https://ibb.co/xSytxHF - shows a benchmark of a few hash tables with string keys and a maximum load factor of 0.95. Guess which one is using linear probing with back-shift deletion.

senderista · on Nov 11, 2023

Recalculating the hash codes is indeed potentially expensive. In my example the table stores the hash codes themselves instead of the keys (because the hash function is invertible), so the hash function never needs to be recalculated (just the preferred bucket based on the hash, which is extremely cheap). With a different algorithm (Robin Hood or bidirectional linear probing), the load factor can be kept well over 90% with good performance, as the benchmarks in the same repo demonstrate.

JacksonAllan · on Nov 12, 2023

In my example the table stores the hash codes themselves instead of the keys (because the hash function is invertible)

Oh, I see, right. If determining the home bucket is trivial, then the back-shifting method is great. The issue is just that it’s not as much of a general-purpose solution as it may initially seem. In the bad cases, it's really bad.

With a different algorithm (Robin Hood or bidirectional linear probing), the load factor can be kept well over 90% with good performance, as the benchmarks in the same repo demonstrate.

I’ve seen the 90% claim made several times in literature on Robin Hood hash tables. In my experience, the claim is a bit exaggerated, although I suppose it depends on what our idea of “good performance” is. See these benchmarks, which again go up to a maximum load factor of 0.95 (although boost and Absl forcibly grow/rehash at 0.85-0.9):

https://strong-starlight-4ea0ed.netlify.app/

Tsl, Martinus, and CC are all Robin Hood tables (https://github.com/Tessil/robin-map, https://github.com/martinus/robin-hood-hashing, and https://github.com/JacksonAllan/CC, respectively). Absl (from Google) and Boost are well-known C++ SIMD-based hash tables. Khash (https://github.com/attractivechaos/klib/blob/master/khash.h) is, I think, an ordinary open-addressing table using quadratic probing. Fastmap is a new, yet-to-be-published design that is fundamentally similar to bytell (https://www.youtube.com/watch?v=M2fKMP47slQ) but also incorporates some aspects of the aforementioned SIMD maps (in particular, it stores a 4-bit fragment of the hash code to avoid most false key comparisons).

As you can see, all the Robin Hood maps spike upwards dramatically as the load factor gets high, becoming as much as 5-6 times slower at 0.95 vs 0.5 in one of the benchmarks (uint64_t key, 256-bit struct value: Total time to erase 1000 existing elements with N elements in map). Only the SIMD maps (with Boost being the clear better performer of the two) and Fastmap appear mostly immune to load factor in all benchmarks, although the SIMD maps do - I believe - use tombstones for deletion in at least some cases.

I’ve only read briefly about bi-directional linear probing – never experimented with it.

senderista · on Nov 13, 2023

Thanks for the benchmark references.

I would be curious what you think of my benchmarks of RH and BLP: https://github.com/senderista/hashtable-benchmarks/wiki/64-b.... BLP performance seems much more stable than RH in general. I haven't directly compared BLP to SIMD designs (since the impl is in Java), but would like to do so at some point. At any rate, I haven't yet found any reason to use RH rather than BLP (note that a BLP impl could store preferred bucket index, truncated hash code, distance from preferred bucket, etc., just like RH).

JacksonAllan · on Nov 14, 2023

As I’m sure you know, benchmarking hash tables is pretty difficult because there’s many variables that affect their performance and it’s hard to cover all use cases. Let me start by explaining the benchmarks I linked you to earlier. Eventually they will be open-sourced and published as part of a comprehensive review of C (and some C++) hash tables, but that’s potentially months away now that there is a war in Gaza keeping me very busy in my day job.

The "uint32_t key, uint32_t value" benchmarks test how the hash tables perform when the hash function and key comparison function are inexpensive, traversing buckets is inexpensive (i.e. does not cause many cache misses), and moving elements is cheap. These benchmarks disadvantage tables that store metadata in a separate array (which here are Absl, Boost, Martinus, and Fastmap) because doing so necessarily causes at least one extra cache miss per lookup.

The "uint64_t key, 256-bit struct" value benchmarks test how the tables perform when the hash function and key comparison function are inexpensive, traversing buckets is expensive (a cache miss per bucket), and moving elements is expensive. These benchmarks disadvantage tables that don’t store metadata in a separate array (or do but access the buckets array with every probe anyway to check the key) and that move elements around a lot (e.g. Robin Hood).

The "16-char NULL-terminated string key, uint64_t value" benchmarks test how tables perform when the hash function and key comparison function are expensive. These benchmarks disadvantage hash tables that lack a (metadata) mechanism to avoid most key comparisons or do a lot of rehashing (this is where the performance of the liner-probing/back-shift-deletion tables goes nuclear if they’re not storing hash codes or home bucket indices).

As I mentioned earlier, in these benchmarks the max load factor is set to 95% (although the SIMD tables rehash early even after we modify the fixed max load factors hard-coded into the libraries). Measurements are taken at intervals of 50k. Each data point is the average of five runs. We can make the lines smoother, and therefore make the benchmarks more readable, by upping the runs to ten or more (although adding more runs hides the variability of the maps that suffer from this problem – e.g. notice how Khash’s plots are usually much more squiggly than those of the other tables?). This approach allows us to see the performance of each map across the whole spectrum of load factors from about 0.48 (the troughs) to 0.95 (the peaks).

Of course, these benchmarks don’t cover all cases and combinations – just a hopefully representative sample. They also don’t show memory usage (in general, the SIMD maps have one or one-and-a-fraction of a byte of overhead per bucket, Fastmap has two bytes, and the Robin Hood maps should have anywhere from two to eight bytes). They also don’t show the effect of tombstones when we do lots of deletions (e.g. the “erasure” benchmarks don’t show why the tombstoneless Fastmap may be superior to Boost in this regard).

Now, on to your benchmarks, which I had a quick look at earlier and again just now:

The first thing that jumps out at me is that it looks like you’re only testing longs as keys. In other words, you appear to only be covering the first scenario - perhaps the ideal scenario for an open-addressing table - that I described above: the hash function and key comparison function are inexpensive, traversing buckets is inexpensive, and moving elements is cheap. Your results may well vary when you change any of these variables.

The second thing I’m concerned about is the way you handle load factor and measurement intervals. Take your “Average time for successful lookup (90% load factor)” benchmark, for example. Are you just setting the max load factor to 90% and then measuring at 10k, 100k, 1m, 10m, and 100m elements? If so, then you’re not measuring the tables at a 90% load factor – you’re measuring them at whatever their load factors happen to be at those intervals. This might explain, for example, why the measurement of the Robin Hood map at 1m is so radically different from its measurement at 10m in that benchmark, or why your plots look like bell curves in your “Average time for successful lookup (99% load factor)” benchmark. If I’m right, then I think you need to either use a much smaller measurement interval so that you actually capture the performance near the target load or benchmark the tables not at element-count intervals but when they approach the target load factor. If you do the former (as I do), then there might not be any need for the separate lower-max-load-factor benchmarks because the highest-max-load-factor benchmarks will – as I mentioned earlier – also show the performance at lower load factors.

Of course, it’s nice that your horizontal scale is exponential, not linear like mine. For my benchmarks, I’m basically assuming that the trends we can clearly see for the 0-20m range will continue at higher element counts (I can’t see why they wouldn’t, and the difference made by slightly more cache misses as the element count grows should become less and less apparent at higher counts).

Anyway, that's my 2c - hope it's helpful. Let me know if you have any suggestions regarding my benchmarks. I followed you on GitHub so that I can share some more comprehensive (i.e. many more C and C++ hash tables) benchmarks with you once I have time to run the full suite again (it takes a few hours, and I only have one computer).

senderista · on Nov 14, 2023

Thanks for the details on your benchmarks. I would like sometime to extend BLP to a more generic setting; as I said I think any trick used with RH would also work with BLP. I just used an integer set because that's all I needed for my use case and it was easy to implement several different approaches for benchmarking. As you note, it favors use cases where the hash function is cheap (or invertible) and elements are cheap to move around.

About your question on load factors: no, the benchmarks are measuring exactly what they claim to be. The hash table constructor divides max data size by load factor to get the table size (https://github.com/senderista/hashtable-benchmarks/blob/mast...), and the benchmark code instantiates each hash table for exactly the measured data set size and load factor (https://github.com/senderista/hashtable-benchmarks/blob/mast...).

I can't explain the peaks around 1M in many of the plots; I didn't investigate them at the time and I don't have time now. It could be a JVM artifact, but I did try to use JMH "best practices", and there's no dynamic memory allocation or GC happening during the benchmark at all. It would be interesting to port these tables to Rust and repeat the measurements with Criterion. For more informative graphs I might try a log-linear approach: divide the intervals between the logarithmically spaced data sizes into a fixed number of subintervals (say 4).

JacksonAllan · on Nov 15, 2023

I'll try to download and play around with your benchmarks when I have a chance. After reading you explanation of how you create the tables at the desired load factor, some of those plots definitely look rather odd to me. What I'd expect to see across all your benchmarks is a bunch of upward curves tapering off at the top (or perhaps just an upward liner lines, given that your horizontal scale is exponential). Basically, the performance of a given table at the same load factor should be fundamentally similar irrespective of how many elements are in the table, except that the higher the count gets, the less frequently the table will benefit from incidental cache hits when consecutive lookups coincidentally hit the same part of the buckets arrays. You can see this in my benchmarks (except for the cumulative "Total time to insert N nonexisting elements" benchmarks and the iteration benchmarks, which are a whole other can of worms). Notice how my peaks grow higher with the element count but tapper off on the right-hand side? In contrast, your plots' data points (which I think should be analogous to the peaks in my graphs) seem to jump around, with the high-element-count points often appearing lower than the low-element count ones. This seems very unexpected.

H̵e̵r̵e̵'̵s̵ ̵o̵n̵e̵ ̵i̵d̵e̵a̵:̵ ̵I̵t̵'̵s̵ ̵b̵e̵e̵n̵ ̵m̵a̵n̵y̵ ̵y̵e̵a̵r̵s̵ ̵s̵i̵n̵c̵e̵ ̵I̵ ̵t̵o̵u̵c̵h̵e̵d̵ ̵J̵a̵v̵a̵.̵ ̵H̵o̵w̵ ̵d̵o̵e̵s̵ ̵J̵a̵v̵a̵'̵s̵ ̵g̵a̵r̵b̵a̵g̵e̵ ̵c̵o̵l̵l̵e̵c̵t̵o̵r̵ ̵w̵o̵r̵k̵?̵ ̵D̵o̵e̵s̵ ̵i̵t̵ ̵k̵i̵c̵k̵ ̵i̵n̵ ̵i̵n̵t̵e̵r̵m̵i̵t̵t̵e̵n̵t̵l̵y̵?̵ ̵C̵o̵u̵l̵d̵ ̵t̵h̵e̵ ̵g̵a̵r̵b̵a̵g̵e̵ ̵c̵o̵l̵l̵e̵c̵t̵o̵r̵ ̵b̵e̵ ̵m̵u̵d̵d̵l̵i̵n̵g̵ ̵y̵o̵u̵r̵ ̵m̵e̵a̵s̵u̵r̵e̵m̵e̵n̵t̵s̵,̵ ̵a̵n̵d̵ ̵i̵f̵ ̵s̵o̵,̵ ̵c̵a̵n̵ ̵i̵t̵ ̵b̵e̵ ̵d̵i̵s̵a̵b̵l̵e̵d̵?̵ Edit: Sorry, I just reread your comment and saw that you already addressed garbage collection.

senderista · on Nov 15, 2023

Despite my disclaimer about GC (and my effort to use JMH properly), I find it difficult to trust microbenchmarks on the JVM. I don't know when I'll have time for this, but "someday" I'd like to port this whole codebase to Rust/Criterion (which should be straightforward because the algorithms/data structures are "trivial"), and see if the more surprising artifacts persist. I do find the overall differentiation between RH and BLP surprising; I expected them to have pretty similar performance profiles.

In any case, I would definitely appreciate someone else rerunning the benchmarks on a different platform/JVM!

adgjlsfhk1 · on Nov 11, 2023

imo a pretty good approach is tombstones where you delete trailing tombstones. that way the tombstones can't overly clog the dictionary but never have to move live objects

senderista · on Nov 11, 2023

Here is a simple heuristic for avoiding most tombstones when the load factor isn't too high:

https://arxiv.org/pdf/1808.04602.pdf

celeritascelery · on Nov 11, 2023

This is a really cool approach! But if it so obvious, why doesn't every hashmap use it? It seems like there are some trade-offs here that I must be missing.

Leszek · on Nov 11, 2023

If you do any kind of probing other than linear probing (e.g. quadratic probing) this approach doesn't work anymore, because your collisions are no longer densely grouped together.

senderista · on Nov 11, 2023

The main tradeoff is concurrency: it's difficult to safely read a hash table while its entries are being concurrently relocated. Another tradeoff is (probably) higher average latency (but worst-case latency is much better since there's no global rehashing required).

teo_zero · on Nov 12, 2023

May I ask why you decided to store the hash instead of the element?

senderista · on Nov 13, 2023

Two reasons:

1. I never need to recalculate the hash in order to find the preferred bucket.

2. The BLP invariant is that entries are strictly ordered by hash code (Robin Hood can actually be reformulated in terms of the same invariant, but it's less efficient because entries with the same preferred bucket can only reside at or after that bucket rather than before, at, or after it). This allows you to view a hash table as just a sorted array with gaps, and hash table lookup as interpolation search.

teo_zero · on Nov 17, 2023

I assume you do a lot of deletions for 1. to be significant. On the other hand, you restrict yourself to a specific class of hashes: reversible and bijective (and never producing a zero for a valid input, although I think this is just how this particular table is built and can be easily changed). This a severe limitation to ever store any data other than integers, e.g. strings.

I'm not criticizing. I've never implemented a hash table myself and want to learn. I'm just trying to understand what's the trade off you considered that I don't see.

senderista · on Nov 17, 2023

Using a reversible hash function to store encoded keys is certainly a domain-specific optimization, but integer keys are so common that I don't understand why I don't see this optimization more often. For non-integer keys, you can just store truncated hash codes to eliminate hash function calls in most cases (note that the hash value is needed both to calculate the preferred bucket and to determine the order of entries in that bucket). The smaller the table, the smaller the truncated hash codes can be. When you resize the table, you can increase the truncated hash code size if necessary (one bit for each doubling, which you can amortize by adding a byte every 8 doublings). This of course requires rehashing all keys, but conventional hash tables already rehash all keys on resizing.