
One major issue with disk-based hash tables is rehashing. In-memory hash tables actually have the same problem, but most people don't notice it there. Imagine your database abruptly doubling in size because one inserted row overloaded a bucket.

Some hash tables that are designed to minimize the potential for DoS attacks and to amortize the rehashing work more evenly use gradual rehashing (Go's built-in map does this), but this is also not ideal for databases: the associated disk accesses are expensive, and it requires significantly more disk space while both tables coexist.
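To make the idea concrete, here is a minimal sketch of gradual rehashing in the style described above: instead of moving every entry at once, the table keeps the old bucket array around and drains a few buckets per operation. All names and the migration rate are hypothetical simplifications, not any particular implementation.

```python
class GradualHashTable:
    """Sketch of gradual (incremental) rehashing: a resize only *starts*
    a migration; each subsequent operation moves a bounded number of
    old buckets into the new array."""

    MIGRATE_PER_OP = 2  # buckets drained per insert/lookup (arbitrary choice)

    def __init__(self, nbuckets=8):
        self.old = None                              # table being drained, or None
        self.new = [[] for _ in range(nbuckets)]
        self.migrate_idx = 0
        self.count = 0

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _step_migration(self):
        if self.old is None:
            return
        for _ in range(self.MIGRATE_PER_OP):
            if self.migrate_idx >= len(self.old):
                self.old = None                      # migration finished
                return
            for k, v in self.old[self.migrate_idx]:
                self._bucket(self.new, k).append((k, v))
            self.old[self.migrate_idx] = []
            self.migrate_idx += 1

    def put(self, key, value):
        self._step_migration()
        if self.old is not None:                     # evict a stale copy, if any
            ob = self._bucket(self.old, key)
            for i, (k, _) in enumerate(ob):
                if k == key:
                    del ob[i]
                    self.count -= 1
                    break
        bucket = self._bucket(self.new, key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.old is None and self.count > 3 * len(self.new):
            # start (not finish) a resize: old table drains gradually
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.migrate_idx = 0

    def get(self, key, default=None):
        self._step_migration()
        tables = [self.new] + ([self.old] if self.old is not None else [])
        for t in tables:
            for k, v in self._bucket(t, key):
                if k == key:
                    return v
        return default
```

The point of the sketch is the worst-case bound: no single `put` ever moves more than `MIGRATE_PER_OP` buckets, at the price of keeping both arrays alive during a migration.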

Many real-world implementations simply don't rehash unless explicitly asked to, which makes them fairly inefficient for most use cases.



I've seen this happen in Azure Blob Storage: when a container hit about 10 million files, the rehash took 2 hours.


Isn't this problem solved entirely by consistent hashing? Even for a non-distributed hash table, you can "distribute" your table over a collection of fixed-size blocks in memory or on disk. Maximum cost of rehashing is the cost of splitting a block, which can be made arbitrarily small. The cost can even be paid in parallel in non-pathological scenarios with a push-down/write-through splitting mechanism.
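A minimal sketch of the consistent-hashing idea: blocks sit on a hash ring, each key belongs to the next block clockwise, and adding a block only takes keys from its immediate neighbor (i.e. the "split one block" cost). The class and method names are hypothetical.

```python
import bisect
import hashlib

class ConsistentRing:
    """Sketch of consistent hashing over fixed-size blocks: adding a
    block moves only the keys that fall between it and its neighbor."""

    def __init__(self, blocks):
        self._ring = sorted((self._h(b), b) for b in blocks)

    @staticmethod
    def _h(s):
        # stable hash (Python's hash() is randomized across runs)
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def block_for(self, key):
        points = [p for p, _ in self._ring]
        # first ring point >= the key's hash, wrapping around at the end
        i = bisect.bisect(points, self._h(key)) % len(self._ring)
        return self._ring[i][1]

    def add_block(self, block):
        bisect.insort(self._ring, (self._h(block), block))
```

With this layout, growing the table never rehashes everything: each new block only claims the keys that hashed into its arc of the ring.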


Yes, rehashing is very frustrating, especially because the default hash tables in most languages have this issue (e.g. std::unordered_map). It means you may need up to 3x the memory of what you actually want to store: when the hash table resizes, it has to allocate a new table twice the current size while keeping the original table in memory.


That doesn't happen in databases. The standard incremental hashing algorithms (linear and extendible hashing) were devised for databases. Both of these algorithms split only one bucket at a time and thus avoid large allocations.
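For illustration, here is a toy linear hashing table: on overflow it splits exactly one bucket (the one at the split pointer), in order, doubling the address space one bucket at a time. This follows the textbook scheme with level/split pointers; the load-factor threshold and names are arbitrary choices for the sketch.

```python
class LinearHashTable:
    """Sketch of linear hashing: growth happens one bucket split at a
    time, so no insert ever triggers a full-table rehash."""

    def __init__(self, n0=4, max_load=2.0):
        self.n0 = n0            # initial bucket count
        self.level = 0          # completed doubling rounds
        self.split = 0          # next bucket to split
        self.max_load = max_load
        self.buckets = [[] for _ in range(n0)]
        self.count = 0

    def _addr(self, key):
        h = hash(key)
        i = h % (self.n0 * (2 ** self.level))
        if i < self.split:      # already-split buckets use next round's hash
            i = h % (self.n0 * (2 ** (self.level + 1)))
        return i

    def _split_one(self):
        # split exactly one bucket: cheap, bounded work per overflow
        idx = self.split
        entries = self.buckets[idx]
        self.buckets[idx] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 * (2 ** self.level):
            self.level += 1
            self.split = 0
        for k, v in entries:    # redistribute only this bucket's entries
            self.buckets[self._addr(k)].append((k, v))

    def put(self, key, value):
        b = self.buckets[self._addr(key)]
        for j, (k, _) in enumerate(b):
            if k == key:
                b[j] = (key, value)
                return
        b.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def get(self, key, default=None):
        for k, v in self.buckets[self._addr(key)]:
            if k == key:
                return v
        return default
```

Note how `_split_one` only touches the entries of the bucket being split, which is exactly the "no large allocations" property the comment describes.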



