
One major issue with disk-based hash tables is rehashing. In-memory hash tables actually have the same problem, but most people don't notice it there. Imagine your database abruptly doubling in size because one inserted row overloaded a bucket.

Some hash tables that are designed to minimize the potential for DoS attacks and to amortize the rehashing work more evenly use gradual rehashing (Go's built-in map does this), but this is also not ideal for databases: the associated disk accesses are expensive, and it requires significantly more disk space while both tables coexist.
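To make the idea concrete, here is a minimal sketch of gradual rehashing in the style described above: instead of moving every entry at once, the table keeps the old bucket array around and drains a few buckets per operation. All names and the migration rate are hypothetical simplifications, not any particular implementation.

```python
class GradualHashTable:
    """Sketch of gradual (incremental) rehashing: a resize only *starts*
    a migration; each subsequent operation moves a bounded number of
    old buckets into the new array."""

    MIGRATE_PER_OP = 2  # buckets drained per insert/lookup (arbitrary choice)

    def __init__(self, nbuckets=8):
        self.old = None                              # table being drained, or None
        self.new = [[] for _ in range(nbuckets)]
        self.migrate_idx = 0
        self.count = 0

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _step_migration(self):
        if self.old is None:
            return
        for _ in range(self.MIGRATE_PER_OP):
            if self.migrate_idx >= len(self.old):
                self.old = None                      # migration finished
                return
            for k, v in self.old[self.migrate_idx]:
                self._bucket(self.new, k).append((k, v))
            self.old[self.migrate_idx] = []
            self.migrate_idx += 1

    def put(self, key, value):
        self._step_migration()
        if self.old is not None:                     # evict a stale copy, if any
            ob = self._bucket(self.old, key)
            for i, (k, _) in enumerate(ob):
                if k == key:
                    del ob[i]
                    self.count -= 1
                    break
        bucket = self._bucket(self.new, key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.old is None and self.count > 3 * len(self.new):
            # start (not finish) a resize: old table drains gradually
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.migrate_idx = 0

    def get(self, key, default=None):
        self._step_migration()
        tables = [self.new] + ([self.old] if self.old is not None else [])
        for t in tables:
            for k, v in self._bucket(t, key):
                if k == key:
                    return v
        return default
```

The point of the sketch is the worst-case bound: no single `put` ever moves more than `MIGRATE_PER_OP` buckets, at the price of keeping both arrays alive during a migration.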

Many real-world implementations simply don't rehash unless explicitly asked to, which makes them fairly inefficient for most use cases.



I've seen this happen in Azure Blob Storage: when a container hit about 10 million files, the rehash took 2 hours.


Isn't this problem solved entirely by consistent hashing? Even for a non-distributed hash table, you can "distribute" your table over a collection of fixed-size blocks in memory or on disk. Maximum cost of rehashing is the cost of splitting a block, which can be made arbitrarily small. The cost can even be paid in parallel in non-pathological scenarios with a push-down/write-through splitting mechanism.
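A minimal sketch of the consistent-hashing idea: blocks sit on a hash ring, each key belongs to the next block clockwise, and adding a block only takes keys from its immediate neighbor (i.e. the "split one block" cost). The class and method names are hypothetical.

```python
import bisect
import hashlib

class ConsistentRing:
    """Sketch of consistent hashing over fixed-size blocks: adding a
    block moves only the keys that fall between it and its neighbor."""

    def __init__(self, blocks):
        self._ring = sorted((self._h(b), b) for b in blocks)

    @staticmethod
    def _h(s):
        # stable hash (Python's hash() is randomized across runs)
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def block_for(self, key):
        points = [p for p, _ in self._ring]
        # first ring point >= the key's hash, wrapping around at the end
        i = bisect.bisect(points, self._h(key)) % len(self._ring)
        return self._ring[i][1]

    def add_block(self, block):
        bisect.insort(self._ring, (self._h(block), block))
```

With this layout, growing the table never rehashes everything: each new block only claims the keys that hashed into its arc of the ring.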


Yes, rehashing is very frustrating, especially because the default hash tables in most languages have this issue (e.g. std::unordered_map). It means you may need up to 3x the memory of what you actually want to store: when the hash table resizes, it has to allocate a new table twice the current size while keeping the original table in memory.


That doesn't happen in databases. The standard incremental hashing algorithms (linear and extendible hashing) were devised for databases. Both of these algorithms split only one bucket at a time and thus avoid large allocations.
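For illustration, here is a toy linear hashing table: on overflow it splits exactly one bucket (the one at the split pointer), in order, doubling the address space one bucket at a time. This follows the textbook scheme with level/split pointers; the load-factor threshold and names are arbitrary choices for the sketch.

```python
class LinearHashTable:
    """Sketch of linear hashing: growth happens one bucket split at a
    time, so no insert ever triggers a full-table rehash."""

    def __init__(self, n0=4, max_load=2.0):
        self.n0 = n0            # initial bucket count
        self.level = 0          # completed doubling rounds
        self.split = 0          # next bucket to split
        self.max_load = max_load
        self.buckets = [[] for _ in range(n0)]
        self.count = 0

    def _addr(self, key):
        h = hash(key)
        i = h % (self.n0 * (2 ** self.level))
        if i < self.split:      # already-split buckets use next round's hash
            i = h % (self.n0 * (2 ** (self.level + 1)))
        return i

    def _split_one(self):
        # split exactly one bucket: cheap, bounded work per overflow
        idx = self.split
        entries = self.buckets[idx]
        self.buckets[idx] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 * (2 ** self.level):
            self.level += 1
            self.split = 0
        for k, v in entries:    # redistribute only this bucket's entries
            self.buckets[self._addr(k)].append((k, v))

    def put(self, key, value):
        b = self.buckets[self._addr(key)]
        for j, (k, _) in enumerate(b):
            if k == key:
                b[j] = (key, value)
                return
        b.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def get(self, key, default=None):
        for k, v in self.buckets[self._addr(key)]:
            if k == key:
                return v
        return default
```

Note how `_split_one` only touches the entries of the bucket being split, which is exactly the "no large allocations" property the comment describes.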



