
I think this article misses two points more important than anything else mentioned.

1) B-trees are dynamic while hash tables are not. A B-tree grows gracefully across orders of magnitude, and shrinks just as easily. Most hash tables do not have this property. Extensible hashing and linear hashing do grow and shrink gracefully, but I'm not sure how widely used they are.

2) In a database system, the concern was traditionally to minimize page accesses. CPU is not negligible, and now there are main memory databases, as well as RAM sizes much larger than they were in the 70s and 80s. However, a lot of the main ideas in traditional OLTP databases were developed a long time ago, and the resulting architecture is still very much in use. So how many page reads to do a key lookup in a hash table? One. End of story. How many page reads to do a key lookup in a B-tree? Or to be more precise, a B+-tree, which has a much higher branching factor? Probably one. The root, and most likely the second level of pages stay cached, so it's really one page access to get a 3rd-level page. And, of course, as mentioned in the article, B-trees give you sequential access in key order, for when that's important.
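As a back-of-the-envelope check on that "probably one" claim (the fan-out of 300 below is an assumed figure; real values depend on page and key sizes):

```python
# Illustrative only: assume ~300 separator keys fit in one B+-tree page.
fanout = 300
upper_level_pages = 1 + fanout   # root + second level: 301 pages, easily cached
keys_in_three_levels = fanout ** 3
print(upper_level_pages)         # pages that stay hot in cache
print(keys_in_three_levels)      # 27,000,000 keys reachable with ~1 uncached read
```

So for tens of millions of keys, the upper two levels fit comfortably in memory and a lookup pays roughly one real page access, matching the hash table.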



On modern hardware a key lookup in a hash table isn't necessarily a single page read! Sure, it's a single virtual memory access, but if that page isn't in your TLB you need to read the page table... and if the page containing that part of the page table isn't in the TLB you need to read that page...

On modern hardware, every memory access looks very much like a B-tree lookup.


The page-table hierarchy is not a B-tree; it's a trie (radix tree) on all of these CPUs. Very different layout (not balanced, and with fixed-size nodes), much faster on the happy path.

Making a B-tree over a 48-bit address space would have a bit of a memory problem: the tables themselves would get huge.

And then the CPU cache is an array. A single virtual memory access is bound to be a single physical one as well, with minor exceptions when NUMA nodes are crossed.


There's a reason I said "looks very much like" rather than "is". ;-)


But the point remains: walking the cache hierarchy once for a hash table is faster than walking it O(log n) times for a B-tree.


> and if the page containing that part of the page table isn't in the TLB you need to read that page...

I thought page tables used physical addresses, which are accessed directly without any TLB lookup (except when nested paging during virtualization, which adds another level of indirection). Of course, the processor still needs to read each level into the data cache(s) while doing a page table walk.


Typically. Although on a virtualized Arm what the guest views as a physical address is really an intermediate physical address that must be translated by the second stage MMU. So it’s possible that reading the first stage page tables can cause a page fault in the second stage MMU. I suspect modern x86 works similarly, but I’m less familiar with that.


Page tables can have multiple levels. For example, on x86_64 you'd have 4 levels, i.e. the virtual->physical mapping is implemented as a tree of depth 4, where each leaf and internal node of the tree is 4 KB (the page size). (As usual, the details are more complicated than that.)
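A toy model of how that depth-4 tree is indexed, assuming the standard 48-bit layout (9 index bits per level, 12 offset bits); this is a sketch of the address arithmetic, not of real hardware:

```python
# Hypothetical model of x86_64 4-level paging: a 48-bit virtual address is
# four 9-bit table indices (PML4 down to the page table) plus a 12-bit
# offset into the 4 KB page.
def split_virtual_address(va):
    offset = va & 0xFFF                            # low 12 bits
    indices = [(va >> (12 + 9 * lvl)) & 0x1FF      # 9 bits per level
               for lvl in (3, 2, 1, 0)]            # top level first
    return indices, offset
```

For example, virtual address 0x1000 (the second page) splits into indices [0, 0, 0, 1] with offset 0.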


Yes, and each level of the tree has the physical address of the next level, so no TLB lookup is necessary (the top of the tree, in the TTBRn or equivalent registers, is also a physical address).


oh yeah, I totally misread the comment: a page fetch is required, but that has nothing to do with the TLB indeed


what's the difference? "page fetch" is not really something that can be googled on the web.


the TLB is just one element of the process that resolves a virtual address into a physical one: it's a cache that holds the most recently resolved translations.

When the virtual address you're looking to resolve is not present in that cache (i.e. when you have a TLB miss), the CPU falls back to walking the page table hierarchy. At each level of the tree, the CPU reads the physical address of the next level and performs a memory fetch of that page table entry (in my previous comment I erroneously said a "page fetch", but it's actually only a cache-line-sized fetch), and so on until it reaches a leaf of the tree: the Page Table Entry containing the physical address of the (4 KB) physical page associated with the virtual page you wanted to resolve.
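That fallback path can be sketched as follows (a toy model: page tables as nested dicts and the TLB as a flat cache keyed by virtual page number; real hardware stores neither this way):

```python
# Toy model of address translation with a TLB in front of a 4-level walk.
PAGE_SHIFT, LEVELS = 12, 4
tlb = {}  # virtual page number -> physical frame base

def translate(va, root_table):
    vpn, offset = va >> PAGE_SHIFT, va & 0xFFF
    if vpn in tlb:                                # TLB hit: no walk needed
        return tlb[vpn] | offset
    table = root_table                            # TLB miss: walk the tree,
    for lvl in range(LEVELS - 1, -1, -1):         # one memory fetch per level
        index = (va >> (PAGE_SHIFT + 9 * lvl)) & 0x1FF
        table = table[index]
    tlb[vpn] = table                              # cache the translation
    return table | offset
```

With a one-entry table mapping virtual page 1 to frame 0xABC000 (`{0: {0: {0: {1: 0xABC000}}}}`), the first lookup walks all four levels and subsequent lookups in that page hit the TLB.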


If huge pages are used, it's very likely the translation is cached in the TLB.


Depends on your workload and how many TLB entries your CPU has for superpages. The Zen 2 TLB can hold tons (1000s) of 2MB superpages but relatively few (64) 1GB superpages. Older CPU models had worse capacity for 1GB and 2MB superpages. E.g., Haswell (2013) had only 32 entries for 2MB superpages and 4 entries for 1GB superpages (data).


In addition to the limited number of cache slots available for superpages (varies depending on cpu), remember that those can be invalidated (again, depending on cpu). If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.


> If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.

Is that really likely on a production database server?


Depends on the design. At a minimum you're ping-ponging between userland and kernel; but you might also be bouncing between a transport layer unwrapper, an authentication front-end, the database core, and a storage back-end.


Ideally, with io_uring(2)/DPDK/SR-IOV-NVMe, you can skip enough syscalls to drop their performance impact below 1%.


Spectre may have changed things, but for a long time a user->kernel switch didn't imply a full TLB flush.


And that’s assuming you’ve missed several layers of cache between the CPU and main (virtual) memory!


This is not how TLB lookup works


The OP is talking about disk page reads, not virtual memory pages.


Both of your points (very valid) could perhaps be wrapped up as:

Indexes have lower computational complexity for certain tasks (in particular range queries & sorting), but higher constants than typical hashing.

Thus indexes make more sense with two-tiered memory: caching in RAM metadata about data kept in block storage.

How that could be put to good use with the L1 cache being much faster than RAM is anyone's guess[1].

--

[1] without guessing, we already know a small interpreter that fits in L1 can be blazing fast.


Rust's default (edit: ordered) map structure is an in-memory B-tree, fwiw.


What do you mean by this? There is no default map type. The standard library has types such as BTreeMap and HashMap, that pretty much are what it says on the tin.


Ordered map.


Didn't the article address 1) with this?:

> Another difference is that hash tables only provide average constant time accesses. Bad behaviour can be caused by hash collisions, either unintentional or malicious, but a bigger issue is rehashing. Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table. The impact of this can be reduced by using multi-level hash tables, but it still causes some unpredictability. Trees, on the other hand, have worst case performance of O(log n).


No. This discussion is about algorithmic complexity, typically counting comparisons. And of course, it ignores constants. If we're counting page accesses, we care very much about the constants.

And furthermore, a "multi-level hash table" is a tree, except an unordered one. What a useless beast!


Yes, mostly. There are better structures with these characteristics like a skiplist, multihashed table (typically double hashed) or extensible hashing.


>> What a useless beast!

Mostly but not entirely. There are cases where you need a tree but you don't care about order, e.g. when order depends on a state that is unknown at time of writing or when all you want to encode are relationships.


Also, JSON and its ilk.


"Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table."

This is my impression too, and it seems to imply that "hash tables have O(1) access" is bullshit as a general formulation. The situation is essentially that hash tables can have O(1) access time at a certain size, if tuned for that size. BUT that seems an abuse of big-O notation, which is meant (originally, in math) to describe behaviour "as you go to infinity", and you can't really have a data structure that maintains O(1) as its size increases to infinity.


> you can't really have a data structure that maintains O(1) as its size increases to infinity.

At the level of abstraction at which complexity analysis is usually done, you can. For example, a lookup in an array (a[x]) has O(1) worst-case complexity no matter how large the array is.

Of course, this relies on a few abstractions: that random memory access has uniform cost, and that basic numeric operations have uniform cost regardless of operand size. In real computers, both of these assumptions are wrong (adding terabyte-sized numbers is not as fast as adding byte-sized numbers, large amounts of memory can't usually be stored on random-access hardware, etc.), but that does not necessarily reduce the usefulness of complexity analysis.

For hash tables in particular, the worst-case complexity of key lookup depends on some guarantees from the hash algorithm. If you had a magical hash that is unique for every key in your domain, then lookup would be O(1) for a hash table of n elements, as n approaches infinity. Since that is not a good abstraction, we need to model the table using hash functions that can create collisions, and the worst case is that all n elements collide and land in the same bucket, giving worst-case lookup complexity of O(n). Adding elements to the hash table is a different operation, same as for the array.


> that seems an abuse of big-O notation

Formally, it's amortized O(1) time with a worst case of O(n).

It's not abuse. Many algorithms and data structures have different average and worst-case performance; for example, quicksort is O(n log n) on average but O(n^2) in the worst case. People just ignore the worst case unless worst-case guarantees are especially important for their system.


The actual analysis is that inserts take amortized O(1), with a single-operation worst case of O(n).

So n operations take time proportional to n as n goes to infinity, but any individual operation may be slower or faster.
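The amortized claim can be checked by counting element moves for a table that doubles when full (a sketch; `total_moves` is a made-up helper, not a real hash table):

```python
# Count total element moves for n inserts into a table that doubles on growth.
def total_moves(n):
    capacity, size, moves = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            moves += size        # O(size) rehash/copy when the table grows
            capacity *= 2
        size += 1
        moves += 1               # the insert itself
    return moves
```

The copies sum to less than 2n (a geometric series: 1 + 2 + 4 + ... < 2n), so total work stays under 3n, i.e. amortized O(1) per insert.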


Growing hash tables only takes amortized O(1); you just do two operations per logical operation, which is still O(1). One search term is "incremental resizing."

https://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing


In practical applications, costs are often not amortized.

If lucky transactions are more than fast enough but unlucky transactions time out because they happened to trigger a rebuild, that's a defect that can't be excused by good average performance.

Most databases need to take care of worst case performance with techniques that increase complexity and degrade best-case and aggregate performance (e.g. maintaining two hash tables, with lookup accessing both, insertions into the preferred one, and background threads that move records from the old table to the new table).
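A minimal sketch of that two-table scheme (the class name and migrate-per-operation constant are invented for illustration; a real implementation would gate migration on load factor and use background threads):

```python
# Hypothetical two-table incremental resize: lookups check both tables,
# inserts go to the new one, and each operation migrates a few entries so
# no single call ever pays for a full rehash.
class TwoTableDict:
    MIGRATE_PER_OP = 4

    def __init__(self, old_entries):
        self.old = dict(old_entries)   # table being drained
        self.new = {}                  # preferred table

    def _migrate_some(self):
        for _ in range(min(self.MIGRATE_PER_OP, len(self.old))):
            k, v = self.old.popitem()
            self.new[k] = v

    def put(self, key, value):
        self.old.pop(key, None)        # avoid a stale duplicate in the old table
        self.new[key] = value
        self._migrate_some()

    def get(self, key):
        self._migrate_some()
        if key in self.new:
            return self.new[key]
        return self.old[key]           # fall back to the old table
```

The cost is exactly the degradation described above: every lookup may probe two tables, in exchange for bounded per-operation latency.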


I may have misspoken and should have left out "amortized," perhaps. It doesn't sound like you're familiar with incremental resizing or other real-time techniques?[1]

> Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.

[1]: https://en.wikipedia.org/wiki/Hash_table#Alternatives_to_all...


It's amortized O(1) even without any special incremental resizing. I think the goal of incremental resizing is to get closer to actual O(1).

But I'm not sure incremental resizing can get to actual O(1). Is the malloc() call for the new memory O(1)? And then after malloc() wouldn't the memory have to be initialized somehow (maybe just 0-initialized)?


malloc() isn't O(1), but allocations in a normal GC are O(1).


Is there some reason that malloc() would be slower asymptotically than a GC? Why couldn't whatever technique is used for the fast GC be used for malloc() as well?


GCs are typically allowed to (and often do) move objects, while malloc/free do not. This means that malloc/free must maintain data structures tracking allocated vs. free space, while compacting GCs keep all the free space in a single contiguous address space, so allocating an object just increments the "start of free memory" pointer.
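A sketch of that bump-pointer allocation (`BumpAllocator` is a hypothetical model; a real compacting GC would trigger a collection where this raises):

```python
# Bump-pointer allocation as used by compacting GCs: all free space is one
# contiguous region, so allocating is just incrementing a pointer.
class BumpAllocator:
    def __init__(self, heap_size):
        self.top = 0            # start of free memory
        self.end = heap_size

    def alloc(self, nbytes):
        if self.top + nbytes > self.end:
            raise MemoryError("a real GC would collect and compact here")
        addr = self.top
        self.top += nbytes      # O(1): bump the pointer
        return addr
```

Successive allocations simply hand out consecutive addresses, which also gives good locality for objects allocated together.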


Of course, that just moves the cost from the allocation side to the GC side. This means that the GC must maintain data structures tracking the type of each object (to know which fields are references), the root references, and several others (like card tables) for an efficient implementation. For malloc/free, releasing an object can just add it to a free list (a pair of pointer writes), while a compacting GC has to update all references pointing to the live objects it moves.


Yep! And this is a pretty specific design element that I don't think gets appropriate attention when GCs are introduced as a concept. There's a big difference between saying malloc vs. GC does "more work" or "less work" and implying faster or slower.

Doing more work, but elsewhere, can (possibly) make your program faster. GCs often eat more CPU and RAM, which hurts when you need more of either. But C/C++ can occasionally put malloc pressure on your critical path, hurting you even when you have plenty of resources, where a GC might actually help.

If it was any simpler it wouldn't be so damn interesting and fun.


malloc implementations often involve some kind of search to find a free area that can satisfy the request. See http://www.gii.upv.es/tlsf/ for an O(1) implementation: keep a large number of free lists, one per size class, and use bit-scan instructions on the block size to select the list. To malloc, grab the first entry and put any remaining portion back on the appropriate free list.

To free, add the block to the appropriate free list. Assume blocks participate in a doubly-linked list of all blocks, used and free alike, in address order - adjacent free blocks can be coalesced at this point.
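A rough sketch of that size-class selection (a simplification of TLSF: power-of-two classes only, no remainder splitting or coalescing; `TLSFLike` is an invented name, and Python's `bit_length`/`mask & -mask` stand in for hardware bit-scan instructions):

```python
# Segregated free lists with a bitmap of non-empty classes, so malloc is a
# bit scan plus a list pop rather than a search.
class TLSFLike:
    def __init__(self, nclasses=32):
        self.lists = [[] for _ in range(nclasses)]
        self.bitmap = 0                          # bit c set => list c non-empty

    def _class(self, size):
        return (size - 1).bit_length()           # smallest class whose blocks all fit

    def free(self, addr, size):
        c = size.bit_length() - 1                # class this block belongs to
        self.lists[c].append((addr, size))
        self.bitmap |= 1 << c

    def malloc(self, size):
        mask = self.bitmap & ~((1 << self._class(size)) - 1)  # big-enough classes
        if not mask:
            return None
        c = (mask & -mask).bit_length() - 1      # bit scan: lowest suitable class
        addr, bsize = self.lists[c].pop()
        if not self.lists[c]:
            self.bitmap &= ~(1 << c)
        return addr                              # real TLSF would split the remainder
```

Both `free` and `malloc` do a constant amount of work regardless of how many blocks are on the lists, which is the point of the design.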


Relational databases mostly do range queries (e.g. SELECT a WHERE b > ?). Hash tables suck for that since they in general reshuffle (hash) the keys. In a tree structure, a 1M-key range access is 1 random lookup plus 1M sequential (cache-friendly) accesses. In a hash table, that becomes 1M random lookups, and not cache-friendly ones.
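The access-pattern difference can be sketched with a sorted array standing in for the tree (using `bisect` for the single binary search; the function names are illustrative):

```python
import bisect

sorted_keys = list(range(0, 1000, 2))          # ordered index (B-tree-like)
hashed = set(sorted_keys)                      # hash index

def range_query_sorted(lo, hi):
    i = bisect.bisect_left(sorted_keys, lo)    # one "random lookup"
    j = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[i:j]                    # then sequential, cache-friendly

def range_query_hashed(lo, hi):
    # no order to exploit: must examine every key, then sort the survivors
    return sorted(k for k in hashed if lo <= k <= hi)
```

Both return the same answer, but the hashed version touches every key in the table no matter how narrow the range is.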

See section IV 4 E, "range queries" in "A comparison of adaptive radix trees and hash tables" (2015)

https://15721.courses.cs.cmu.edu/spring2019/papers/08-oltpin...

Even against the best hash-table-based indexing, B+-trees win. And combined with the big cost of resizing (growing) or rehashing a table (rebalancing, e.g. Cuckoo), hash tables are not useful here.

Perhaps some adaptive hashing method could take care of this in the future.


I don't think this is correct (about mostly range queries). Scan the code base of any application, and I predict you will see 90%+ queries not involving an inequality. That is for OLTP. For analytics, the answer might be different.


Yup, aligning the B-Tree nodes to database pages (which are themselves aligned to physical disk blocks) is the key.

No pun intended ;)


> In a database system, the concern was traditionally to minimize page accesses

Is it really about memory access or disk loads?


Disk.

Though B-trees are friendly to CPU caches too, it's really about minimising disk accesses.


Why can’t hash tables shrink and grow gracefully?


Because items are placed into buckets based on their hash, and as you add or remove buckets you need to redistribute items.

I guess it depends what your definition is of ‘graceful’.


A b-tree grows and shrinks gracefully because each update modifies O(log n) pages. And the base of the logarithm is pretty big.

A typical hash table is not graceful in this way, because while each insert is O(1), you occasionally need to rehash everything into a bigger table, and that is O(n). As I noted above, linear and extensible hashing avoid this problem. An insert that forces a reorg just reorgs one bucket.
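A sketch of that one-bucket-at-a-time behaviour in linear hashing (simplified: it splits on every insert rather than gating on load factor; `LinearHash` is an invented name):

```python
# Linear hashing: growth splits exactly one bucket per expansion step, so no
# single insert ever pays for a full-table rehash.
class LinearHash:
    def __init__(self):
        self.level = 0          # current round: base table size is 2**level
        self.split = 0          # next bucket to split this round
        self.buckets = [[]]

    def _index(self, key):
        i = hash(key) % (2 ** self.level)
        if i < self.split:      # bucket already split: use the finer hash
            i = hash(key) % (2 ** (self.level + 1))
        return i

    def insert(self, key):
        self.buckets[self._index(key)].append(key)
        self._split_one()       # real code would only split past a load factor

    def _split_one(self):
        self.buckets.append([])             # new bucket at 2**level + split
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.split += 1
        if self.split == 2 ** self.level:   # round finished: double the base
            self.level += 1
            self.split = 0
        for k in old:                       # redistribute just this one bucket
            self.buckets[self._index(k)].append(k)

    def contains(self, key):
        return key in self.buckets[self._index(key)]
```

Each insert's reorganization cost is bounded by the size of one bucket, which is the "graceful" property being discussed.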


It might be down to memory/space allocation after all.

All hash tables require rehashing above a certain fill rate; otherwise lookups regress to linear. For the two most popular implementations, linked-list (chaining) based and open addressing based, the common rehashing strategy is to create another instance of the hash table and copy the key-value pairs over. On disk, this would be disastrous.


They can. See linear hashing and extensible hashing.



