
I think this article misses two points more important than anything else mentioned.

1) B-trees are dynamic while hash tables are not. A B-tree grows gracefully across orders of magnitude, and shrinks just as easily. Most hash tables do not have this property. Extensible hashing and linear hashing do grow and shrink gracefully, but I'm not sure how widely used they are.

2) In a database system, the concern was traditionally to minimize page accesses. CPU is not negligible, and now there are main memory databases, as well as RAM sizes much larger than they were in the 70s and 80s. However, a lot of the main ideas in traditional OLTP databases were developed a long time ago, and the resulting architecture is still very much in use. So how many page reads to do a key lookup in a hash table? One. End of story. How many page reads to do a key lookup in a B-tree? Or to be more precise, a B+-tree, which has a much higher branching factor? Probably one. The root, and most likely the second level of pages stay cached, so it's really one page access to get a 3rd-level page. And, of course, as mentioned in the article, B-trees give you sequential access in key order, for when that's important.
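As a back-of-the-envelope check on that "probably one" claim (the fan-out of 300 below is an assumed figure; real values depend on page and key sizes):

```python
# Illustrative only: assume ~300 separator keys fit in one B+-tree page.
fanout = 300
upper_level_pages = 1 + fanout   # root + second level: 301 pages, easily cached
keys_in_three_levels = fanout ** 3
print(upper_level_pages)         # pages that stay hot in cache
print(keys_in_three_levels)      # 27,000,000 keys reachable with ~1 uncached read
```

So for tens of millions of keys, the upper two levels fit comfortably in memory and a lookup pays roughly one real page access, matching the hash table.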



On modern hardware a key lookup in a hash table isn't necessarily a single page read! Sure, it's a single virtual memory access, but if that page isn't in your TLB you need to read the page table... and if the page containing that part of the page table isn't in the TLB you need to read that page...

On modern hardware, every memory access looks very much like a B-tree lookup.


The page-table hierarchy is not a B-tree; it's a trie (radix tree) on all of these CPUs. Very different layout (not balanced, and with fixed-size nodes), much faster on the happy path.

Making a B-tree over a 48-bit address space would have a bit of a memory problem: the tables themselves would get huge.

And then the CPU cache is an array. A single virtual memory access is bound to be a single physical one as well, with minor exceptions when NUMA nodes are crossed.


There's a reason I said "looks very much like" rather than "is". ;-)


But the point remains: walking the cache hierarchy once for a hash table is faster than walking it O(log n) times for a B-tree.


> and if the page containing that part of the page table isn't in the TLB you need to read that page...

I thought page tables used physical addresses, which are accessed directly without any TLB lookup (except when nested paging during virtualization, which adds another level of indirection). Of course, the processor still needs to read each level into the data cache(s) while doing a page table walk.


Typically. Although on a virtualized Arm what the guest views as a physical address is really an intermediate physical address that must be translated by the second stage MMU. So it’s possible that reading the first stage page tables can cause a page fault in the second stage MMU. I suspect modern x86 works similarly, but I’m less familiar with that.


Page tables can have multiple levels. For example, on x86_64 you'd have 4 levels, i.e. the virtual->physical mapping is implemented as a tree of depth 4, where each leaf and internal node of the tree is 4 KB (the page size). (As usual, the details are more complicated than that.)
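A toy model of how that depth-4 tree is indexed, assuming the standard 48-bit layout (9 index bits per level, 12 offset bits); this is a sketch of the address arithmetic, not of real hardware:

```python
# Hypothetical model of x86_64 4-level paging: a 48-bit virtual address is
# four 9-bit table indices (PML4 down to the page table) plus a 12-bit
# offset into the 4 KB page.
def split_virtual_address(va):
    offset = va & 0xFFF                            # low 12 bits
    indices = [(va >> (12 + 9 * lvl)) & 0x1FF      # 9 bits per level
               for lvl in (3, 2, 1, 0)]            # top level first
    return indices, offset
```

For example, virtual address 0x1000 (the second page) splits into indices [0, 0, 0, 1] with offset 0.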


Yes, and each level of the tree has the physical address of the next level, so no TLB lookup is necessary (the top of the tree, in the TTBRn or equivalent registers, is also a physical address).


oh yeah, I totally misread the comment: a page fetch is required, but that has nothing to do with the TLB indeed


what's the difference? "page fetch" is not really something that can be googled on the web.


the TLB is just one element of the process that resolves a virtual address into a physical one: it's a cache that holds the most recently resolved translations.

When the virtual address you're looking to resolve is not present in that cache (i.e. when you have a TLB miss), the CPU falls back to walking the page table hierarchy. At each level of the tree, the CPU reads the physical address of the next level and performs a memory fetch of that page table entry (in my previous comment I erroneously said a "page fetch", but it's actually only a cache-line-sized fetch), and so on until it reaches a leaf of the tree: the Page Table Entry containing the physical address of the (4 KB) physical page associated with the virtual page you wanted to resolve.
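That fallback path can be sketched as follows (a toy model: page tables as nested dicts and the TLB as a flat cache keyed by virtual page number; real hardware stores neither this way):

```python
# Toy model of address translation with a TLB in front of a 4-level walk.
PAGE_SHIFT, LEVELS = 12, 4
tlb = {}  # virtual page number -> physical frame base

def translate(va, root_table):
    vpn, offset = va >> PAGE_SHIFT, va & 0xFFF
    if vpn in tlb:                                # TLB hit: no walk needed
        return tlb[vpn] | offset
    table = root_table                            # TLB miss: walk the tree,
    for lvl in range(LEVELS - 1, -1, -1):         # one memory fetch per level
        index = (va >> (PAGE_SHIFT + 9 * lvl)) & 0x1FF
        table = table[index]
    tlb[vpn] = table                              # cache the translation
    return table | offset
```

With a one-entry table mapping virtual page 1 to frame 0xABC000 (`{0: {0: {0: {1: 0xABC000}}}}`), the first lookup walks all four levels and subsequent lookups in that page hit the TLB.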


If huge pages are used, it's very likely the translation is cached in the TLB.


Depends on your workload and how many TLB entries your CPU has for superpages. The Zen 2 TLB can hold tons (1000s) of 2MB superpages but relatively few (64) 1GB superpages. Older CPU models had worse capacity for 1GB and 2MB superpages. E.g., Haswell (2013) had only 32 entries for 2MB superpages and 4 entries for 1GB superpages (data).


In addition to the limited number of cache slots available for superpages (varies depending on cpu), remember that those can be invalidated (again, depending on cpu). If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.


> If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.

Is that really likely on a production database server?


Depends on the design. At a minimum you're ping-ponging between userland and kernel; but you might also be bouncing between a transport layer unwrapper, an authentication front-end, the database core, and a storage back-end.


Ideally, with io_uring(2)/DPDK/SR-IOV-NVMe, you can skip enough syscalls to drop their performance impact below 1%.


Spectre may have changed things, but for a long time a user->kernel switch didn't imply a full TLB flush.


And that’s assuming you’ve missed several layers of cache between the CPU and main (virtual) memory!


This is not how TLB lookup works


The OP is talking about disk page reads, not virtual memory pages.


Both of your points (very valid) could perhaps be wrapped up as:

Indexes have lower computational complexity for certain tasks (in particular range queries & sorting), but higher constants than typical hashing.

Thus indexes make more sense with two-tiered memory: caching in RAM metadata about data kept in block storage.

How that could be put to good use with the L1 cache being much faster than RAM is anyone's guess[1].

--

[1] without guessing, we already know a small interpreter that fits in L1 can be blazing fast.


Rust's default (edit: ordered) map structure is an in-memory B-tree, fwiw.


What do you mean by this? There is no default map type. The standard library has types such as BTreeMap and HashMap, that pretty much are what it says on the tin.


Ordered map.


Didn't the article address 1) with this?:

> Another difference is that hash tables only provide average constant time accesses. Bad behaviour can be caused by hash collisions, either unintentional or malicious, but a bigger issue is rehashing. Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table. The impact of this can be reduced by using multi-level hash tables, but it still causes some unpredictability. Trees, on the other hand, have worst case performance of O(log n).


No. This discussion is about algorithmic complexity, typically counting comparisons. And of course, it ignores constants. If we're counting page accesses, we care very much about the constants.

And furthermore, a "multi-level hash table" is a tree, except an unordered one. What a useless beast!


Yes, mostly. There are better structures with these characteristics like a skiplist, multihashed table (typically double hashed) or extensible hashing.


>> What a useless beast!

Mostly but not entirely. There are cases where you need a tree but you don't care about order, e.g. when order depends on a state that is unknown at time of writing or when all you want to encode are relationships.


Also, JSON and its ilk.


"Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table."

This is my impression too, and it seems to imply that "hash tables have O(1) access" is bullshit as a general formulation. The situation is essentially that hash tables can have O(1) access time at a certain size, if tuned for that size. BUT that seems an abuse of big-O notation, which is meant (originally, in math) to describe behaviour "as you go to infinity", and you can't really have a data structure that maintains O(1) as its size increases to infinity.


> you can't really have a data structure that maintains O(1) as its size increases to infinity.

At the level of abstraction at which complexity analysis is usually done, you can. For example, a lookup in an array (a[x]) has O(1) worst-case complexity no matter how large the array is.

Of course, this relies on a few abstractions: that random memory access has uniform cost, and that basic numeric operations have uniform cost regardless of operand size. In real computers, both of these assumptions are wrong (adding terabyte-sized numbers is not as fast as adding byte-sized numbers, large amounts of memory can't usually be stored on random-access hardware, etc.), but that does not necessarily reduce the usefulness of complexity analysis.

For hash tables in particular, the worst-case complexity of key lookup depends on some guarantees from the hash algorithm. If you had a magical hash that is unique for every key in your domain, then lookup would be O(1) for a hash table of n elements, as n approaches infinity. Since that is not a good abstraction, we need to model the table using hash functions that can create collisions, and the worst case is that all n elements collide and land in the same bucket, giving worst-case lookup complexity of O(n). Adding elements to the hash table is a different operation, same as for the array.


> that seems an abuse of big-O notation

Formally, it's amortized O(1) time with a worst case of O(n).

It's not abuse. Many algorithms and data structures have different average and worst-case performance; for example, quicksort is O(n log n) on average but O(n^2) in the worst case. People just ignore the worst case unless worst-case guarantees are especially important for their system.


The actual analysis is that inserts take amortized O(1), with a single-operation worst case of O(n).

So n operations take time proportional to n as n goes to infinity, but any individual operation may be slower or faster.
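The amortized claim can be checked by counting element moves for a table that doubles when full (a sketch; `total_moves` is a made-up helper, not a real hash table):

```python
# Count total element moves for n inserts into a table that doubles on growth.
def total_moves(n):
    capacity, size, moves = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            moves += size        # O(size) rehash/copy when the table grows
            capacity *= 2
        size += 1
        moves += 1               # the insert itself
    return moves
```

The copies sum to less than 2n (a geometric series: 1 + 2 + 4 + ... < 2n), so total work stays under 3n, i.e. amortized O(1) per insert.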


Growing hash tables only takes amortized O(1); you just do two operations per logical operation, which is still O(1). One search term is "incremental resizing."

https://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing


In practical applications, costs are often not amortized.

If lucky transactions are more than fast enough but unlucky transactions time out because they happened to trigger a rebuild, that's a defect that can't be excused by good average performance.

Most databases need to take care of worst case performance with techniques that increase complexity and degrade best-case and aggregate performance (e.g. maintaining two hash tables, with lookup accessing both, insertions into the preferred one, and background threads that move records from the old table to the new table).
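A minimal sketch of that two-table scheme (the class name and migrate-per-operation constant are invented for illustration; a real implementation would gate migration on load factor and use background threads):

```python
# Hypothetical two-table incremental resize: lookups check both tables,
# inserts go to the new one, and each operation migrates a few entries so
# no single call ever pays for a full rehash.
class TwoTableDict:
    MIGRATE_PER_OP = 4

    def __init__(self, old_entries):
        self.old = dict(old_entries)   # table being drained
        self.new = {}                  # preferred table

    def _migrate_some(self):
        for _ in range(min(self.MIGRATE_PER_OP, len(self.old))):
            k, v = self.old.popitem()
            self.new[k] = v

    def put(self, key, value):
        self.old.pop(key, None)        # avoid a stale duplicate in the old table
        self.new[key] = value
        self._migrate_some()

    def get(self, key):
        self._migrate_some()
        if key in self.new:
            return self.new[key]
        return self.old[key]           # fall back to the old table
```

The cost is exactly the degradation described above: every lookup may probe two tables, in exchange for bounded per-operation latency.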


I may have misspoken and should have left out "amortized," perhaps. It doesn't sound like you're familiar with incremental resizing or other real-time techniques?[1]

> Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.

[1]: https://en.wikipedia.org/wiki/Hash_table#Alternatives_to_all...


It's amortized O(1) even without any special incremental resizing. I think the goal of incremental resizing is to get closer to actual O(1).

But I'm not sure incremental resizing can get to actual O(1). Is the malloc() call for the new memory O(1)? And then after malloc() wouldn't the memory have to be initialized somehow (maybe just 0-initialized)?


malloc() isn't O(1), but allocations in a normal GC are O(1).


Is there some reason that malloc() would be slower asymptotically than a GC? Why couldn't whatever technique is used for the fast GC be used for malloc() as well?


GCs are typically allowed to (and often do) move objects, while malloc/free do not. This means that malloc/free must maintain data structures tracking allocated vs. free space, while compacting GCs keep all the free space in a single contiguous address space, so allocating an object just increments the "start of free memory" pointer.
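A sketch of that bump-pointer allocation (`BumpAllocator` is a hypothetical model; a real compacting GC would trigger a collection where this raises):

```python
# Bump-pointer allocation as used by compacting GCs: all free space is one
# contiguous region, so allocating is just incrementing a pointer.
class BumpAllocator:
    def __init__(self, heap_size):
        self.top = 0            # start of free memory
        self.end = heap_size

    def alloc(self, nbytes):
        if self.top + nbytes > self.end:
            raise MemoryError("a real GC would collect and compact here")
        addr = self.top
        self.top += nbytes      # O(1): bump the pointer
        return addr
```

Successive allocations simply hand out consecutive addresses, which also gives good locality for objects allocated together.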


Of course, that just moves the cost from the allocation side to the GC side. This means that the GC must maintain data structures tracking the type of each object (to know which fields are references), the root references, and several others (like card tables) for an efficient implementation. For malloc/free, releasing an object can just add it to a free list (a pair of pointer writes), while a compacting GC has to update all references pointing to the live objects it moves.


Yep! And this is a pretty specific design element that I don't think gets appropriate attention when GCs are introduced as a concept. There's a big difference between saying malloc vs. GC does "more work" or "less work" and implying faster or slower.

Doing more work, but elsewhere, can (possibly) make your program faster. GCs often eat more CPU and RAM, which hurts when you need more of either. But C/C++ can occasionally put malloc pressure on your critical path, hurting you even when you have plenty of resources, where a GC might actually help.

If it was any simpler it wouldn't be so damn interesting and fun.


malloc implementations often involve some kind of search to find a free area that can satisfy the request. See http://www.gii.upv.es/tlsf/ for an O(1) implementation: keep a large number of free lists, one per size class, and use bit-scan instructions on the block size to select the list. To malloc, grab the first entry and put any remaining portion back on the appropriate free list.

To free, add the block to the appropriate free list. Assume blocks participate in a doubly-linked list of all blocks, used and free alike, in address order - adjacent free blocks can be coalesced at this point.
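A rough sketch of that size-class selection (a simplification of TLSF: power-of-two classes only, no remainder splitting or coalescing; `TLSFLike` is an invented name, and Python's `bit_length`/`mask & -mask` stand in for hardware bit-scan instructions):

```python
# Segregated free lists with a bitmap of non-empty classes, so malloc is a
# bit scan plus a list pop rather than a search.
class TLSFLike:
    def __init__(self, nclasses=32):
        self.lists = [[] for _ in range(nclasses)]
        self.bitmap = 0                          # bit c set => list c non-empty

    def _class(self, size):
        return (size - 1).bit_length()           # smallest class whose blocks all fit

    def free(self, addr, size):
        c = size.bit_length() - 1                # class this block belongs to
        self.lists[c].append((addr, size))
        self.bitmap |= 1 << c

    def malloc(self, size):
        mask = self.bitmap & ~((1 << self._class(size)) - 1)  # big-enough classes
        if not mask:
            return None
        c = (mask & -mask).bit_length() - 1      # bit scan: lowest suitable class
        addr, bsize = self.lists[c].pop()
        if not self.lists[c]:
            self.bitmap &= ~(1 << c)
        return addr                              # real TLSF would split the remainder
```

Both `free` and `malloc` do a constant amount of work regardless of how many blocks are on the lists, which is the point of the design.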


Relational databases mostly do range queries (e.g. SELECT a WHERE b > ?). Hash tables suck for that since they in general reshuffle (hash) the keys. In a tree structure, a 1M-key range access is 1 random lookup plus 1M sequential (cache-friendly) accesses. In a hash table, that becomes 1M random lookups, and not cache-friendly ones.
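The access-pattern difference can be sketched with a sorted array standing in for the tree (using `bisect` for the single binary search; the function names are illustrative):

```python
import bisect

sorted_keys = list(range(0, 1000, 2))          # ordered index (B-tree-like)
hashed = set(sorted_keys)                      # hash index

def range_query_sorted(lo, hi):
    i = bisect.bisect_left(sorted_keys, lo)    # one "random lookup"
    j = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[i:j]                    # then sequential, cache-friendly

def range_query_hashed(lo, hi):
    # no order to exploit: must examine every key, then sort the survivors
    return sorted(k for k in hashed if lo <= k <= hi)
```

Both return the same answer, but the hashed version touches every key in the table no matter how narrow the range is.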

See section IV 4 E, "range queries" in "A comparison of adaptive radix trees and hash tables" (2015)

https://15721.courses.cs.cmu.edu/spring2019/papers/08-oltpin...

Even against the best hash-table-based indexing, B+-trees win. And combined with the big cost of resizing (growing) or rehashing a table (rebalancing, e.g. Cuckoo), hash tables are not useful here.

Perhaps some adaptive hashing method could take care of this in the future.


I don't think this is correct (about mostly range queries). Scan the code base of any application, and I predict you will see 90%+ queries not involving an inequality. That is for OLTP. For analytics, the answer might be different.


Yup, aligning the B-Tree nodes to database pages (which are themselves aligned to physical disk blocks) is the key.

No pun intended ;)


> In a database system, the concern was traditionally to minimize page accesses

Is it really about memory access or disk loads?


Disk.

Though B-trees are friendly to CPU caches too, it's really about minimising disk accesses.


Why can’t hash tables shrink and grow gracefully?


Because items are placed into buckets based on their hash, and as you add or remove buckets you need to redistribute items.

I guess it depends what your definition is of ‘graceful’.


A b-tree grows and shrinks gracefully because each update modifies O(log n) pages. And the base of the logarithm is pretty big.

A typical hash table is not graceful in this way, because while each insert is O(1), you occasionally need to rehash everything into a bigger table, and that is O(n). As I noted above, linear and extensible hashing avoid this problem. An insert that forces a reorg just reorgs one bucket.
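A sketch of that one-bucket-at-a-time behaviour in linear hashing (simplified: it splits on every insert rather than gating on load factor; `LinearHash` is an invented name):

```python
# Linear hashing: growth splits exactly one bucket per expansion step, so no
# single insert ever pays for a full-table rehash.
class LinearHash:
    def __init__(self):
        self.level = 0          # current round: base table size is 2**level
        self.split = 0          # next bucket to split this round
        self.buckets = [[]]

    def _index(self, key):
        i = hash(key) % (2 ** self.level)
        if i < self.split:      # bucket already split: use the finer hash
            i = hash(key) % (2 ** (self.level + 1))
        return i

    def insert(self, key):
        self.buckets[self._index(key)].append(key)
        self._split_one()       # real code would only split past a load factor

    def _split_one(self):
        self.buckets.append([])             # new bucket at 2**level + split
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.split += 1
        if self.split == 2 ** self.level:   # round finished: double the base
            self.level += 1
            self.split = 0
        for k in old:                       # redistribute just this one bucket
            self.buckets[self._index(k)].append(k)

    def contains(self, key):
        return key in self.buckets[self._index(key)]
```

Each insert's reorganization cost is bounded by the size of one bucket, which is the "graceful" property being discussed.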


It might be down to memory/space allocation after all.

All hash tables require rehashing above a certain fill rate; otherwise lookups regress to linear. For the two most popular implementations, linked-list (chaining) based and open addressing based, the common rehashing strategy is to create another instance of the hash table and copy the key-value pairs over. On disk, this would be disastrous.


They can. See linear hashing and extensible hashing.



