The Trie: A Neglected Data Structure

tmoertel · on Sept 13, 2013

The original article uses Java, which makes it harder to see how simple it can be to make and use prefix trees. For comparison, here's the Python code to create a prefix tree from a set of words:

    def prefix_tree(words):
        d = dict()
        for word in words:
            root = d
            for c in word:
                root = root.setdefault(c, {})
            root['$$'] = {}  # mark end of word with '$$' token
        return d

For example:

    >>> prefix_tree('foo bar baz'.split())
    {'b': {'a': {'r': {'$$': {}},
                 'z': {'$$': {}}}},
     'f': {'o': {'o': {'$$': {}}}}}

To query whether a word is present in a tree you basically compute

    reduce(dict.__getitem__, word, tree)

and test whether the result (a forest) has an end-of-word '$$' tree.

leif · on Sept 13, 2013

Tries (actually pronounced "trees" or "Patricia trees") are very memory efficient compared to most alternatives, but really only shine in membership query workloads. If you actually need to return the stored element, you have to rebuild it for every query, which is probably too expensive if you have lots of queries, it would be better to store the whole object continuously somewhere in the data structure so you could return a reference to it.

Even then, tries only win if the distribution is suitable to give you the memory efficiency you are hoping for. If you don't have many common prefixes, you're better off with just a hashtable.

nly · on Sept 13, 2013

If your key type is lexically sorted, you can also do prefix searches in a binary search tree.

barrkel · on Sept 13, 2013

Prefix searches and range queries in a large corpus with a lot of shared prefixes - a file system listing cache - have been my primary reason for using tries.

You can answer questions like "how many files does this directory have, immediately in it" and also "under it recursively" relatively quickly, and if you use a Patricia trie with a little cleverness in the compression of the text, it needn't take up much space at all even for millions of paths. Whereas a binary tree isn't so great at the memory saving bit.

acbellini · on Sept 13, 2013

Yes, I used that. It's what the ceiling() method in the binary tree in the article does, it looks for the prefix. Still, it's slower :-)

acbellini · on Sept 13, 2013

I have reported the "try" pronunciation because that's the way I have learnt them, and also because that's how Sedgewick's book explicitly uses, but I guess there are alternatives :-)

Anyway, you just got right the point that I was trying to make, a trie is definitely not a general purpouse data structure, but for the specific case (limited number of elements, many common prefixes etc) it performs very well. As for the point of returning the original object, it suffices to store the original object in end-word nodes. I didn't do that in this code because it would have added unnecessary code, since the problem was really just membership and prefixes.

In the specific case, hashtable wouldn't have worked, since it's not sorted and it doesn't solve the original problem.

chatman · on Sept 13, 2013

For set membership checks, bloom filters are way more memory efficient. And almost accurate. Lots of bloom filters of different sizes holding the same data can bring down the false positives to negligible.

GhotiFish · on Sept 13, 2013

Is having bloom filters of sizes 1024, 2048, and 4096 more dependable than one bloom filter of 8192?

That is surprising to me.

leif · on Sept 13, 2013

It's a similar strategy to having one bloom filter with many hash functions, but the math is a little different. Nearly the same effect though.

zheng · on Sept 13, 2013

Thanks, while I think the pronunciation is lame and confusing, I grow tired of explaining to people that it is actually correct, they aren't called "trys" or "treys".

nly · on Sept 13, 2013

Pronunciation can be a very individual or regional thing when it comes to technical abbreviations. I just finished watching the Channel9 GoingNative talks (C++), and was surprised to find a couple of the most respected C++ guys pronounce 'ptr' as 'put-er' (not 'putter') instead of 'pointer'.

Personally I'll continue to pronounce it as 'try'. It has, at least, fewer conflicting interpretations in a programming context. When I say 'tree', people will probably assume I mean a binary tree, not a radix tree. So, if I'm not going at 30% light speed, I'll just say that.

leif · on Sept 13, 2013

Mispronouncing variables is great fun. We have lots of "txn_XXX" which we pronounce "texan_XXX", and "le_XXX" (short for "leafentry XXX", as in "le_key"), which always make us sound like we're mocking French people.

nly · on Sept 13, 2013

Now you're getting it. I also pronounce 'char' from C as if it's short for charred, rather than character. Principally because I find it easier easier to say 'char star' with that vocalisation. I know one guy who, if speaking quickly, would pronounce 'char* sugar' as 'caster sugar'... come to think of it, I'm not sure how Americans pronounce 'caster', but in the UK it's car-stir sugar. Not 'Casper', as in the friendly ghost, but with a 't'.

I've never had a problem with people using different pronunciations. It certainly keeps talking about code fresh.

bjeanes · on Sept 13, 2013

cache is a fun one (cash or kay-sh?). Most Australians I know (and possibly people from the UK too) say it as "cash"

DigitalJack · on Sept 13, 2013

Under what rules of pronunciation would it be Kay-sh?

I recognize pronunciation is more about exceptions than rules, but even so...

gruseom · on Sept 13, 2013

Heh. When we have a variable name with a question mark to denote a boolean, as in "ready?", we pronounce the question mark "what" like an old-school English gentleman—so, "readywhat", or rather: "Ready, what?" It still cracks me up sometimes.

jloughry · on Sept 13, 2013

I was taught by Evi Nemeth (requiescat in pace) that trie pronounced like "try" and to use it that way.

leif · on Sept 13, 2013

The reason is because the name comes from the word "reTRIEval," which was a pretty dumb move in the first place.

scott_s · on Sept 13, 2013

The great thing about pronunciation is that whatever most people say, wins. I'm going to keep using the pronunciation that rhymes with "dyes" so we can avoid confusion with the more general structure.

Kiro · on Sept 13, 2013

They are called "trys" because that's how most people pronounce it.

tptacek · on Sept 14, 2013

I just call them all radix trees.

AaronFriel · on Sept 13, 2013

You're only looking at one kind of trie, there are also Judy arrays, which have much better performance characteristics and (if we're abusing the notation) the same worst-case memory complexity.

jemfinch · on Sept 13, 2013

Judy arrays are a straightforward, highly optimized implementation of a HAMT.

Unfortunately, they were highly optimized for 32-bit machines.

chatman · on Sept 13, 2013

A trie is a poor data structure in practice. There is no locality of reference that can be exploited: to lookup a single key, almost the entire length of the memory blocks occupied by the trie needs to be accessed, and almost at random (depending on the insertion technique). For keys inserted later on, the number of different blocks of memory accesses needed for a single key is proportional to the length of the key.

A B-Tree is better due to better exploitation of locality of reference in memory.

swannodette · on Sept 13, 2013

It depends on the kind of trie that you build - it's worth taking a look at Hash Array Mapped Tries http://lampwww.epfl.ch/papers/idealhashtrees.pdf

The Clojure programming language is completely designed around several immutable variants.

judofyr · on Sept 13, 2013

See also: HAT-trie by Askitis and Sinha.

Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components.

fogleman · on Sept 13, 2013

Should mention that tries can be heavily compressed into DAWGs (Directed Acyclic Word Graphs) by eliminating common subtrees. I've used this in word games on iOS where I want a small memory footprint.

P.S. I love tries!

nathell · on Sept 13, 2013

"How to Squeeze a Lexicon" by Ciura and Deorowicz [1] is a good and practical introduction to DAWGs.

[1]: http://sun.aei.polsl.pl/~mciura/publikacje/lexicon.pdf

tieTYT · on Sept 13, 2013

I think the article was well written. This is the standard example to use for a Trie. But I never run into any scenarios except for this that make me think, "A Trie is a good fit for this!" What are some other examples where Tries come in handy?

tmoertel · on Sept 13, 2013

I've used prefix trees to represent matching rules for network-routing rulesets. They're also handy when you need to compute intersections because you can perform a parallel depth-first search over two prefix trees and prune from the search any nodes that don't have a corresponding node in the other tree.

For a fun example of this last application, there was a recent Google Code Jam problem called "Alien Languages" [1]. My solution [2] basically counts the leaf nodes in the intersection of two prefix trees. (Note that we can compute the count on the fly during the search and need not actually construct the intersection.)

[1] http://code.google.com/codejam/contest/90101/dashboard#s=p0

[2] https://github.com/tmoertel/practice/blob/master/google-code...

zb · on Sept 13, 2013

When I was developing firmware for Layer 3 network switches we made very heavy use of Patricia Tries (it was originally pronounced like 'tree', BTW... stupid name) for the management of routing tables. They're a good fit because you typically have comparatively long common prefices, especially within your own network.

jdmichal · on Sept 13, 2013

I think the alternate name for tries, radix trees, provides a good hint to potential use-cases. Any problem where the data can be broken down and searched by radix would be a potential good fit. You could even break down data structures with a lot of limited-enumeration fields in a similar way, using a different field at each depth, and gaining deduplication for each node which follows the same path through the structure.

zem · on Sept 13, 2013

word games are definitely the big one [a dawg, which is a packed form of a trie, is even better; it trades off space efficiency for being 'final' in that you cannot modify it once packed. http://www.wutka.com/dawg.html].

but i've used a trie once to good effect to replace a red-black tree with a depth-3 trie whose leaves were red-black trees, trading off the increased space usage for the ability to update the rbtree entries in parallel.

supo · on Sept 13, 2013

With a basic Trie, you always have this problem of how to keep the pointers at each node. Naively, you can:

1) Keep an array of size equal to the alphabet size, indexed by the characters, holding pointers to successive nodes.

- but this blows the memory to O(N * alphabet_size) where N is the input size

2) Keep a linked list of pairs (character, pointer).

- but this blows the lookup time to O(word_length * alphabet_size)

3) Keep something like an STL map<character, pointer>.

- still suboptimal complexity of O(word_length * log alphabet_size) + it complicates the code + it dramatically increases the constant time per operation

Or you research a bit more and actually learn a way to properly implement a trie, called a Ternary Search Tree: http://en.wikipedia.org/wiki/Ternary_search_tree

eeperson · on Sept 13, 2013

Or even better, Array Mapped Tries: http://infoscience.epfl.ch/record/64394/files/triesearches.p...

anaphor · on Sept 13, 2013

Or you decide that your problem is small enough that the poor memory complexity doesn't matter much.

supo · on Sept 13, 2013

Sure, but it can blow up faster than you think. Say you have 100k URL strings, up to 128 chars each, with 64 allowed characters in alphabet.

- storing this at byte per char = 12.8MB

- storing this in a basic trie = 12.8 * 64 = 820MB

- storing this in the TST = 12.8 * 3 = 38MB

anaphor · on Sept 13, 2013

Sure, I was thinking on the order of <100 strings around 5-10 characters each at most. I agree that the TST is much better suited for large sets of strings. Also I think more developers should actually do the calculations even if they only have an estimate of how large the inputs will be.

supo · on Sept 13, 2013

Ah, in that case it goes straight to set<string> for me :-) I think that in practice you want to do that but then drop in a few asserts to check whether things are not getting out of control once you forgot about this decision.

jdmichal · on Sept 13, 2013

Is this taking into account any of the memory savings that would be seen from common paths using the same memory? If a lot of entries have similar prefixes, this can lead to a lot of deduplication.

supo · on Sept 13, 2013

Depends on the randomness of the paths. With URLs coming from a limited set of websites - yes, there would be some savings.

With a set of random strings, for alphabet with 64 chars, there are 16M different 4 character prefixes, so the savings from overlapping prefixes for 128char long strings is likely less than 3%.

k3n · on Sept 13, 2013

John Resig did a really cool series[1][2][3] of deep-dive posts into optimization in JS for mobile, which covered tries (among others) -- it was the first I'd heard of them. Fascinating read.

1. http://ejohn.org/blog/dictionary-lookups-in-javascript/

2. http://ejohn.org/blog/javascript-trie-performance-analysis/

3. http://ejohn.org/blog/revised-javascript-dictionary-search/

dnr · on Sept 13, 2013

If you like tries, take a look at the double-array trie: it's a pretty nifty (though somewhat complicated) way to greatly reduce the space usage of tries by interleaving nodes in the same block of memory. It also makes it easier to serialize a trie, so you can quickly load a pre-built one instead of re-doing all the inserts, which is useful for static dictionaries.

http://linux.thai.net/~thep/datrie/

GhotiFish · on Sept 13, 2013

Thanks for the link! I wish I'd known about this implementation.

munificent · on Sept 13, 2013

Tries haven't been entirely neglected. Quadtrees and octrees are their 2- and 3D analogues, and those are very common in games, renderers and other spatial simulations.

Scaevolus · on Sept 13, 2013

Quadtrees and octrees are far more related to binary space partitioning than tries.

pathikrit · on Sept 13, 2013

[DAWGs](http://en.wikipedia.org/wiki/Directed_acyclic_word_graph) are a special kind of Trie where similar child trees are compressed into single parents. I extended modified DAWGs and came up with a nifty data structure called ASSDAWG (Anagram Search Sorted DAWG). The way this works is whenever a string is inserted into the DAWG, it is bucket-sorted first and then inserted and the leaf nodes hold an additional number indicating which permutations are valid if we reach that leaf node from root. This has 2 nifty advantages:

Since I sort the strings before insertion and since DAWGs naturally collapse similar sub trees, I get high level of compression (e.g. "eat", "ate", "tea" all become 1 path a-e-t with a list of numbers at the leaf node indicating which permutations of a-e-t are valid). Searching for anagrams of a given string is super fast and trivial now as a path from root to leaf holds all the valid anagrams of that path at the leaf node using permutation-numbers.

anaphor · on Sept 13, 2013

I just used a Trie yesterday to implement some string matching that I needed to do in a tokenizer :) It's quite useful if you want to do string matching and you know ahead of time the exact list of strings you need to match with. My only issue with this article is that it explains them using Java, which I think just obscures the explanation too much, but I guess it's a Java blog.

ArbitraryLimits · on Sept 13, 2013

Some people say to use a perfect hash function in that case, IMO it's more trouble than it's worth.

nly · on Sept 13, 2013

Did you need full prefix and/or suffix string matching?

anaphor · on Sept 13, 2013

All I needed to do was match prefixes, and now that I think about it, what I actually did was closer to a ternary search tree. The problem was to match operators after matching an identifier, e.g. say the language has these operators ['+', '++', '%'], and you want to be very permissive with what's allowed to be a normal identifier, so "$$#3" is matched as an identifier, but say you have "b++45", then you want to be able to match ++ correctly. There are lots of ways I could've done it, but a trie-like data structure seemed like the easiest way to allow adding more operators easily.

tehwalrus · on Sept 13, 2013

The Dasher input method (aimed at people with disabilities who are able to enter data only by manipulating a pointer) uses a visualized trie with vertex areas proportional to linguistic probability:

http://www.inference.phy.cam.ac.uk/dasher/DasherSummary2.htm...

(Prof. Mackay, its author, uses it to teach information theory and compression, among other things.)

andrewparker · on Sept 13, 2013

Implementing the Trie (on paper, in C) was about 50% of my final exam on my intro to CS undergrad class at Stanford. We had never seen Tries befor that point, but we had done a variety of trees, so we had the necessary prior experience. It was a fun exam and a very memorable experience.

supo · on Sept 13, 2013

You went to Stanford and have never seen a Trie in highschool? There are a few highschools in Bratislava, Slovakia that teach basic data structures in their Intro to programming courses.

Pxtl · on Sept 13, 2013

North American high-school programming classes were not known for quality. Programming was not generally a high priority item for the school boards so teachers were often completely left on their own for curriculum, and often were amateur programmers at best (if programmers at all).

The end result is that university CS programs are run with the assumption that highschool students had zero prior experience with programming.

There's a reason that so many software geeks are hardcore libertarians and Bill Gates was fighting to reform teaching into a meritocracy - the educational system basically ignored their core skill-set and so the students were often self-taught. Obviously schools are playing catch-up now, but you can't change the past experiences of two generations of programmers.

rch · on Sept 13, 2013

The computer science AP classes at my high school were taught by the typing teacher, after she took a summer course. The instruction consisted entirely of the slides from that course, followed by time to work independently on pretty much whatever, in Pascal.

elliottcarlson · on Sept 13, 2013

People went to school at different times. When I was in high school there was no intro to programming courses - there were no computers in the school, besides some Mac Classics that were on some teachers desks. Since HN has a wide range of age groups, just because you had something available to you growing up, doesn't mean others did.

Oh how times have changed.

tehwalrus · on Sept 13, 2013

My various schools all had computers everywhere - I didn't learn programming until the 3rd year of my (physics) degree.

In secondary school ("high school"), The only computer subject they taught was IT, and people weren't allowed to take it unless they had a special reason not to take a foreign language. It was rumored to be ridiculously easy.

The UK is supposed to be bringing in a programming GCSE - I'll believe it when I see it.

marssaxman · on Sept 13, 2013

since when is a trie a "basic" data structure!?

RomanA · on Sept 13, 2013

I had to do the same (on paper, in C) for my final exam in CS1 at my Uni. Our professor had prepped us on how to implement Trie (insertion, searches, memory complexity), though.

awda · on Sept 13, 2013

Tries have hardly been neglected. "Path-compressed" tries were very recently added to FreeBSD[0] for use in the vm subsystem.

[0]: http://fxr.watson.org/fxr/source/sys/pctrie.h

sambeau · on Sept 13, 2013

Tries are great for caching URLs & other path-like data. I've found them to be faster (and smaller) than hashes for these purposes in the past.

However, this was back in the days of tiny caches. I suspect they may have lost the edge they had in 2003.

sspiff · on Sept 13, 2013

This contradicts chatman's comment at the top, who claims a trie has poor locality. Since you have some experience with tries, do you have an explanation on how to implement tries with good locality properties?

sambeau · on Sept 13, 2013

I'd take his word for it as my tests are 10 years old!

And my apologies - looking back it was Patricia trees I was using as you can take a pointer to a mid-point in the path: useful for disk caches etc.

I was using them for small dictionaries (mainly symbol table-size structures) comparing them against an early super-fast hash (and I recall for really small dictionaries linear search as I naively assumed that a linear search would win for tiny dictionaries - it didn't). In all my tests then (as I say, 2003, on a 3 year old IBM Thinkpad) the tries won.

nawitus · on Sept 13, 2013

> ~O(1) access time

What's the difference between O(1) and ~O(1)?

fournm · on Sept 13, 2013

I've seen ~O(1) as amortized constant time fairly frequently.

mhurron · on Sept 13, 2013

I've always seen ~ as used for about, so ~1h is somewhere around one hour.

I would read it as close to, but maybe not quite O(1).

msluyter · on Sept 13, 2013

I think the point may be that the fuzziness is already built into big-O notation. That is O(n) + C = O(n), where C is a constant.

nawitus · on Sept 13, 2013

If only set notation was more common with big-O notation. 'O(n) + C = O(n)' can be confusing, whereas 'f(x) ∈ O(n) ∧ f(x) ∈ O(n) + C' makes sense.

jlarocco · on Sept 13, 2013

The set notation you gave is redundant. There's always an implied "+C" in any O(...), so why waste space writing it out multiple times?

Furthermore, using precise set notation would, IMO, make it more difficult to use. A big blob of set notation would not be easy to read for most people.

nawitus · on Sept 13, 2013

>The set notation you gave is redundant. There's always an implied "+C" in any O(...),

No. O(n) and O(n)+C and different sets. However, if an algorithm belongs to the first set, it also belongs to the latter. The confusion comes from using the equality sign instead of the "element of" operator.

>Furthermore, using precise set notation would, IMO, make it more difficult to use. A big blob of set notation would not be easy to read for most people.

No, the only difference is that f(x) = O(n) becomes f(x) ∈ O(n).

jlarocco · on Sept 13, 2013

All big-O has an implied "+C", so O(n) is equivalent to O(n)+C by definition. Where is the confusion?

nawitus · on Sept 13, 2013

>All big-O has an implied "+C"

Can you state this in a mathematically precise notation?

>Where is the confusion?

The confusion comes from stateents like 'O(n)+C = O(n)', which are actually false. O(n)+C and O(n) are different sets. If a function belongs to the latter set, it also belongs to the former, but that doesn't mean they're equal.

Using the equal sign here is "abuse of notation". 'O(n)+C = O(n)' would seem to implicate that C=0, but that's untrue. C is a constant which can be non-zero. Therefore the whole statement doesn't really make sense unless we assign a new definition to the equal sign.

Using set notation solves all the confusion.

'Implied +C' is pretty much nonsense. There are an infinite number of different sets that a single function belongs to. Just because f(x) belongs to O(n) and therefore to O(n)+C doesn't mean that 'big-O implies +C'. 'big-O' doesn't imply anything like that. '+C' is just a special case, and there's an infinite number of extra 'implies'.

anonymoushn · on Sept 14, 2013

http://en.wikipedia.org/wiki/Big_O_notation#Formal_definitio...

A function f(x) is a member of the set O(n) iff there is some M,x0 such that for all x>x0, |f(x)|<=Mn. This means functions like lg n, n+30, n+1e999, 1e999n + sqrt(n), sin(n), and 7 are all members of the set O(n).

It is difficult to give a formal proof that O(n) + C = O(n) because there isn't really a formal notion of what is meant by O(n) + C. To me, that looks like the result of adding a real constant to a set of functions. It is common to see people do algorithmic analysis and replace terms in a runtime with a set they belong to, in which case an expression like O(n) + C might appear, but what this means is just "some function that is in O(n) plus some constant," and it is simple to prove that the set of all functions that can be described in this way is equal to the set O(n). O(n) is closed under the operation of adding constants, and the "adding constants" operation is its own inverse.

dragonwriter · on Sept 13, 2013

> O(n)+C and O(n) are different sets.

Yes, and no. O(n) isn't a set until you specify the limit that goes with it; for some limits O(n) is different from O(n)+C -- the latter of which is equivalent to O(n+C). However, in CS, O(n) with no specified limit refers by convention to the set O(n) as n->infinity, and O(n) as n->infinity is equal to O(n+C) as n->infinity, because the limit as n->infinity of n is equal to the limit as n->infinity of n+C.

jlarocco · on Sept 13, 2013

> Can you state this in a mathematically precise notation?

Big-O isn't meant to be precise. The entire point is to roughly describe algorithm behavior.

> 'Implied +C' is pretty much nonsense.

Implied +C is built in to the definition of big-O notation. As n increases towards infinity the influence of C becomes negligible, so it's left off. Read the first chapter of any algorithms book.

eggnet · on Sept 13, 2013

If I recall correctly, more generally O(A*f(n)+B) = O(f(n)) where A and B are constant.

nawitus · on Sept 13, 2013

But if an algorithm is O(0.5), then it also belongs to O(1) and O(5), e.g. the time is constant and doesn't change with the problem size.

__alexs · on Sept 13, 2013

I imagine that was the intent but the idea of something that is only approximately worst case constant time is quite odd. Does that mean it's only ever so slightly O(N)? O(N/BIGNUM) Except as I understand it that would still normally be written O(N)...

_f8tq · on Sept 13, 2013

the tilde notation is used for functions, e.g. you can say f(n) is ~g(n) and in that context has a rigorous meaning (limit of f(n)/g(n) is 1 as n goes to positive infinity). I haven't seen it being used with orders of growth notations (O, Θ or Ω) and my guess is that's ~O(1) is undefined or somebody's just taking liberties with the notations.

ape4 · on Sept 13, 2013

This blog post appears to be to get the author's resume looked at. Seems to be working.

macNchz · on Sept 13, 2013

17 years of both Java and JavaScript, now ain't that something...