
Breaking text into sub-word units...

1. Greatly reduces memory usage. Instead of memorizing every inflection of the word "walk", it memorizes the root (walk) and the modifiers (ing, ed, er, ...). These modifiers can be reused for other words.

2. Allows for word compositions that weren't in the training set. This is great for uncommon or new expressions like "googlification" or "unalive".



The walk example doesn't quite hold up.

If you put:

    test walk walker walking walked
into the tokenizer you will see the following tokens:

    [test][ walk][ walk][er][ walking][ walked]
Only walker is broken up into two different tokens.

I added "test" to that because walk at the start doesn't include the leading space and [walk] and [ walk] are different tokens.

For even more fun, [walker] is a distinct token if it doesn't include the leading space.

    test walker floorwalker foowalker
becomes:

    [test][ walk][er][ floor][walker][ fo][ow][alker]
How we think of words doesn't cleanly map to tokens.

(Late edit)

    walker floorwalker
becomes tokenized as:

    [walker][ floor][walker]
So in that case, they're the same token. It's curious how whitespace influences the word-to-token mapping.
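A toy sketch of why the leading space can change the split, assuming a BPE-style tokenizer that applies learned merges in priority order. The merge table `TOY_MERGES` and the helper `bpe_encode` are made up for illustration and are not taken from any real tokenizer; the point is only that "walker" and " walker" start from different character sequences, so different merges win.

```python
# Hypothetical merge table in priority order (earlier = higher priority).
# Invented for illustration; not the merge list of any real tokenizer.
TOY_MERGES = [
    ("w", "a"), ("wa", "l"), ("wal", "k"),  # builds "walk"
    (" ", "walk"),                          # builds " walk"
    ("e", "r"),                             # builds "er"
    ("walk", "er"),                         # builds "walker"
]

def bpe_encode(piece, merges):
    """Split one piece of text into tokens by repeatedly applying the
    highest-priority merge that matches an adjacent pair of symbols."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(piece)  # start from single characters
    while len(symbols) > 1:
        # Collect (priority, position) for every adjacent pair with a merge rule
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best-ranked pair, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(bpe_encode("walker", TOY_MERGES))   # no leading space
print(bpe_encode(" walker", TOY_MERGES))  # leading space
```

With this table, the space gets glued onto "walk" before "walk" and "er" get a chance to merge, and there is no rule joining " walk" with "er", so the spaced form stays in two pieces.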


There’s no syntax or structure to the token set. The actual tokens were algorithmically selected based on the training data to (putting things loosely) optimize compression of the training data given a token set size.
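A minimal sketch of that selection process, assuming plain byte-pair encoding (BPE) over characters: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new token, and repeat until the merge budget is spent. The function name `train_bpe` and the tiny corpus are illustrative assumptions; production tokenizers work on bytes, pre-split the text, and differ in details.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself for determinism
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = [" walk"] * 5 + [" walked", " walking", " walker"]
print(train_bpe(corpus, 4))
```

Nothing in the loop knows about roots or suffixes; a pair becomes a token only because it is frequent, which is why the resulting vocabulary need not line up with morphemes.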


Sure, but what I'm hearing in the parent post is a question about why we don't use linguistically motivated subword units (of similar length/vocabulary size and thus memory usage), e.g. cutting at morpheme boundaries, instead of whatever an algorithm like BPE calculates.


Gratitudes for the explanatories, trousering this for rethinkalysing tomorn.



