Breaking text into sub-word units... 1. Greatly reduces memory usage. Instead of...

shagie · on April 5, 2023

The walk example doesn't quite hold up.

If you put:

    test walk walker walking walked

into the tokenizer you will see the following tokens:

    [test][ walk][ walk][er][ walking][ walked]

Only walker is broken up into two different tokens.

I added "test" at the front because a "walk" at the very start of the input has no leading space, and [walk] and [ walk] are different tokens.

For even more fun, [walker] is a distinct token if it doesn't include the leading space.

    test walker floorwalker foowalker

becomes:

    [test][ walk][er][ floor][walker][ fo][ow][alker]

How we think of words doesn't cleanly map to tokens.

(Late edit)

    walker floorwalker

becomes tokenized as:

    [walker][ floor][walker]

So in that case, they're the same token. It's curious how whitespace influences the mapping from words to tokens.
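A toy sketch of the whitespace effect, for anyone who wants to poke at it without the real tokenizer. Everything here is an assumption made for illustration: the vocabulary is hand-picked from the tokens shown above, and greedy longest-match is a stand-in, not the merge procedure the actual tokenizer uses. The point is only that a token's text includes (or excludes) the leading space, so "walker" and " walker" are looked up as different strings.

```python
# Toy illustration only: greedy longest-match over a hand-picked vocabulary.
# The vocabulary is hypothetical (copied from the tokens quoted in this thread);
# a real tokenizer applies learned merges rather than this lookup.
VOCAB = {
    "test", "walk", "walker", "er", "ow", "alker",
    " walk", " walking", " walked", " floor", " fo",
}

def tokenize(text, vocab=VOCAB):
    """Split text into the longest vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def show(text):
    """Render tokens in the same [bracketed] style used above."""
    return "".join(f"[{t}]" for t in tokenize(text))

print(show("test walk walker walking walked"))
print(show("test walker floorwalker foowalker"))
print(show("walker floorwalker"))
```

With this vocabulary, " walker" has no entry of its own and falls back to [ walk][er], while a bare "walker" (at the start of the input, or glued onto "floor") matches as one token.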

nonfamous · on April 5, 2023

There’s no syntax or structure to the token set. The actual tokens were algorithmically selected based on the training data to (putting things loosely) optimize compression of the training data given a token set size.
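To make "algorithmically selected" a bit more concrete, here is a minimal sketch of one such procedure, byte-pair encoding (BPE): repeatedly merge the most frequent adjacent pair of symbols until the merge budget runs out. This is a simplification under stated assumptions, not the production training code: it works on characters rather than bytes, splits on whitespace and discards the spaces, and the function name `learn_bpe` and the tiny corpus are made up for the example.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from a whitespace-separated corpus.

    Simplified sketch: symbols start as single characters and spaces are dropped.
    """
    # Each word is a tuple of symbols, weighted by how often the word occurs.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # The most frequent pair becomes a new symbol (ties: first seen wins).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, words

merges, words = learn_bpe("walk walk walked walking walker", num_merges=3)
print(merges)
print(dict(words))
```

Nothing in the loop knows about morphemes or word structure; a pair gets merged purely because it is frequent, which is why the resulting units only sometimes line up with how we think of words.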

PeterisP · on April 6, 2023

Sure, but what I'm hearing in the parent post is a question about why we don't use linguistically motivated subword units (of similar length/vocabulary size and thus memory usage), e.g. cutting at morpheme boundaries, instead of whatever an algorithm like BPE calculates.

LeonB · on April 6, 2023

Gratitudes for the explanatories, trousering this for rethinkalysing tomorn.