1. Greatly reduces memory usage. Instead of memorizing every inflection of the word "walk", it memorizes the root (walk) and the modifiers (ing, ed, er, ...). These modifiers can be reused for other words.
2. Allows for word compositions that weren't in the training set. This is great for uncommon or new expressions like "googlification" or "unalive".
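To make both points concrete, here is a rough sketch of how a small set of reusable pieces can cover words that were never seen whole. The vocabulary (`toy_vocab`) and the greedy longest-match rule in `segment` are assumptions made up for illustration; real tokenizers typically apply learned merge rules rather than this exact lookup.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`.

    Falls back to a single character when no vocabulary entry matches,
    so every input can be segmented.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking one char at a time.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical vocabulary: a few roots plus reusable modifiers.
toy_vocab = {"walk", "talk", "googl", "alive", "ing", "ed", "er", "un", "ification"}

for w in ["walked", "talker", "googlification", "unalive"]:
    print(w, "->", segment(w, toy_vocab))
```

Nine entries cover all four words, and none of the four is stored as a whole.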
There’s no syntax or structure to the token set. The actual tokens were algorithmically selected based on the training data to (putting things loosely) optimize compression of the training data given a token set size.
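For anyone curious what "algorithmically selected" looks like, below is a minimal sketch of BPE-style merge learning: repeatedly fuse the most frequent adjacent pair of symbols. The function name `learn_bpe`, the toy word counts, and the tie-breaking rule are my own assumptions; production tokenizers usually work on bytes, pre-split the text, and differ in many details.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Toy corpus statistics (made up for illustration).
toy_counts = {"walk": 5, "walking": 3, "walked": 2, "talking": 4}
merges, segmented = learn_bpe(toy_counts, num_merges=5)
print(merges)
print(segmented)
```

On this toy input the merges happen to end up at `walk` and `ing`, but nothing in the procedure knows about morphemes. It only follows pair frequencies, so on real data the pieces often don't line up with anything a linguist would recognize.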
Sure, but what I'm hearing in the parent post is a question about why we don't use linguistically motivated subword units (of similar length and vocabulary size, and thus similar memory usage), e.g. cutting along morpheme boundaries, instead of whatever an algorithm like BPE calculates.