A detailed analysis of the DeepMind/Meta study: how large language models achieve unprecedented compression rates on text, image, and audio data - and the implications of these results
Very cool, from the first author!
Wow, I did not expect this. Thank you so much :-)
It was a very interesting paper. As someone who is not doing research, I picked up quite a few new insights about LLMs while reading it.
Your insights were very interesting even for me! I hope you enjoyed learning about the beauty of arithmetic coding and compressing with it. One comment on tokenization: I think you missed that its real purpose is just to reduce the context length by a lot. Using raw ASCII seems to work better in general (in terms of prediction loss, i.e., compression), but it is very impractical.
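For anyone following the thread, here is a minimal sketch of why prediction loss and compression are essentially the same quantity here: an arithmetic coder driven by a predictive model spends roughly -log2 p(symbol | context) bits per symbol, so summing those code lengths gives the compressed size. Everything in the toy model below is invented for illustration (it just stands in for a real model's next-character distribution); it is not the paper's code.

```python
import math

def toy_model(context: str) -> dict:
    """Hypothetical stand-in for a model's next-character distribution."""
    alphabet = set("the quick")           # tiny illustrative alphabet
    probs = {c: 1.0 for c in alphabet}
    if context.endswith(" "):
        probs["t"] = 5.0                  # after a space, bet on 't'
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

def ideal_code_length_bits(text: str) -> float:
    """Bits an arithmetic coder driven by toy_model needs (overhead ignored)."""
    bits = 0.0
    for i, ch in enumerate(text):
        p = toy_model(text[:i])[ch]       # probability assigned to the true next char
        bits += -math.log2(p)             # Shannon code length for this symbol
    return bits

text = "the quick the"
bits = ideal_code_length_bits(text)
raw_bits = 8 * len(text)                  # raw ASCII size
print(f"{bits:.1f} bits vs {raw_bits} raw bits -> rate {bits / raw_bits:.2f}")
```

The better the model predicts the next symbol, the smaller that sum gets, which is why prediction loss and compression rate move together.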
I did, indeed.
I have to admit, I did not dig very deep into tokenization, so I could be wrong.
But my understanding was that it generally improves the model's (predictive) performance. For example, the "Language Models are Unsupervised Multitask Learners" paper mentions in section 2.2 that a byte-level tokenizer did not perform as well on WebText, which is why they chose BPE. Is that a wrong interpretation?
I agree that in your results the ASCII tokenizer visibly achieved the best compression rates across all the models, which also means the models had better prediction loss with ASCII.
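To make that link concrete with a back-of-the-envelope sketch (the per-token losses and the 4-bytes-per-token figure below are invented for illustration, not numbers from the paper or from my post): a coarser tokenizer shortens the sequence a lot, but each token then has to "pay" more bits of loss, so the per-byte compression rate can end up slightly worse than with raw ASCII.

```python
def compression_rate(loss_bits_per_token: float, bytes_per_token: float) -> float:
    """Compressed size / raw size, ignoring arithmetic-coder overhead."""
    return loss_bits_per_token / (8.0 * bytes_per_token)

# ASCII-style tokenizer: 1 byte per token, few bits of loss per token,
# but the sequence the model has to fit in its context is long.
print(compression_rate(loss_bits_per_token=1.2, bytes_per_token=1.0))  # 0.15

# BPE-style tokenizer: ~4 bytes per token, so each token carries more loss,
# but the same text needs roughly 4x fewer tokens of context.
print(compression_rate(loss_bits_per_token=5.5, bytes_per_token=4.0))  # ~0.17
```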
Yes, I love the depth of the data!
Thank you, Charlie!
Amazing! @Abhinav, your content is always top-notch. The best in our field that I have come across on Substack.
Thanks, Mir! You are always very kind. I love your content as well, always something new to learn.