A detailed analysis of the DeepMind/Meta study: how large language models achieve unprecedented compression rates on text, image, and audio data - and the implications of these results
Very cool, from the first author!
Wow, I did not expect this. Thank you so much :-)
It was a very interesting paper. As someone who is not doing research, I picked up quite a few new insights about LLMs while reading it.
Your insights were very interesting even for me! I hope you enjoyed learning about the beauty of arithmetic coding and compressing with it. One comment on tokenization: I think you missed that its real purpose is just to reduce the context length by a lot. Using raw ASCII seems to work better in general (in terms of prediction loss, i.e., compression), but it is very impractical.
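For anyone following the thread, here is a minimal sketch of why prediction loss and compression are essentially the same quantity here: an arithmetic coder driven by a predictive model spends roughly -log2 p(symbol | context) bits per symbol, so summing those code lengths gives the compressed size. Everything in the toy model below is invented for illustration (it just stands in for a real model's next-character distribution); it is not the paper's code.

```python
import math

def toy_model(context: str) -> dict:
    """Hypothetical stand-in for a model's next-character distribution."""
    alphabet = set("the quick")           # tiny illustrative alphabet
    probs = {c: 1.0 for c in alphabet}
    if context.endswith(" "):
        probs["t"] = 5.0                  # after a space, bet on 't'
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

def ideal_code_length_bits(text: str) -> float:
    """Bits an arithmetic coder driven by toy_model needs (overhead ignored)."""
    bits = 0.0
    for i, ch in enumerate(text):
        p = toy_model(text[:i])[ch]       # probability assigned to the true next char
        bits += -math.log2(p)             # Shannon code length for this symbol
    return bits

text = "the quick the"
bits = ideal_code_length_bits(text)
raw_bits = 8 * len(text)                  # raw ASCII size
print(f"{bits:.1f} bits vs {raw_bits} raw bits -> rate {bits / raw_bits:.2f}")
```

The better the model predicts the next symbol, the smaller that sum gets, which is why prediction loss and compression rate move together.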
I did, indeed.
I have to admit, I did not dig very deep into tokenization, so I could be wrong.
But my understanding was that it generally improves the model's (predictive) performance. For example, the "Language Models are Unsupervised Multitask Learners" paper mentions in section 2.2 that a byte-level tokenizer did not perform as well on WebText, which is why they chose BPE. Is that a wrong interpretation?
I agree that in your results the ASCII tokenizer visibly achieved the best compression rates across all the models, which also means the models had better prediction loss with ASCII.
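To make that link concrete with a back-of-the-envelope sketch (the per-token losses and the 4-bytes-per-token figure below are invented for illustration, not numbers from the paper or from my post): a coarser tokenizer shortens the sequence a lot, but each token then has to "pay" more bits of loss, so the per-byte compression rate can end up slightly worse than with raw ASCII.

```python
def compression_rate(loss_bits_per_token: float, bytes_per_token: float) -> float:
    """Compressed size / raw size, ignoring arithmetic-coder overhead."""
    return loss_bits_per_token / (8.0 * bytes_per_token)

# ASCII-style tokenizer: 1 byte per token, few bits of loss per token,
# but the sequence the model has to fit in its context is long.
print(compression_rate(loss_bits_per_token=1.2, bytes_per_token=1.0))  # 0.15

# BPE-style tokenizer: ~4 bytes per token, so each token carries more loss,
# but the same text needs roughly 4x fewer tokens of context.
print(compression_rate(loss_bits_per_token=5.5, bytes_per_token=4.0))  # ~0.17
```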
Yes, I love the depth of the data!
Thank you, Charlie!
Amazing! @Abhinav, your content is always top-notch. The best in our field that I have come across on Substack.
Thanks, Mir! You are always very kind. I love your content as well, always something new to learn.