8 Comments

Very cool, from the first author!

Wow, I did not expect this. Thank you so much :-)

It was a very interesting paper. As someone who is not doing research, I picked up quite a few new insights about LLMs in the process of reading it.

Your insights were very interesting even for me! I hope you enjoyed learning about the beauty of arithmetic coding and compressing with it. One comment on tokenization: I think you missed that its real purpose is mainly to reduce the context length by a lot. Using raw ASCII seems to work better in general (in terms of prediction loss, i.e. compression), but it is very impractical.
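
To make the context-length point concrete, here is a rough sketch comparing how many tokens the same text costs at the byte level versus under a BPE tokenizer. The use of the tiktoken library and its GPT-2 encoding here is my own illustration, not something from the post:

```python
# Rough illustration: the same text as raw bytes vs. GPT-2 BPE tokens.
import tiktoken  # assumed here purely for illustration

text = "Language models compress text by predicting the next symbol. " * 20

byte_tokens = list(text.encode("utf-8"))   # byte-level: one token per byte
bpe = tiktoken.get_encoding("gpt2")        # GPT-2's BPE vocabulary
bpe_tokens = bpe.encode(text)              # BPE: one token per merged chunk

print(f"byte-level tokens: {len(byte_tokens)}")
print(f"BPE tokens:        {len(bpe_tokens)}")
# BPE typically produces several times fewer tokens, so the same context
# window covers far more text, which is the practical reason to tokenize.
```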

I did, indeed.

I have to admit, I did not dig very deep into tokenization, so I could be wrong.

But my understanding was that it generally improves the model's (predictive) performance. For example, the "Language Models are Unsupervised Multitask Learners" paper mentions in section 2.2 that a byte-level tokenizer on WebText did not perform as well, so they chose BPE. Is that a wrong interpretation?

I agree that your results show the ASCII tokenizer achieving the best compression rates across all the models, which also means the models had better prediction loss using ASCII.
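
A minimal sketch of why those two statements are the same thing under arithmetic coding: the ideal code length for a sequence is roughly its negative log-probability under the model, so the compression rate is just the average cross-entropy in bits per symbol. The probabilities below are made up purely for illustration:

```python
import math

# Hypothetical probabilities the model assigned to the true next symbols
# of some text (made-up numbers, purely illustrative).
probs = [0.9, 0.5, 0.7, 0.95, 0.6]

# Arithmetic coding spends roughly -log2(p) bits per symbol, so the total
# code length is the sum of the negative log-probabilities...
total_bits = sum(-math.log2(p) for p in probs)

# ...and the compression rate is the average cross-entropy in bits per
# symbol, i.e. the prediction loss measured in bits.
bits_per_symbol = total_bits / len(probs)
print(f"total: {total_bits:.2f} bits, rate: {bits_per_symbol:.2f} bits/symbol")
# Better prediction loss (higher probabilities on the true symbols)
# therefore translates directly into a better compression rate.
```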

Yes, I love the depth of the data!

Thank you, Charlie!

Amazing! @Abhinav, your content is always top-notch. The best in our field that I have come across on Substack.

Thanks, Mir! You are always very kind. I love your content as well; there is always something new to learn.
