7 Comments
Aug 4, 2023 · Liked by Alejandro Piad Morffis, Abhinav Upadhyay

This is all quite true, but tokenization is a very small part of AI, and it is not the cutting-edge path to broad-knowledge AI assistants.

Also, context is very important, and that is what the vector-database side provides: all tokens are vectorized and associated with each other so that sentences can have meaning.

...

The choice between tokenization and vectorization depends largely on the problem.

If you're just translating a lot of words, then tokenization, and perhaps compressed tokenization, is most useful. Especially for running AI in mobile phone apps, as predicted for the near future, any means of compressing the input stream and saving memory is valuable.

On the other hand, if you're comparing documents, or asking the AI about a deep and long subject with memory, the vector-database methods are far better, because the meaning of the words is preserved rather than just enumerating tokens, whether as words or syllables.
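A minimal sketch of that contrast, with toy 2-D embeddings invented purely for illustration (the vectors, words, and token IDs below are assumptions, not from any real model): cosine similarity over embeddings reflects relatedness, while the distance between enumerated token IDs reflects nothing.

```python
import math

# Toy 2-D embeddings, invented for illustration (not from a trained model).
embeddings = {
    "man":   (0.9, 0.8),   # second axis loosely encodes "human-ness"
    "woman": (0.8, 0.9),
    "rock":  (0.9, 0.1),
}

# Token IDs assigned by enumeration: their numeric distance carries no meaning.
token_ids = {"man": 101, "woman": 102, "rock": 103}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["man"], embeddings["woman"]))  # close to 1.0: both human
print(cosine(embeddings["man"], embeddings["rock"]))   # noticeably lower
print(token_ids["woman"] - token_ids["man"])           # 1, but this says nothing
```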

...

Sorry for responding, but this 'all you need' framing has been quite common for something like five-plus years now, and nothing is all you need. As with the Swiss Army knife, use the right tool for the job; when your only tool is a hammer, everything looks like a nail, and compressed tokenization is not all you need.

I don't know if you're the GitHub author, but a useful stat for this article would be the total memory used in each case, so people can clearly see which approach is best for compressing the stream. Also, since you're using FP32/64, you might want to consider FP8/16, depending on the depth of the tokens, to achieve maximum compression.

I personally prefer two-dimensional vector representations, because I want meaning: two dimensions can show you that man/woman are similar in that both are human, whereas a one-dimensional tokenization may tell you that man and woman are close but not that they share the common family. For the AI beginner, I would suggest just writing your own bag-of-words in Python (see the sketch below), understanding tokenization, and being done with it, since the real meat of AI is in training the weights and doing the minimization of training, a.k.a. linear algebra.
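For that beginner exercise, here is a minimal bag-of-words sketch in plain Python; the tiny corpus and the function names are made up just for illustration.

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace/punctuation tokenizer -- enough to see the idea.
    return [w.strip(".,!?;:").lower() for w in text.split()]

def bag_of_words(docs):
    # Build a shared vocabulary, then count word occurrences per document.
    vocab = sorted({w for d in docs for w in tokenize(d)})
    vectors = []
    for d in docs:
        counts = Counter(tokenize(d))
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = ["The man saw the woman.", "The woman saw the dog."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['dog', 'man', 'saw', 'the', 'woman']
print(vectors)  # [[0, 1, 1, 2, 1], [1, 0, 1, 2, 1]]
```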

I think this topic is good for advanced optimization, say for a person who is trying to push a HUGE app onto a mobile phone and needs to compress memory.

Author

What you are saying is true and relevant. However, this post's objective was to dissect what is happening inside gzip and other compression methods that lets them perform well on text classification (as shown in the ACL paper). I'm not talking about using LZ77 or other algorithms for general-purpose NLP. And even the original paper's idea was to highlight the fact that simpler techniques still work, and we don't need to take the big guns (LLMs) out for everything.
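For readers who haven't seen the paper, a rough sketch of the idea it builds on: use compressed length as a similarity signal (normalized compression distance) and classify with nearest neighbors. The tiny training set and labels below are invented for illustration, not taken from the paper.

```python
import gzip

def clen(s: str) -> int:
    # Compressed length in bytes, used as a proxy for information content.
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance between two strings.
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Tiny made-up training set of (text, label) pairs.
train = [
    ("the team won the match in overtime", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stocks fell after the earnings report", "finance"),
]

def classify(query: str, k: int = 2) -> str:
    # k-nearest neighbors under NCD; majority label wins.
    neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

print(classify("the goalkeeper saved a penalty"))  # intended: 'sports' (toy data, so results may be noisy)
```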


I'm not picking on you, and I should have just ignored it, but your 'this is the end' framing triggered me; that statement is used far too often.

Like I said, re-post your article and include real data on how much memory can be saved by the different algos, also using FP8, FP16, FP32, and FP64 for the tokens.
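A back-of-the-envelope sketch of the kind of numbers being asked for here; the vocabulary size and embedding dimension are assumptions chosen only for illustration, so the point is just the 1x/2x/4x/8x scaling across precisions.

```python
# Bytes per value for each floating-point format.
BYTES_PER_VALUE = {"FP8": 1, "FP16": 2, "FP32": 4, "FP64": 8}

# Illustrative assumptions: 32,000-token vocabulary, 768-dim embeddings.
vocab_size = 32_000
embedding_dim = 768

for fmt, nbytes in BYTES_PER_VALUE.items():
    total_bytes = vocab_size * embedding_dim * nbytes
    print(f"{fmt}: {total_bytes / 2**20:.1f} MiB for the embedding table")
# FP8:   23.4 MiB
# FP16:  46.9 MiB
# FP32:  93.8 MiB
# FP64: 187.5 MiB
```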

I think that next year, when people start moving all this shit to mobile phones, you might be able to patent some useful things if you document now exactly how much memory can be saved.

The compression algos not only save space, but also offer variable-length tokenization in a unique model that the AI can learn from.

You just need to document the savings.

The issues are:

1.) Which algo offers the most contrast in token meaning?

2.) Which algo offers the most compression for long token queries, say a 32k prompt on a cell phone asking a query, when you have thousands of queries and ergo memory/storage is precious?

Like I said, I prefer vectors, because I prefer word association across the entire paragraph, but both tokenization and vectorization have their place. The novelty of your paper is the advanced compression, but you need to specifically show which algo has what benefit (see the sketch below).
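One way to start answering the compression half of that question with nothing but the standard library (zlib, bz2, lzma); the repeated synthetic prompt below is just a stand-in for a real ~32k query, so treat the resulting numbers as illustrative only.

```python
import bz2
import lzma
import zlib

# Stand-in for a long prompt; real measurements should use real query text.
prompt = "Summarize the following meeting notes and list the action items. " * 600
raw = prompt.encode("utf-8")

codecs = {
    "zlib (DEFLATE/LZ77)": zlib.compress,
    "bz2 (BWT)": bz2.compress,
    "lzma (LZMA)": lzma.compress,
}

print(f"raw: {len(raw)} bytes")
for name, compress in codecs.items():
    compressed = compress(raw)
    ratio = len(compressed) / len(raw)
    print(f"{name}: {len(compressed)} bytes ({ratio:.1%} of original)")
```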


Sure, it's not as good and not an actual functional component you'd likely use in, say, a production app. But it proves that there are other ways to approach these problems besides scaling up. I have hope that even tasks we currently only know how to use Transformers or convnets for will end up having 'workarounds', or at least other algorithms. This particular algorithm doesn't illustrate a step on the way to that, but it does speak to the fact that there are plenty of other ways out there still waiting for us to find them, and we shouldn't be myopic about the SOTA.

Aug 4, 2023 · edited Aug 4, 2023

Tokenization as in this OP is largely what is used for language translation, which is OK and dandy: say, an app where, as quickly as you can clip a page of text in an old language, it gets translated, and you can then instantly translate into any language offline, with no 'nickels going to the GPT-4 meter' (or your data being fed to Google, Palantir, or OpenAI/CIA spy centers).

RLE (run-length encoding) is all linear and deterministic, so it should scale fine. There are thousands of algos, so an AI-trained detector should determine the best algo for particular data; for instance, RLE can compress some photos by 99% while being nearly useless for text.
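A minimal run-length-encoding sketch to illustrate that point: data with long runs (like flat image regions) collapses dramatically, while ordinary text barely shrinks. The data below is purely illustrative.

```python
from itertools import groupby

def rle_encode(data: bytes) -> list:
    # Collapse each run of identical bytes into a (value, run_length) pair.
    return [(value, len(list(run))) for value, run in groupby(data)]

# Image-like data: long runs of identical pixel values compress hugely.
scanline = bytes([255] * 500 + [0] * 500)
# Ordinary English text: almost no runs, so RLE buys nothing.
sentence = b"the quick brown fox jumps over the lazy dog"

print(len(rle_encode(scanline)))  # 2 pairs describe 1000 bytes
print(len(rle_encode(sentence)))  # roughly one pair per byte
```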

...

In 5-10 years, space will hardly matter, but today, with 64 GB smartphones prevalent, it's like deja vu: programming on 4 KB of RAM all over again, 1970s style, where algos rule.

For now, try to be the first kid on the block with killer AI apps on mobile phones that allow the user to capture and use huge amounts of data, making their experience seem infinite.

Vector tokenization (a vector database) is more useful for searching paragraphs of similar material across thousands of documents.


In the coming WAR next year to put LLM models on mobile phones, the first to market will get rich.

The issue is cutting the overhead of the weights, the prompts, and the token lists in and out.

If you can cut the current token list by, say, 90%, then you can be the first on your block with a lean-and-mean mobile app that does real AI.

People are going to want 8K, 16K, 48K prompts, and they will expect to get back the same 48K of data or more, so the first kids on the block who figure out how to compress all this crap by 90% win.

Aug 5, 2023 · Liked by Alejandro Piad Morffis

Thanks, Alejandro
