11 Comments

Hey,

> Firstly, the gzip + KNN approach is lightweight and economical, especially when compared to cumbersome deep learning models.

as you mention later, it is not really economical, given that it requires a full KNN computation at each inference. In effect, this requires a lot of CPU and a lot of memory. I'm trying to reproduce the results, but it seems even reproducing AG_NEWS takes around 24 hours.
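For readers trying to reproduce this: the paper's method boils down to normalized compression distance (NCD) plus KNN. A minimal sketch (function names are mine, not from the npc_gzip repo) shows why inference is so expensive — every prediction compresses the query against the entire training set:

```python
import gzip

def clen(s: str) -> int:
    # Compressed length of a string under gzip.
    return len(gzip.compress(s.encode()))

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance between two texts.
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(query: str, train: list, k: int = 3) -> str:
    # One compression of (query + doc) per training document:
    # with n train docs and m test docs, that is O(n * m) compressions.
    dists = sorted((ncd(query, text), label) for text, label in train)
    labels = [label for _, label in dists[:k]]
    return max(set(labels), key=labels.count)
```

Nothing is precomputed or indexed, which is exactly the cost problem described above.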

> Unlike state-of-the-art models that necessitate training, this model is non-parametric — meaning it does not contain parameters that need to be learned, which significantly reduces costs.

yes, but this misses the idea that some compression algorithms do build a kind of parametric model when they build a compression dictionary. The dictionary is a form of learning and could be re-used at inference.

I tried to build a demo of such an approach here:

https://github.com/cyrilou242/ftcc

It uses zstd to build compression dictionaries.

Overall, training + inference is multiple orders of magnitude faster than the gzip approach, and performance seems to be similar.
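ftcc itself uses zstd dictionaries; for illustration, here is a stdlib-only sketch of the same train-once idea using zlib's preset deflate dictionaries (helper names are hypothetical, and this simplifies what ftcc actually does). At inference, the cost is one compression per class rather than one per training document:

```python
import zlib

def build_class_dicts(train):
    # One preset deflate dictionary per class: simply the concatenated
    # training texts of that class, capped at deflate's 32 KB window.
    joined = {}
    for text, label in train:
        joined[label] = joined.get(label, "") + "\n" + text
    return {label: s.encode()[-32768:] for label, s in joined.items()}

def compressed_size(data: bytes, zdict: bytes) -> int:
    c = zlib.compressobj(level=9, zdict=zdict)
    return len(c.compress(data) + c.flush())

def classify_with_dicts(query: str, class_dicts) -> str:
    # The class whose dictionary "explains" the query best, i.e. yields
    # the smallest compressed output, wins.
    q = query.encode()
    return min(class_dicts, key=lambda lbl: compressed_size(q, class_dicts[lbl]))
```

The "training" here is just concatenation; zstd's dictionary trainer is considerably smarter about picking representative substrings.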

I'd be curious to have your feedback on this.


I took the liberty of sharing this on Twitter and it has gone viral there: https://twitter.com/abhi9u/status/1683141215871705088

PS: Someone in the replies mentioned that they also suggested a similar idea on HN - https://twitter.com/doodlestein/status/1683236898779660288


Hey Abhinav, thanks! Funny how compression-based classification is all the rage on Twitter even though it has pretty poor performance.

I was not aware of the comment on HN. My inspiration comes from here:

https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/

https://engineering.linkedin.com/shared-dictionary-compression-http-linkedin

(dictionaries used to improve compression in the HTTP protocol),

Peter Norvig's AIMA book, and some NLP work I did in the past.

I was originally planning to implement it in Java with https://github.com/gtoubassi/femtozip; I had no knowledge of zstd.

I chose Python because it seems to be the main language of the NLP community, and it made it fairly easy to re-use data loaders from the npc_gzip repo.

In the end this is pretty sad, because the most time-consuming operation is now a simple "\n".join(texts) :/

I'm not claiming I have a new idea here, though; just that I'm writing a correct baseline for compression-based methods, with a decent speed/accuracy ratio.

IMO, KNN methods are only relevant if it's possible to build a database that makes it extremely fast and cheap to find the nearest neighbors, as with embeddings.


Thanks for the references Cyril.

Yes, I agree with you that KNN is only useful if there is some way to index the neighbors and then look them up, rather than doing the whole computation every time.


Yes, I also tried running this KNN on the AG_NEWS dataset and it did not finish. What I ended up doing was reducing the dataset by randomly sampling from each class and creating a smaller dataset for KNN. In the last part of the paper, where they compare bzip2, LZMA, and zstd, they mention they used a smaller sample because the algorithms were too slow. And from their README it also seems the default is to run with 100 samples. Even Ken Schutte seems to have given up on trying it on all the datasets when he found the bug in their code.

RE: the parameters of the compression algorithm: I agree that some compression algorithms do build a model of the data, which is similar to learning from it. Typically these are statistical models, such as Huffman coding, arithmetic coding, or ANS (used in zstd).

However, in all the compression algorithms used in the paper (gzip, lzma, zstd), the first stage is LZ77, which is not statistical and does not require building a model of the data. The next stage in these algorithms is some sort of entropy coding. As far as I know, gzip applies Huffman coding, and zstd seems to apply Huffman + ANS. At that point, much of the data has already been transformed by LZ77, so I am not sure how much of the original information is left for the coding stage to be said to build a model of the original data. I have been thinking a lot about this question but can't come up with any way to reason about it. What are your thoughts on it?
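One way to probe this empirically: zlib exposes a Z_HUFFMAN_ONLY strategy that disables LZ77 match-finding, so you can measure how much each stage contributes on a given text. A small experiment (illustrative only, on a deliberately repetitive input):

```python
import zlib

def deflate_size(data: bytes, strategy: int) -> int:
    # Compressed size under deflate with the given strategy.
    c = zlib.compressobj(level=9, strategy=strategy)
    return len(c.compress(data) + c.flush())

text = ("the quick brown fox jumps over the lazy dog. " * 50).encode()
full = deflate_size(text, zlib.Z_DEFAULT_STRATEGY)  # LZ77 + Huffman
huff = deflate_size(text, zlib.Z_HUFFMAN_ONLY)      # Huffman coding only
```

On repetitive text like this, the LZ77-enabled size is far below the Huffman-only size, which supports the view that the match-finding stage, not the entropy coder, is doing the heavy lifting for this classification trick.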

I checked your project. It seems like an effective approach, and the performance is remarkable, both in terms of speed and accuracy. Very impressive. However, as you are generating a dictionary by training on the data, I wonder if this will perform as well on an out-of-distribution data point? Did you try that?


I've already replied on Twitter as well, but I think I have some answers to your questions:

- I did an ablation study where I used just LZ4 (instead of LZ77, because I didn't find an efficient Python implementation), with no Huffman coding. In my experimental settings, LZ4 performed just as well, despite having a much lower compression ratio (and with the benefit of running 10x faster). So I am not sure there is any benefit at all to doing Huffman or ANS.

- This sounded to me as if the method were just comparing strings between documents (since, in the end, better compression by LZ4 means that more substrings are shared between two documents). If it's just about substring matching, why not pre-compile a set of all substrings belonging to some topic and compare a query document against that set? That's what I tried, and it still gets comparable accuracy. My take is that the ftcc code is doing basically the same thing (in a very neat way, and much faster than mine :)).
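For concreteness, a toy version of that pre-compiled substring-set idea (my own sketch, not the code from the linked repo) could look like this, using character n-grams as the "substrings":

```python
def ngrams(text: str, n: int = 5) -> set:
    # Character-level n-grams, roughly the granularity LZ77 matches at.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_profiles(train, n: int = 5):
    # Pre-compile the set of all substrings seen in each topic.
    profiles = {}
    for text, label in train:
        profiles.setdefault(label, set()).update(ngrams(text, n))
    return profiles

def classify_by_overlap(query: str, profiles) -> str:
    # Score a query by how many of its n-grams each topic set contains.
    q = ngrams(query)
    return max(profiles, key=lambda label: len(q & profiles[label]))
```

Unlike the KNN variant, all the per-training-document work happens once, up front.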

Here's a link to my repo: https://github.com/mattminder/npc_gzip

And a write-up: https://towardsdatascience.com/is-it-compression-that-you-need-7aa544a328b1


Hi Matthias, thank you for sharing your write-up. I went through it.

I agree with your observations 100%. In fact, I spent a lot of time trying to understand why gzip worked so well for classification myself before I wrote this article. My original plan was to write about that, but then I decided to write an explainer first.

As per my experiments and my understanding of how compression works, this is exactly what is going on. It is LZ77, more than anything else, that produces the lower compressed sizes for documents with matching patterns.


> If it's just about substring matching, why not pre-compile a set of all substrings belonging to some topic and compare a query document against that set? That's what I tried, and it still gets comparable accuracy. My take is that the ftcc code is doing basically the same.

Totally agree. In your approach you may want to perform stemming.

(I could do it in ftcc too.)

My goal was mostly to provide a better baseline for naive compression-based classification which is fast.

I believe a tf-idf baseline done correctly should outperform my approach (and yours, sorry). Overall, the baselines provided in the npc_gzip paper are not very honest, IMHO.

I'll try to build a solid tf-idf when I have time.
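For reference, a minimal character-level tf-idf baseline along those lines might look like this, assuming scikit-learn (a sketch, not a tuned model; character n-grams keep the comparison to compression methods fair, since LZ77 matches are also character-level):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tf-idf over character n-grams (within word boundaries),
# followed by a simple linear classifier.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
```

After `baseline.fit(texts, labels)`, predictions come from `baseline.predict([...])`; a "solid" version would still need hyperparameter tuning on each dataset.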

One interesting result about compression is here, though:

https://github.com/cyrilou242/ftcc/#compression-analysis


I didn't do stemming on purpose, to remain as close as possible to the original paper (that's why I did character-level n-grams as well), but I agree that one would expect it to improve performance.

I am looking forward to hearing whether you can outperform these methods with tf-idf. I think it would be interesting if you couldn't (and that this would be the main merit of the paper, IMO), because it would suggest that there are some text classification tasks for which giving a lot of weight to the extremes of the word/character/stem distributions performs better than looking at the means. In those cases, I would expect the performance gap between neural nets and "stupid baselines" to be much lower, since the neural nets would need a lot of parameters to model the tails appropriately.


I have no opinion on to what extent it's building a model of the original data.

One interesting result on this topic may be this one:

https://github.com/cyrilou242/ftcc/#compression-analysis

But I haven't put thoughts into this.

> I wonder if this will perform as well on an out of distribution data point? Did you try that?

It should perform pretty poorly.

Stemming, or tokenizing with a WordPiece-like tokenizer, could help slightly with this, but nothing crazy good will happen, I guess. I'll try when I find the time.


I wrote a follow-up article on this, which dissects various compression algorithms to analyze what exactly happens during compression that leads to the unexpected performance in text classification: https://codeconfessions.substack.com/p/lz77-is-all-you-need
