Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training...

philipkglass · on Oct 24, 2024

What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared with:

https://huggingface.co/docs/hub/en/gguf#quantization-types

It might even mean a non-GGUF quantization scheme; I'm just an intermediate user of local models, not an expert user or developer.

formalsystem · on Oct 25, 2024

Please ignore my previous comments. I double-checked with the model developers, and here's the correction: vanilla PTQ means no fancy quantization algorithm like SpinQuant, AWQ, etc. was applied. It just applied the same quantization scheme mentioned in the post (4-bit per-group with g_size=32 symmetric weight, 8-bit dynamic per-token activation).
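To make the scheme above concrete, here is a minimal pure-Python sketch of symmetric per-group round-to-nearest quantization. It is not torchao's implementation. The function names are hypothetical, and the scale choice (`max_abs / qmax`) and the use of a symmetric scheme for activations are assumptions for illustration; the post only specifies the bit widths, the group size, and that the weights are symmetric.

```python
def quantize_symmetric(values, bits):
    """Symmetric round-to-nearest quantization with one shared scale.

    Assumption: scale = max_abs / qmax, so the largest magnitude maps to qmax.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed integer range [-qmax - 1, qmax].
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_weights_per_group(row, group_size=32, bits=4):
    """Quantize one weight row in groups; each group gets its own scale."""
    groups = []
    for start in range(0, len(row), group_size):
        groups.append(quantize_symmetric(row[start:start + group_size], bits))
    return groups


def dequantize(groups):
    """Map the integer codes back to floats using each group's scale."""
    out = []
    for q, scale in groups:
        out.extend(x * scale for x in q)
    return out


# Hypothetical data: one weight row of 64 values -> two groups of 32.
weights = [(i - 32) / 10 for i in range(64)]
weight_groups = quantize_weights_per_group(weights)

# "Dynamic per-token" activations: the scale is computed at run time from
# each token's own values (symmetric here purely as an assumption).
token_activations = [0.5, -1.0, 0.25]
act_q, act_scale = quantize_symmetric(token_activations, bits=8)
```

"Vanilla" here just means this plain rounding step is all that happens; methods like AWQ or SpinQuant transform or rescale the weights before rounding to reduce the error.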

formalsystem · on Oct 24, 2024

So this should be referring to w8a8 (weights and activations in 8 bit)

So this is going to be 8-bit weights, 8-bit activations, group size of 256, symmetric quantization. I'm not sure how to map this to the GGUF variants, because they don't mention whether they do activation quantization.

imjonse · on Oct 25, 2024

Were there comparisons made to AWQ, SmoothQuant, GPTQ, or other non-vanilla PTQ methods? Thanks.

formalsystem · on Oct 25, 2024

Not that I know of for this study. As for the specific scope of torchao, we want to make it easier for researchers to create new quantization algorithms in Python and have those algorithms run fast. You can see a lot of those algorithms here: https://github.com/pytorch/ao/tree/main/torchao/prototype

So, for example, we can accelerate AWQ and GPTQ by using a fast int4 kernel called tinygemm.

Evidlo · on Oct 25, 2024

I have a non-ML question.

In vanilla PyTorch I have the following expression:

    t.sum(values[inds] * weights)

If 'inds' is int8, I get "IndexError: tensors used as indices must be long, int, byte or bool tensors".

Is this still true if I use torchao?

formalsystem · on Oct 25, 2024

The issue here is that memory in PyTorch is byte-addressable, and that's a limitation we can't solve without making a lot more changes to PyTorch. But in your specific case, if you'd like to pack more data into `values`, you can use a combination of clever bit shifting, torch.cat, and other bit-twiddling PyTorch-like ops. It's a trick we use quite heavily in torchao.
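As a rough illustration of the packing trick, here is a plain-Python sketch that stores two signed 4-bit values in each byte using shifts and masks. This is not torchao's code: the function names and the low-nibble-first layout are assumptions, and a real version would apply the same shifts to uint8 tensors.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    Values outside -8..7 are not checked and would wrap silently.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Mask each value to 4 bits, then shift the second into the high nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


values = [-8, 7, 0, -1, 3, -4]
packed = pack_int4(values)       # 3 bytes instead of 6
restored = unpack_int4(packed)   # same as `values`
```

The memory stays byte-addressable; each byte just carries two logical elements, so any kernel that reads it has to unpack before doing math.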

Evidlo · on Oct 25, 2024

Aren't int8s byte-aligned, though? I thought this restriction was originally motivated by the maintenance overhead of having to support more dtypes.

saagarjha · on Oct 25, 2024

Do you ever pronounce torchao in a way that rhymes with "wow"?

formalsystem · on Oct 25, 2024

My wife calls it torch AAAW