add gptq readme

pull/538/head
humu789 2023-05-17 19:00:17 +08:00
parent a7a35debc0
commit 2df570b241
8 changed files with 131 additions and 31 deletions

View File

@ -0,0 +1,56 @@
# GPTQ
> [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
<!-- [ALGORITHM] -->
## Abstract
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
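To make the core update concrete, the following is a minimal, self-contained sketch of the idea: weights are rounded one column at a time, and each column's rounding error is redistributed over the not-yet-quantized columns using the inverse Hessian of the layer inputs. This is an illustration only, not mmrazor's implementation (the function name, the simple per-row asymmetric quantizer, and the dense loop are assumptions; the actual algorithm processes weights blockwise and uses a Cholesky factorization of the inverse Hessian):
```python
import torch

def gptq_quantize(W, H_inv, scale, zero, n_bits=4):
    """Quantize one layer's weight matrix column by column, compensating each
    column's rounding error on the remaining columns via the inverse Hessian."""
    W = W.clone()
    Q = torch.zeros_like(W)
    qmax = 2 ** n_bits - 1
    n_cols = W.shape[1]
    for j in range(n_cols):
        w = W[:, j]
        # Round the current column onto the quantization grid.
        q = torch.clamp(torch.round(w / scale + zero), 0, qmax)
        Q[:, j] = (q - zero) * scale  # store the dequantized value
        # Spread the rounding error over the not-yet-quantized columns,
        # weighted by the corresponding row of the inverse Hessian.
        err = (w - Q[:, j]) / H_inv[j, j]
        if j + 1 < n_cols:
            W[:, j + 1:] -= err.unsqueeze(1) * H_inv[j, j + 1:].unsqueeze(0)
    return Q
```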
## Usage
GPTQ is easy to use in mmrazor. You can use it like this:
```python
from mmrazor.implementations.quantization import gptq

# Prepare your model and dataloaders (user-provided).
model = ...
train_loader, test_loader = ..., ...

# Initialize the GPTQ compressor and prepare the model for quantization.
compressor = gptq.GPTQCompressor()
compressor.prepare(model)

# Collect Hessian statistics from calibration data.
compressor.init_hessian()
compressor.register_hessian_hooks()
infer(model, test_loader, num_samples=num_samples)  # any forward loop over calibration data
compressor.remove_hessian_hooks()

# Quantize the weights with the default quantization config.
compressor.quant_with_default_qconfig()

# Convert back to a plain torch model.
model = compressor.to_static_model(model)
```
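In the snippet above, `infer` is a user-supplied forward loop over calibration data; the hooks registered by `register_hessian_hooks` accumulate the Hessian statistics during these forward passes. A minimal sketch of such a loop (only the call signature comes from the snippet above, the body is an assumption):
```python
import torch

@torch.no_grad()
def infer(model, dataloader, num_samples=128):
    """Forward a limited number of calibration samples so the registered
    hooks can accumulate the per-layer Hessian statistics."""
    model.eval()
    seen = 0
    for batch, _ in dataloader:
        model(batch)
        seen += batch.shape[0]
        if seen >= num_samples:
            break
```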
## Full Examples
- [ResNet](../examples/ResNet/README.md)
- [LLaMA](../examples/language_models/LLaMA/README.md)
## Cite
```latex
@misc{Frantar_Ashkboos_Hoefler_Alistarh_2022,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  year={2022},
  month={Oct},
  language={en-US}
}
```

View File

@ -1,6 +1,10 @@
# SparseGPT
## abstract
> [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774)
<!-- [ALGORITHM] -->
## Abstract
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
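The 2:4 and 4:8 patterns mentioned above constrain which weights may be kept rather than how they are chosen. As a hedged illustration of the pattern itself (not of SparseGPT's error-compensated weight selection), a 2:4 mask keeps the 2 largest-magnitude weights in every group of 4 consecutive weights along the input dimension:
```python
import torch

def apply_2_of_4_mask(W):
    """Zero out 2 of every 4 consecutive weights, keeping the 2 with the
    largest magnitude in each group (2:4 semi-structured sparsity)."""
    rows, cols = W.shape  # cols is assumed to be divisible by 4
    groups = W.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)
```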
@ -19,7 +23,8 @@ train_loader, test_loader
compressor = sparse_gpt.SparseGptCompressor()
compressor.prepare(model)
## init hessian matrix
## get hessian matrix
compressor.init_hessian()
compressor.register_hessian_hooks()
infer(model, test_loader, num_samples=num_samples)
compressor.remove_hessian_hooks()
@ -34,9 +39,9 @@ model = compressor.to_static_model(model)
## Full Examples
- [ResNet](../examples/ResNet/sparse_gpt/README.md)
- [ResNet](../examples/ResNet/README.md)
- [OPT](../examples/language_models/OPT/README.md)
- [Llama](../examples/language_models/Llama/README.md)
- [LLaMA](../examples/language_models/LLaMA/README.md)
## Cite

View File

@ -0,0 +1,25 @@
# Examples for ResNet
## SparseGPT
For more details about SparseGPT, please refer to [SparseGPT](../../algorithms/SparseGPT.md)
### Usage
```shell
python projects/mmrazor_large/examples/ResNet/resnet18_sparse_gpt.py --data {imagenet_path} --batchsize 128 --num_samples 512
```
**Note**: the ImageNet folder should follow the standard torch format.
## GPTQ
For more details about GPTQ, please refer to [GPTQ](../../algorithms/GPTQ.md)
### Usage
```shell
python projects/mmrazor_large/examples/ResNet/resnet18_gptq.py --data {imagenet_path} --batchsize 128 --num_samples 512
```
**Note**: the ImageNet folder should follow the standard torch format (see the layout sketch below).
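"Torch format" here is assumed to mean the standard `torchvision.datasets.ImageFolder` layout, with one sub-directory per class under `train/` and `val/`. A minimal sketch of how such a folder is typically loaded (the path and transforms below are examples, not values used by the scripts):
```python
from torchvision import datasets, transforms

# Expected directory layout (assumption):
#   {imagenet_path}/
#       train/<class_name>/*.JPEG
#       val/<class_name>/*.JPEG
val_set = datasets.ImageFolder(
    '{imagenet_path}/val',
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)
```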

View File

@ -1,11 +0,0 @@
# SparseGPT for ResNet
For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md)
## Usage
```shell
python examples/model_examples/ResNet/sparse_gpt/resnet18_sparse_gpt.py --data {imagenet_path} --batchsize 128 --num_samples 512
```
**Note**: this imagenet folder follows torch format.

View File

@ -1,15 +1,44 @@
# Llama
# Examples for LLaMA
## SparseGPT for LL
## SparseGPT
For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md)
### Usage
```shell
python examples/model_examples/language_models/Llama/llama_sparse_gpt.py -h
usage: llama_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M]
model {wikitext2,ptb,c4}
# example for decapoda-research/llama-7b-hf
python projects/mmrazor_large/examples/language_models/LLaMA/llama_sparse_gpt.py decapoda-research/llama-7b-hf c4
# help
usage: llama_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}
positional arguments:
model Llama model to load
{wikitext2,ptb,c4} Where to extract calibration data from.
optional arguments:
-h, --help show this help message and exit
--seed SEED Seed for sampling the calibration data.
--nsamples NSAMPLES Number of calibration data samples.
--batch_size BATCH_SIZE
Batchsize for calibration and evaluation.
--save SAVE Path to saved model.
-m M Whether to enable memory efficient forward
```
## GPTQ
For more details about GPTQ, please refer to [GPTQ](../../../algorithms/GPTQ.md)
### Usage
```shell
# example for decapoda-research/llama-7b-hf
python projects/mmrazor_large/examples/language_models/LLaMA/llama_gptq.py decapoda-research/llama-7b-hf c4
# help
usage: llama_gptq.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}
positional arguments:
model Llama model to load
@ -23,7 +52,4 @@ optional arguments:
Batchsize for calibration and evaluation.
--save SAVE Path to saved model.
-m M Whether to enable memory efficient forward
# For example, prune decapoda-research/llama-7b-hf
python examples/model_examples/language_models/Llama/llama_sparse_gpt.py decapoda-research/llama-7b-hf c4
```

View File

@ -1,15 +1,17 @@
# OPT
# Examples for OPT
## SparseGPT for OPT
## SparseGPT
For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md)
### Usage
```shell
python examples/model_examples/language_models/OPT/opt_sparse_gpt.py -h
usage: opt_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M]
model {wikitext2,ptb,c4}
# example for facebook/opt-125m
python projects/mmrazor_large/examples/language_models/OPT/opt_sparse_gpt.py facebook/opt-125m c4
# help
usage: opt_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}
positional arguments:
model OPT model to load; pass `facebook/opt-X`.
@ -23,7 +25,4 @@ optional arguments:
Batchsize for calibration and evaluation.
--save SAVE Path to saved model.
-m M Whether to enable memory efficient forward
# For example, prune facebook/opt-125m
python examples/model_examples/language_models/OPT/opt_sparse_gpt.py facebook/opt-125m c4
```