add gptq readme

parent a7a35debc0
commit 2df570b241
@ -0,0 +1,56 @@
# GPTQ

> [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)

<!-- [ALGORITHM] -->

## Abstract

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
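As a rough illustration of the storage side of this idea, the sketch below applies plain round-to-nearest 4-bit quantization to a weight matrix, with a per-row scale and zero-point. This is only an illustrative example, not mmrazor or GPTQ code; GPTQ itself additionally uses approximate second-order (Hessian) information to decide how each weight is rounded, which is what keeps accuracy close to the FP16 baseline at 3-4 bits.

```python
import torch


def quantize_rtn_4bit(w: torch.Tensor):
    """Round-to-nearest 4-bit quantization with a per-row scale and zero-point."""
    qmax = 2**4 - 1  # 16 integer levels: 0..15
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((w - w_min) / scale), 0, qmax)  # integer codes
    w_hat = q * scale + w_min  # dequantized weights used at inference time
    return q.to(torch.uint8), scale, w_min, w_hat


w = torch.randn(4, 8)
q, scale, zero_point, w_hat = quantize_rtn_4bit(w)
print((w - w_hat).abs().max())  # round-to-nearest quantization error
```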
## Usage

GPTQ is easy to use in MMRazor. You can use it as follows:
```python
from mmrazor.implementations.quantization import gptq

# prepare your model and dataloaders beforehand
model = ...
train_loader, test_loader = ...

# initialize the GPTQ compressor and prepare the model for quantization
compressor = gptq.GPTQCompressor()
compressor.prepare(model)

# collect Hessian statistics by running calibration data through the model
compressor.init_hessian()
compressor.register_hessian_hooks()
infer(model, test_loader, num_samples=num_samples)
compressor.remove_hessian_hooks()

# quantize the weights with the default quantization config
compressor.quant_with_default_qconfig()

# convert back to a plain torch model
model = compressor.to_static_model(model)
```
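The `infer` helper above is not defined in the snippet; it only needs to push a limited number of calibration samples through the model so that the registered hooks can accumulate Hessian statistics. A minimal sketch, assuming a standard classification dataloader that yields `(images, labels)` batches:

```python
import torch


def infer(model, dataloader, num_samples=128):
    """Forward a limited number of calibration samples; the outputs are discarded."""
    model.eval()
    device = next(model.parameters()).device
    seen = 0
    with torch.no_grad():
        for images, _ in dataloader:
            model(images.to(device))
            seen += images.size(0)
            if seen >= num_samples:
                break
```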
## Full Examples

- [ResNet](../examples/ResNet/README.md)
- [LLaMA](../examples/language_models/LLaMA/README.md)

## Cite

```latex
@misc{Frantar_Ashkboos_Hoefler_Alistarh_2022,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  year={2022},
  month={Oct},
  language={en-US}
}
```
@ -1,6 +1,10 @@
# SparseGPT

> [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774)

<!-- [ALGORITHM] -->

## Abstract

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
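For intuition about the 2:4 semi-structured pattern mentioned above (two of every four consecutive weights are zero, a layout that NVIDIA sparse tensor cores can accelerate), the sketch below applies a simple magnitude-based 2:4 mask. This is only an illustration; SparseGPT itself selects and compensates the pruned weights using approximate second-order information rather than plain magnitudes.

```python
import torch


def mask_2_of_4(w: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every group of 4 along the last dim."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    drop_idx = groups.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)
    return (groups * mask).reshape(w.shape)


w = torch.randn(8, 16)
w_sparse = mask_2_of_4(w)
print((w_sparse == 0).float().mean())  # -> 0.5, i.e. 50% sparsity
```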
@ -19,7 +23,8 @@ train_loader, test_loader
compressor = sparse_gpt.SparseGptCompressor()
compressor.prepare(model)

# collect Hessian statistics by running calibration data through the model
compressor.init_hessian()
compressor.register_hessian_hooks()
infer(model, test_loader, num_samples=num_samples)
compressor.remove_hessian_hooks()
@ -34,9 +39,9 @@ model = compressor.to_static_model(model)
## Full Examples

- [ResNet](../examples/ResNet/README.md)
- [OPT](../examples/language_models/OPT/README.md)
- [LLaMA](../examples/language_models/LLaMA/README.md)

## Cite
@ -0,0 +1,25 @@
# Examples for ResNet

## SparseGPT

For more details about SparseGPT, please refer to [SparseGPT](../../algorithms/SparseGPT.md).

### Usage

```shell
python projects/mmrazor_large/examples/ResNet/resnet18_sparse_gpt.py --data {imagenet_path} --batchsize 128 --num_samples 512
```

**Note**: the ImageNet folder at `{imagenet_path}` should follow the standard torch format (see the layout sketch below).
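"Torch format" here is assumed to mean the directory layout used by `torchvision.datasets.ImageFolder`, i.e. `train/` and `val/` sub-folders each containing one sub-folder per class. A minimal sketch of building a validation loader in that layout (the example script may construct its loaders differently):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
# expects {imagenet_path}/val/<class_name>/<image>.JPEG
val_set = datasets.ImageFolder("{imagenet_path}/val", transform=transform)
val_loader = DataLoader(val_set, batch_size=128, num_workers=4)
```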
## GPTQ

For more details about GPTQ, please refer to [GPTQ](../../algorithms/GPTQ.md).

### Usage

```shell
python projects/mmrazor_large/examples/ResNet/resnet18_gptq.py --data {imagenet_path} --batchsize 128 --num_samples 512
```

**Note**: the ImageNet folder should follow the same torch format as above.
@ -1,15 +1,44 @@
# Examples for LLaMA

## SparseGPT

For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md).

### Usage

```shell
# example for decapoda-research/llama-7b-hf
python projects/mmrazor_large/examples/language_models/LLaMA/llama_sparse_gpt.py decapoda-research/llama-7b-hf c4

# help
usage: llama_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}

positional arguments:
  model                 Llama model to load
  {wikitext2,ptb,c4}    Where to extract calibration data from.

optional arguments:
  -h, --help            show this help message and exit
  --seed SEED           Seed for sampling the calibration data.
  --nsamples NSAMPLES   Number of calibration data samples.
  --batch_size BATCH_SIZE
                        Batchsize for calibration and evaluation.
  --save SAVE           Path to saved model.
  -m M                  Whether to enable memory efficient forward
```
## GPTQ

For more details about GPTQ, please refer to [GPTQ](../../../algorithms/GPTQ.md).

### Usage

```shell
# example for decapoda-research/llama-7b-hf
python projects/mmrazor_large/examples/language_models/LLaMA/llama_gptq.py decapoda-research/llama-7b-hf c4

# help
usage: llama_gptq.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}

positional arguments:
  model                 Llama model to load
@ -23,7 +52,4 @@ optional arguments:
                        Batchsize for calibration and evaluation.
  --save SAVE           Path to saved model.
  -m M                  Whether to enable memory efficient forward
```
@ -1,15 +1,17 @@
# Examples for OPT

## SparseGPT

For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md).

### Usage

```shell
# example for facebook/opt-125m
python projects/mmrazor_large/examples/language_models/OPT/opt_sparse_gpt.py facebook/opt-125m c4

# help
usage: opt_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}

positional arguments:
  model                 OPT model to load; pass `facebook/opt-X`.
@ -23,7 +25,4 @@ optional arguments:
                        Batchsize for calibration and evaluation.
  --save SAVE           Path to saved model.
  -m M                  Whether to enable memory efficient forward
```