From 2df570b2419cf3af2d0eef9c432bb54f2c6ae568 Mon Sep 17 00:00:00 2001
From: humu789
Date: Wed, 17 May 2023 19:00:17 +0800
Subject: [PATCH] add gptq readme

---
 projects/mmrazor_large/algorithms/GPTQ.md     | 56 +++++++++++++++++++
 .../mmrazor_large/algorithms/SparseGPT.md     | 13 +++--
 .../mmrazor_large/examples/ResNet/README.md   | 25 +++++++++
 .../ResNet/{gptq => }/resnet18_gptq.py        |  0
 .../{sparse_gpt => }/resnet18_sparse_gpt.py   |  0
 .../examples/ResNet/sparse_gpt/README.md      | 11 ----
 .../examples/language_models/LLaMA/README.md  | 42 +++++++++++---
 .../examples/language_models/OPT/README.md    | 15 +++--
 8 files changed, 131 insertions(+), 31 deletions(-)
 create mode 100644 projects/mmrazor_large/algorithms/GPTQ.md
 create mode 100644 projects/mmrazor_large/examples/ResNet/README.md
 rename projects/mmrazor_large/examples/ResNet/{gptq => }/resnet18_gptq.py (100%)
 rename projects/mmrazor_large/examples/ResNet/{sparse_gpt => }/resnet18_sparse_gpt.py (100%)
 delete mode 100644 projects/mmrazor_large/examples/ResNet/sparse_gpt/README.md

diff --git a/projects/mmrazor_large/algorithms/GPTQ.md b/projects/mmrazor_large/algorithms/GPTQ.md
new file mode 100644
index 00000000..b013a73a
--- /dev/null
+++ b/projects/mmrazor_large/algorithms/GPTQ.md
@@ -0,0 +1,56 @@
+# GPTQ
+
+> [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
+
+## Abstract
+
+Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques is limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
+
+## Usage
+
+GPTQ is easy to use in mmrazor.
+You can use it like this:
+
+```python
+from mmrazor.implementations.quantization import gptq
+
+# supply your own model and calibration/eval dataloaders
+model = ...
+train_loader = ...
+test_loader = ...
+
+## init gptq compressor and prepare for quantization
+compressor = gptq.GPTQCompressor()
+compressor.prepare(model)
+
+## collect Hessian information from calibration samples
+compressor.init_hessian()
+compressor.register_hessian_hooks()
+num_samples = 512  # number of calibration samples, matching the example scripts
+infer(model, test_loader, num_samples=num_samples)  # your own forward loop over calibration data
+compressor.remove_hessian_hooks()
+
+## quantize with the default qconfig
+compressor.quant_with_default_qconfig()
+
+## convert back to a normal torch model
+model = compressor.to_static_model(model)
+```
+
+## Full Examples
+
+- [ResNet](../examples/ResNet/README.md)
+- [LLaMA](../examples/language_models/LLaMA/README.md)
+
+## Cite
+
+```latex
+@misc{Frantar_Ashkboos_Hoefler_Alistarh_2022,
+  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
+  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
+  year={2022},
+  month={Oct},
+  language={en-US}
+}
+```
diff --git a/projects/mmrazor_large/algorithms/SparseGPT.md b/projects/mmrazor_large/algorithms/SparseGPT.md
index 8cf18254..479235ba 100644
--- a/projects/mmrazor_large/algorithms/SparseGPT.md
+++ b/projects/mmrazor_large/algorithms/SparseGPT.md
@@ -1,6 +1,10 @@
 # SparseGPT
 
-## abstract
+> [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774)
+
+## Abstract
 
 We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
@@ -19,7 +23,8 @@ train_loader, test_loader
 compressor = sparse_gpt.SparseGptCompressor()
 compressor.prepare(model)
 
-## init hessian matrix
+## get hessian matrix
+compressor.init_hessian()
 compressor.register_hessian_hooks()
 infer(model, test_loader, num_samples=num_samples)
 compressor.remove_hessian_hooks()
@@ -34,9 +39,9 @@ model = compressor.to_static_model(model)
 
 ## Full Examples
 
-- [ResNet](../examples/ResNet/sparse_gpt/README.md)
+- [ResNet](../examples/ResNet/README.md)
 - [OPT](../examples/language_models/OPT/README.md)
-- [Llama](../examples/language_models/Llama/README.md)
+- [LLaMA](../examples/language_models/LLaMA/README.md)
 
 ## Cite
diff --git a/projects/mmrazor_large/examples/ResNet/README.md b/projects/mmrazor_large/examples/ResNet/README.md
new file mode 100644
index 00000000..aa4eb374
--- /dev/null
+++ b/projects/mmrazor_large/examples/ResNet/README.md
@@ -0,0 +1,25 @@
+# Examples for ResNet
+
+## SparseGPT
+
+For more details about SparseGPT, please refer to [SparseGPT](../../algorithms/SparseGPT.md).
+
+### Usage
+
+```shell
+python projects/mmrazor_large/examples/ResNet/resnet18_sparse_gpt.py --data {imagenet_path} --batchsize 128 --num_samples 512
+```
+
+**Note**: `{imagenet_path}` should point to an ImageNet folder in the standard torchvision `ImageFolder` layout, as sketched below.
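+
+A minimal sketch of how such a folder is typically consumed via the standard `torchvision` API (class and file names below are illustrative, not part of this repo):
+
+```python
+import torchvision.datasets as datasets
+import torchvision.transforms as transforms
+
+# expected layout:
+#   {imagenet_path}/train/<class_name>/<image>.JPEG
+#   {imagenet_path}/val/<class_name>/<image>.JPEG
+val_set = datasets.ImageFolder(
+    '{imagenet_path}/val',
+    transforms.Compose([
+        transforms.Resize(256),
+        transforms.CenterCrop(224),
+        transforms.ToTensor(),
+    ]))
+```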
+
+## GPTQ
+
+For more details about GPTQ, please refer to [GPTQ](../../algorithms/GPTQ.md).
+
+### Usage
+
+```shell
+python projects/mmrazor_large/examples/ResNet/resnet18_gptq.py --data {imagenet_path} --batchsize 128 --num_samples 512
+```
+
+**Note**: `{imagenet_path}` must follow the same torchvision `ImageFolder` layout described above.
diff --git a/projects/mmrazor_large/examples/ResNet/gptq/resnet18_gptq.py b/projects/mmrazor_large/examples/ResNet/resnet18_gptq.py
similarity index 100%
rename from projects/mmrazor_large/examples/ResNet/gptq/resnet18_gptq.py
rename to projects/mmrazor_large/examples/ResNet/resnet18_gptq.py
diff --git a/projects/mmrazor_large/examples/ResNet/sparse_gpt/resnet18_sparse_gpt.py b/projects/mmrazor_large/examples/ResNet/resnet18_sparse_gpt.py
similarity index 100%
rename from projects/mmrazor_large/examples/ResNet/sparse_gpt/resnet18_sparse_gpt.py
rename to projects/mmrazor_large/examples/ResNet/resnet18_sparse_gpt.py
diff --git a/projects/mmrazor_large/examples/ResNet/sparse_gpt/README.md b/projects/mmrazor_large/examples/ResNet/sparse_gpt/README.md
deleted file mode 100644
index 0541fb03..00000000
--- a/projects/mmrazor_large/examples/ResNet/sparse_gpt/README.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# SparseGPT for ResNet
-
-For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md)
-
-## Usage
-
-```shell
-python examples/model_examples/ResNet/sparse_gpt/resnet18_sparse_gpt.py --data {imagenet_path} --batchsize 128 --num_samples 512
-```
-
-**Note**: this imagenet folder follows torch format.
diff --git a/projects/mmrazor_large/examples/language_models/LLaMA/README.md b/projects/mmrazor_large/examples/language_models/LLaMA/README.md
index 5c50323f..7d9862de 100644
--- a/projects/mmrazor_large/examples/language_models/LLaMA/README.md
+++ b/projects/mmrazor_large/examples/language_models/LLaMA/README.md
@@ -1,15 +1,44 @@
-# Llama
+# Examples for LLaMA
 
-## SparseGPT for LL
+## SparseGPT
 
 For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md)
 
 ### Usage
 
 ```shell
-python examples/model_examples/language_models/Llama/llama_sparse_gpt.py -h
-usage: llama_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M]
-                           model {wikitext2,ptb,c4}
+# example for decapoda-research/llama-7b-hf
+python projects/mmrazor_large/examples/language_models/LLaMA/llama_sparse_gpt.py decapoda-research/llama-7b-hf c4
+
+# help
+usage: llama_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}
+
+positional arguments:
+  model                Llama model to load
+  {wikitext2,ptb,c4}   Where to extract calibration data from.
+
+optional arguments:
+  -h, --help           show this help message and exit
+  --seed SEED          Seed for sampling the calibration data.
+  --nsamples NSAMPLES  Number of calibration data samples.
+  --batch_size BATCH_SIZE
+                       Batchsize for calibration and evaluation.
+  --save SAVE          Path to saved model.
+  -m M                 Whether to enable memory efficient forward
+```
+
+## GPTQ
+
+For more details about GPTQ, please refer to [GPTQ](../../../algorithms/GPTQ.md).
+
+### Usage
+
+```shell
+# example for decapoda-research/llama-7b-hf
+python projects/mmrazor_large/examples/language_models/LLaMA/llama_gptq.py decapoda-research/llama-7b-hf c4
+
+# help
+usage: llama_gptq.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}
 
 positional arguments:
   model                Llama model to load
   {wikitext2,ptb,c4}   Where to extract calibration data from.
 
 optional arguments:
   -h, --help           show this help message and exit
   --seed SEED          Seed for sampling the calibration data.
   --nsamples NSAMPLES  Number of calibration data samples.
   --batch_size BATCH_SIZE
                        Batchsize for calibration and evaluation.
   --save SAVE          Path to saved model.
   -m M                 Whether to enable memory efficient forward
-
-# For example, prune decapoda-research/llama-7b-hf
-python examples/model_examples/language_models/Llama/llama_sparse_gpt.py decapoda-research/llama-7b-hf c4
 ```
diff --git a/projects/mmrazor_large/examples/language_models/OPT/README.md b/projects/mmrazor_large/examples/language_models/OPT/README.md
index ee139739..4af52b5d 100644
--- a/projects/mmrazor_large/examples/language_models/OPT/README.md
+++ b/projects/mmrazor_large/examples/language_models/OPT/README.md
@@ -1,15 +1,17 @@
-# OPT
+# Examples for OPT
 
-## SparseGPT for OPT
+## SparseGPT
 
 For more details about SparseGPT, please refer to [SparseGPT](../../../algorithms/SparseGPT.md)
 
 ### Usage
 
 ```shell
-python examples/model_examples/language_models/OPT/opt_sparse_gpt.py -h
-usage: opt_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M]
-                         model {wikitext2,ptb,c4}
+# example for facebook/opt-125m
+python projects/mmrazor_large/examples/language_models/OPT/opt_sparse_gpt.py facebook/opt-125m c4
+
+# help
+usage: opt_sparse_gpt.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--batch_size BATCH_SIZE] [--save SAVE] [-m M] model {wikitext2,ptb,c4}
 
 positional arguments:
   model                OPT model to load; pass `facebook/opt-X`.
   {wikitext2,ptb,c4}   Where to extract calibration data from.
 
 optional arguments:
   -h, --help           show this help message and exit
   --seed SEED          Seed for sampling the calibration data.
   --nsamples NSAMPLES  Number of calibration data samples.
   --batch_size BATCH_SIZE
                        Batchsize for calibration and evaluation.
   --save SAVE          Path to saved model.
   -m M                 Whether to enable memory efficient forward
-
-# For example, prune facebook/opt-125m
-python examples/model_examples/language_models/OPT/opt_sparse_gpt.py facebook/opt-125m c4
 ```
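+
+If `--save` is passed, the pruned model is written to the given path. A minimal reload sketch, assuming the checkpoint is stored in standard Hugging Face `transformers` format (an assumption to verify against how your run actually saved it):
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# hypothetical path given earlier via --save
+model = AutoModelForCausalLM.from_pretrained('./opt-125m-sparse-gpt')
+tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')
+```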