[Feature] Add CAM visualization tool (#577)
* add cam-grad tool * refactor cam-grad tool * add docs * update docs * Update docs and support Transformer * remove pictures and use link * replace example img and finish EN docs * improve docs * improve code * Fix MobileNet V3 configs * Refactor to support more powerful feature extraction. * Add unit tests * Fix unit test * fix distortion of visualization examples in docs * fix distortion * fix distortion * fix distortion * merge master * merge fix conflicts * Improve the tool * Support using both attribute name and index to get layer * add default get_target-layers * add default get_target-layers * update docs * update docs * add additional print info when not using target-layers * Improve docs * Fix enumerate list. Co-authored-by: mzr1996 <mzr1996@163.com>
parent da39ca6898
commit 131d8c6296

Changed paths:
configs/mobilenet_v3
docs/en/tools
docs/zh_CN/tools
mmcls/models/backbones
tools/visualizations
configs/mobilenet_v3

@@ -17,7 +17,7 @@
 # - modify: RandomErasing use RE-M instead of RE-0
 _base_ = [
-    '../_base_/models/mobilenet-v3-large_8xb32_in1k.py',
+    '../_base_/models/mobilenet_v3_large_imagenet.py',
     '../_base_/datasets/imagenet_bs32_pil_resize.py',
     '../_base_/default_runtime.py'
 ]
(Three binary image files added, not shown: 72 KiB, 727 KiB and 26 KiB.)
docs/en/tools

@@ -5,6 +5,7 @@
 - [Visualization](#visualization)
   - [Pipeline Visualization](#pipeline-visualization)
   - [Learning Rate Schedule Visualization](#learning-rate-schedule-visualization)
+  - [Class Activation Map Visualization](#class-activation-map-visualization)
 - [FAQs](#faqs)

 <!-- TOC -->

@@ -52,27 +53,27 @@ python tools/visualizations/vis_pipeline.py \

1. Visualize all the transformed pictures of the `ImageNet` training set and display them in pop-up windows:

   ```shell
   python ./tools/visualizations/vis_pipeline.py ./configs/resnet/resnet50_8xb32_in1k.py --show --mode pipeline
   ```

   <div align=center><img src="../_static/image/tools/visualization/pipeline-pipeline.jpg" style=" width: auto; height: 40%; "></div>

2. Visualize 10 comparison pictures in the `ImageNet` training set and save them in the `./tmp` folder:

   ```shell
   python ./tools/visualizations/vis_pipeline.py configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py --phase train --output-dir tmp --number 10 --adaptive
   ```

   <div align=center><img src="../_static/image/tools/visualization/pipeline-concat.jpg" style=" width: auto; height: 40%; "></div>

3. Visualize 100 original pictures in the `CIFAR100` validation set, then display and save them in the `./tmp` folder:

   ```shell
   python ./tools/visualizations/vis_pipeline.py configs/resnet/resnet50_8xb16_cifar100.py --phase val --output-dir tmp --mode original --number 100 --show --adaptive --bgr2rgb
   ```

   <div align=center><img src="../_static/image/tools/visualization/pipeline-original.jpg" style=" width: auto; height: 40%; "></div>

## Learning Rate Schedule Visualization

@@ -119,6 +120,173 @@ python tools/visualizations/vis_lr.py configs/repvgg/repvgg-B3g4_4xb64-autoaug-l

<div align=center><img src="../_static/image/tools/visualization/lr_schedule2.png" style=" width: auto; height: 40%; "></div>

## Class Activation Map Visualization

MMClassification provides the `tools/visualizations/vis_cam.py` tool to visualize class activation maps. Please use the `pip install "grad-cam>=1.3.6"` command to install [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam).

The supported methods are as follows:

| Method       | What it does |
| ------------ | ------------ |
| GradCAM      | Weights the 2D activations by the average gradient |
| GradCAM++    | Like GradCAM, but uses second-order gradients |
| XGradCAM     | Like GradCAM, but scales the gradients by the normalized activations |
| EigenCAM     | Takes the first principal component of the 2D activations (no class discrimination, but seems to give great results) |
| EigenGradCAM | Like EigenCAM, but with class discrimination: the first principal component of Activations\*Grad. Looks like GradCAM, but cleaner |
| LayerCAM     | Spatially weights the activations by positive gradients; works better especially in lower layers |

**Command**:

```bash
python tools/visualizations/vis_cam.py \
    ${IMG} \
    ${CONFIG_FILE} \
    ${CHECKPOINT} \
    [--target-layers ${TARGET-LAYERS}] \
    [--preview-model] \
    [--method ${METHOD}] \
    [--target-category ${TARGET-CATEGORY}] \
    [--save-path ${SAVE_PATH}] \
    [--vit-like] \
    [--num-extra-tokens ${NUM-EXTRA-TOKENS}] \
    [--aug-smooth] \
    [--eigen-smooth] \
    [--device ${DEVICE}] \
    [--cfg-options ${CFG-OPTIONS}]
```

**Description of all arguments**:

- `img` : The target picture path.
- `config` : The path of the model config file.
- `checkpoint` : The path of the checkpoint.
- `--target-layers` : The target layers to get activation maps; one or more network layers can be specified. If not set, use the norm layer of the last block.
- `--preview-model` : Whether to print all network layer names in the model.
- `--method` : Visualization method, supports `GradCAM`, `GradCAM++`, `XGradCAM`, `EigenCAM`, `EigenGradCAM`, `LayerCAM`, which is case insensitive. Defaults to `GradCAM`.
- `--target-category` : Target category; if not set, use the category detected by the given model.
- `--save-path` : The path to save the CAM visualization image. If not set, the CAM image will not be saved.
- `--vit-like` : Whether the network is a ViT-like network.
- `--num-extra-tokens` : The number of extra tokens in ViT-like backbones. If not set, use the `num_extra_tokens` attribute of the backbone.
- `--aug-smooth` : Whether to use TTA (test-time augmentation) to get the CAM.
- `--eigen-smooth` : Whether to use the principal component to reduce noise.
- `--device` : The computing device to use. Defaults to `'cpu'`.
- `--cfg-options` : Modifications to the configuration file, refer to [Tutorial 1: Learn about Configs](https://mmclassification.readthedocs.io/en/latest/tutorials/config.html).

```{note}
The argument `--preview-model` prints all network layer names in the given model. It is helpful if you know nothing about the model layers when setting `--target-layers`.
```

**Examples(CNN)**:

Here are some examples of `target-layers` in ResNet-50, which can be any module or layer:

- `'backbone.layer4'` means the output of the fourth ResLayer.
- `'backbone.layer4.2'` means the output of the third BottleNeck block in the fourth ResLayer.
- `'backbone.layer4.2.conv1'` means the output of the `conv1` layer in the above BottleNeck block.

```{note}
For `ModuleList` or `Sequential`, you can also use the index to specify which sub-module is the target layer, as the sketch below shows.

For example, `backbone.layer4[-1]` is the same as `backbone.layer4.2`, since `layer4` is a `Sequential` with three sub-modules.
```
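The equivalence of the two spellings is easy to check in plain PyTorch. The following minimal sketch (illustrative only, not part of the tool) uses a bare `Sequential` as a stand-in for `backbone.layer4`; `nn.Sequential` registers its children under the string names `'0'`, `'1'` and `'2'`, so index access and attribute-name access resolve to the same sub-module:

```python
import torch.nn as nn

# A stand-in for `backbone.layer4`: a Sequential with three sub-modules.
layer4 = nn.Sequential(
    nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1))

assert layer4[-1] is layer4[2]            # index style, as in layer4[-1]
assert layer4[2] is getattr(layer4, '2')  # attribute style, as in layer4.2
```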
1. Use different methods to visualize CAM for `ResNet50`; the `target-category` is the result predicted by the given checkpoint, using the default `target-layers`.

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/resnet/resnet50_8xb32_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
       --method GradCAM
       # GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM
   ```

   | Image | GradCAM | GradCAM++ | EigenGradCAM | LayerCAM |
   | ----- | ------- | --------- | ------------ | -------- |
   | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429496-628d3fb3-1f6e-41ff-aa5c-1b08c60c32a9.JPEG' height="auto" width="160" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065002-f1c86516-38b2-47ba-90c1-e00b49556c70.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065119-82581fa1-3414-4d6c-a849-804e1503c74b.jpg' height="auto" width="150"></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065096-75a6a2c1-6c57-4789-ad64-ebe5e38765f4.jpg' height="auto" width="150"></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065129-814d20fb-98be-4106-8c5e-420adcc85295.jpg' height="auto" width="150"></div> |

2. Use different `target-category` to get the CAM from the same picture. In the `ImageNet` dataset, category 238 is 'Greater Swiss Mountain dog' and category 281 is 'tabby, tabby cat'.

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/cat-dog.png configs/resnet/resnet50_8xb32_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
       --target-layers 'backbone.layer4.2' \
       --method GradCAM \
       --target-category 238
       # --target-category 281
   ```

   | Category | Image | GradCAM | XGradCAM | LayerCAM |
   | -------- | ----- | ------- | -------- | -------- |
   | Dog | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429526-f27f4cce-89b9-4117-bfe6-55c2ca7eaba6.png' height="auto" width="165" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144433562-968a57bc-17d9-413e-810e-f91e334d648a.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144433853-319f3a8f-95f2-446d-b84f-3028daca5378.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144433937-daef5a69-fd70-428f-98a3-5e7747f4bb88.jpg' height="auto" width="150" ></div> |
   | Cat | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429526-f27f4cce-89b9-4117-bfe6-55c2ca7eaba6.png' height="auto" width="165" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144434518-867ae32a-1cb5-4dbd-b1b9-5e375e94ea48.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144434603-0a2fd9ec-c02e-4e6c-a17b-64c234808c56.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144434623-b4432cc2-c663-4b97-aed3-583d9d3743e6.jpg' height="auto" width="150" ></div> |

3. Use `--eigen-smooth` and `--aug-smooth` to improve the visual effects.

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/dog.jpg \
       configs/mobilenet_v3/mobilenet-v3-large_8xb32_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth \
       --target-layers 'backbone.layer16' \
       --method LayerCAM \
       --eigen-smooth --aug-smooth
   ```

   | Image | LayerCAM | eigen-smooth | aug-smooth | eigen&aug |
   | ----- | -------- | ------------ | ---------- | --------- |
   | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557492-98ac5ce0-61f9-4da9-8ea7-396d0b6a20fa.jpg' height="auto" width="160"></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557541-a4cf7d86-7267-46f9-937c-6f657ea661b4.jpg' height="auto" width="145" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557547-2731b53e-e997-4dd2-a092-64739cc91959.jpg' height="auto" width="145" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557545-8189524a-eb92-4cce-bf6a-760cab4a8065.jpg' height="auto" width="145" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557548-c1e3f3ec-3c96-43d4-874a-3b33cd3351c5.jpg' height="auto" width="145" ></div> |

**Examples(Transformer)**:

Here are some examples:

- `'backbone.norm3'` for Swin-Transformer;
- `'backbone.layers[-1].ln1'` for ViT.

For ViT-like networks, such as ViT, T2T-ViT and Swin-Transformer, the features are flattened. To draw the CAM, we need to specify the `--vit-like` argument to reshape the features into square feature maps.

Besides the flattened features, some ViT-like networks also add extra tokens, like the class token in ViT and T2T-ViT, and the distillation token in DeiT. In these networks, the final classification is done on the tokens computed in the last attention block; therefore, the classification score will not be affected by the other features, and the gradient of the classification score with respect to them will be zero. As a result, you shouldn't use the output of the last attention block as the target layer in these networks.
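This zero-gradient behavior is easy to verify with a tiny PyTorch sketch (illustrative only, not part of the tool): if the classification score is built from the class token alone, the gradient with respect to every patch token is exactly zero.

```python
import torch

# 1 class token + 14*14 patch tokens, as in ViT-B/16 on a 224x224 input.
tokens = torch.randn(1, 197, 768, requires_grad=True)

# A stand-in "classification score" that depends only on the class token.
score = tokens[:, 0].sum()
score.backward()

assert torch.all(tokens.grad[:, 1:] == 0)  # patch-token gradients vanish
```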
To exclude these extra tokens, we need to know the number of extra tokens. Almost all Transformer-based backbones in MMClassification have the `num_extra_tokens` attribute. If you want to use this tool on a new or third-party network that doesn't have the `num_extra_tokens` attribute, please specify it with the `--num-extra-tokens` argument; the sketch below shows what the reshaping does.
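Concretely, with `--vit-like` the tool drops the extra tokens and folds the remaining patch tokens back into a square, CNN-style feature map. The following hedged sketch mirrors the idea (shapes assume ViT-B/16 at 224x224; the tool's actual implementation is `build_reshape_transform` in `vis_cam.py`):

```python
import torch

num_extra_tokens = 1                      # e.g. the ViT class token
feat = torch.randn(1, 1 + 14 * 14, 768)   # flattened (B, tokens, C) features

patch_tokens = feat[:, num_extra_tokens:, :]  # drop the extra tokens
h = w = int(patch_tokens.shape[1] ** 0.5)     # 196 tokens -> a 14x14 map
fmap = patch_tokens.reshape(1, h, w, -1).permute(0, 3, 1, 2)
print(fmap.shape)  # torch.Size([1, 768, 14, 14]), i.e. (B, C, H, W)
```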
1. Visualize CAM for `Swin Transformer`, using the default `target-layers`:

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/swin_transformer/swin-tiny_16xb64_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth \
       --vit-like
   ```

2. Visualize CAM for `Vision Transformer (ViT)`:

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/vision_transformer/vit-base-p16_ft-64xb64_in1k-384.py \
       https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth \
       --vit-like \
       --target-layers 'backbone.layers[-1].ln1'
   ```

3. Visualize CAM for `T2T-ViT`:

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_3rdparty_8xb64_in1k_20210928-b7c09b62.pth \
       --vit-like \
       --target-layers 'backbone.encoder[-1].ln1'
   ```

| Image | ResNet50 | ViT | Swin | T2T-ViT |
| ----- | -------- | --- | ---- | ------- |
| <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429496-628d3fb3-1f6e-41ff-aa5c-1b08c60c32a9.JPEG' height="auto" width="165" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144431491-a2e19fe3-5c12-4404-b2af-a9552f5a95d9.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144436218-245a11de-6234-4852-9c08-ff5069f6a739.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144436168-01b0e565-442c-4e1e-910c-17c62cff7cd3.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144436198-51dbfbda-c48d-48cc-ae06-1a923d19b6f6.jpg' height="auto" width="150" ></div> |

## FAQs

- None
docs/zh_CN/tools

@@ -5,6 +5,7 @@
 - [Visualization](#visualization)
   - [Pipeline Visualization](#pipeline-visualization)
   - [Learning Rate Schedule Visualization](#learning-rate-schedule-visualization)
+  - [Class Activation Map Visualization](#class-activation-map-visualization)
 - [FAQs](#faqs)

 <!-- TOC -->

@@ -53,27 +54,27 @@ python tools/visualizations/vis_pipeline.py \

1. Visualize all the preprocessed pictures of the `ImageNet` training set and display them in pop-up windows:

   ```shell
   python ./tools/visualizations/vis_pipeline.py ./configs/resnet/resnet50_8xb32_in1k.py --show --mode pipeline
   ```

   <div align=center><img src="../_static/image/tools/visualization/pipeline-pipeline.jpg" style=" width: auto; height: 40%; "></div>

2. Visualize comparison figures of 10 original vs. preprocessed pictures from the `ImageNet` training set and save them in the `./tmp` folder:

   ```shell
   python ./tools/visualizations/vis_pipeline.py configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py --phase train --output-dir tmp --number 10 --adaptive
   ```

   <div align=center><img src="../_static/image/tools/visualization/pipeline-concat.jpg" style=" width: auto; height: 40%; "></div>

3. Visualize 100 original pictures of the `CIFAR100` validation set, then display and save them in the `./tmp` folder:

   ```shell
   python ./tools/visualizations/vis_pipeline.py configs/resnet/resnet50_8xb16_cifar100.py --phase val --output-dir tmp --mode original --number 100 --show --adaptive --bgr2rgb
   ```

   <div align=center><img src="../_static/image/tools/visualization/pipeline-original.jpg" style=" width: auto; height: 40%; "></div>

## Learning Rate Schedule Visualization

@@ -122,6 +123,171 @@ python tools/visualizations/vis_lr.py configs/repvgg/repvgg-B3g4_4xb64-autoaug-l

<div align=center><img src="../_static/image/tools/visualization/lr_schedule2.png" style=" width: auto; height: 40%; "></div>

## Class Activation Map Visualization

MMClassification provides the `tools/visualizations/vis_cam.py` tool to visualize class activation maps. Please use `pip install "grad-cam>=1.3.6"` to install the dependency [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam).

The currently supported methods are:

| Method | What it does |
|:------------:|:------------:|
| GradCAM | Weights the 2D activations by the average gradient |
| GradCAM++ | Like GradCAM, but uses second-order gradients |
| XGradCAM | Like GradCAM, but weights the gradients by the normalized activations |
| EigenCAM | Uses the first principal component of the 2D activations (no class discrimination, but the results seem good) |
| EigenGradCAM | Like EigenCAM, but with class discrimination: uses the first principal component of Activations\*Grad; looks like GradCAM, but cleaner |
| LayerCAM | Spatially weights the activations by positive gradients; works better for shallow layers |

**Command**:

```bash
python tools/visualizations/vis_cam.py \
    ${IMG} \
    ${CONFIG_FILE} \
    ${CHECKPOINT} \
    [--target-layers ${TARGET-LAYERS}] \
    [--preview-model] \
    [--method ${METHOD}] \
    [--target-category ${TARGET-CATEGORY}] \
    [--save-path ${SAVE_PATH}] \
    [--vit-like] \
    [--num-extra-tokens ${NUM-EXTRA-TOKENS}] \
    [--aug-smooth] \
    [--eigen-smooth] \
    [--device ${DEVICE}] \
    [--cfg-options ${CFG-OPTIONS}]
```

**Description of all arguments**:

- `img`: The target picture path.
- `config`: The path of the model config file.
- `checkpoint`: The path of the checkpoint.
- `--target-layers`: The names of the network layers to inspect; one or more layers can be specified. If not set, the `norm` layer of the last `block` is used.
- `--preview-model`: Whether to print all network layers of the model.
- `--method`: The CAM visualization method; currently supports `GradCAM`, `GradCAM++`, `XGradCAM`, `EigenCAM`, `EigenGradCAM`, `LayerCAM`, case insensitive. Defaults to `GradCAM`.
- `--target-category`: The target category to inspect; if not set, the category detected by the model is used.
- `--save-path`: The path to save the visualization image; not saved by default.
- `--eigen-smooth`: Whether to use the principal component to reduce noise; off by default.
- `--vit-like`: Whether the network is a ViT-like, Transformer-based network.
- `--num-extra-tokens`: The number of extra tokens in ViT-like backbones; defaults to the backbone's `num_extra_tokens`.
- `--aug-smooth`: Whether to use test-time augmentation.
- `--device`: The computing device to use; defaults to `'cpu'`.
- `--cfg-options`: Modifications to the config file, refer to [Tutorial 1: Learn about Configs](https://mmclassification.readthedocs.io/zh_CN/latest/tutorials/config.html).

```{note}
When specifying `--target-layers`, if you don't know which layers a model has, you can add `--preview-model` on the command line to print all the network layer names.
```

**Examples(CNN)**:

Some examples of `--target-layers` in `ResNet-50`:

- `'backbone.layer4'` means the output of the fourth `ResLayer`.
- `'backbone.layer4.2'` means the output of the third `BottleNeck` block in the fourth `ResLayer`.
- `'backbone.layer4.2.conv1'` means the output of the `conv1` layer in the above `BottleNeck` block.

```{note}
For `ModuleList` or `Sequential` layers, you can directly use an index to specify the target sub-module; see the sketch after this note. For example, `backbone.layer4[-1]` is the same as `backbone.layer4.2`, since `layer4` is a `Sequential` with three sub-modules.
```
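For intuition, this simplified sketch shows how such a layer string can be resolved against a model; `resolve` here is a hypothetical helper, much less complete than the tool's actual `get_layer` in `vis_cam.py`, which handles chained indices via a regex:

```python
import torch.nn as nn

def resolve(model: nn.Module, path: str) -> nn.Module:
    """Resolve strings like 'backbone.layer4.2' or 'backbone.layers[-1].ln1'."""
    cur = model
    for part in path.split('.'):
        if part.endswith(']'):                   # e.g. 'layers[-1]'
            name, index = part[:-1].split('[')
            cur = getattr(cur, name)[int(index)]
        else:                                    # e.g. 'layer4' or '2'
            cur = getattr(cur, part)
    return cur
```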
1. Visualize `ResNet50` with different methods; the default `target-category` is the model's predicted result, using the default derived `target-layers`.

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/resnet/resnet50_8xb32_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
       --method GradCAM
       # GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM
   ```

   | Image | GradCAM | GradCAM++ | EigenGradCAM | LayerCAM |
   | ----- | ------- | --------- | ------------ | -------- |
   | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429496-628d3fb3-1f6e-41ff-aa5c-1b08c60c32a9.JPEG' height="auto" width="160" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065002-f1c86516-38b2-47ba-90c1-e00b49556c70.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065119-82581fa1-3414-4d6c-a849-804e1503c74b.jpg' height="auto" width="150"></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065096-75a6a2c1-6c57-4789-ad64-ebe5e38765f4.jpg' height="auto" width="150"></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/147065129-814d20fb-98be-4106-8c5e-420adcc85295.jpg' height="auto" width="150"></div> |

2. Activation maps of different categories for the same picture. In the `ImageNet` dataset, category 238 is 'Greater Swiss Mountain dog' and category 281 is 'tabby, tabby cat'.

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/cat-dog.png configs/resnet/resnet50_8xb32_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
       --target-layers 'backbone.layer4.2' \
       --method GradCAM \
       --target-category 238
       # --target-category 281
   ```

   | Category | Image | GradCAM | XGradCAM | LayerCAM |
   | -------- | ----- | ------- | -------- | -------- |
   | Dog | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429526-f27f4cce-89b9-4117-bfe6-55c2ca7eaba6.png' height="auto" width="165" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144433562-968a57bc-17d9-413e-810e-f91e334d648a.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144433853-319f3a8f-95f2-446d-b84f-3028daca5378.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144433937-daef5a69-fd70-428f-98a3-5e7747f4bb88.jpg' height="auto" width="150" ></div> |
   | Cat | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429526-f27f4cce-89b9-4117-bfe6-55c2ca7eaba6.png' height="auto" width="165" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144434518-867ae32a-1cb5-4dbd-b1b9-5e375e94ea48.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144434603-0a2fd9ec-c02e-4e6c-a17b-64c234808c56.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144434623-b4432cc2-c663-4b97-aed3-583d9d3743e6.jpg' height="auto" width="150" ></div> |

3. Use `--eigen-smooth` and `--aug-smooth` to get better visualization results.

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/dog.jpg \
       configs/mobilenet_v3/mobilenet-v3-large_8xb32_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth \
       --target-layers 'backbone.layer16' \
       --method LayerCAM \
       --eigen-smooth --aug-smooth
   ```

   | Image | LayerCAM | eigen-smooth | aug-smooth | eigen&aug |
   | ----- | -------- | ------------ | ---------- | --------- |
   | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557492-98ac5ce0-61f9-4da9-8ea7-396d0b6a20fa.jpg' height="auto" width="160"></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557541-a4cf7d86-7267-46f9-937c-6f657ea661b4.jpg' height="auto" width="145" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557547-2731b53e-e997-4dd2-a092-64739cc91959.jpg' height="auto" width="145" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557545-8189524a-eb92-4cce-bf6a-760cab4a8065.jpg' height="auto" width="145" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144557548-c1e3f3ec-3c96-43d4-874a-3b33cd3351c5.jpg' height="auto" width="145" ></div> |

**Examples(Transformer)**:

Some examples of `--target-layers` in Transformer-based networks:

- For Swin-Transformer: `'backbone.norm3'`
- For ViT: `'backbone.layers[-1].ln1'`

For Transformer-based networks, such as ViT, T2T-ViT and Swin-Transformer, the features are flattened. To draw the CAM, we need to specify the `--vit-like` option to restore the flattened features into square feature maps.

Besides the flattened features, some ViT-like networks also add extra tokens, e.g. the class token in ViT and T2T-ViT, and the distillation token in DeiT. In these networks, the classification is already completed after the last attention block, and the classification score depends only on these extra tokens and not on the feature map; in other words, the derivative of the classification score with respect to the feature map is zero. Therefore, we cannot use the output of the last attention block as the target layer for drawing the CAM.

In addition, to remove these extra tokens and recover the feature map, we need to know their number. Almost all Transformer-based networks in MMClassification have the `num_extra_tokens` attribute. If you want to apply this tool to a new or third-party network that doesn't define this attribute, you can specify the number manually with the `--num-extra-tokens` argument.

1. Visualize CAM for `Swin Transformer` with the default `target-layers`:

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/swin_transformer/swin-tiny_16xb64_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth \
       --vit-like
   ```

2. Visualize CAM for `Vision Transformer (ViT)`:

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/vision_transformer/vit-base-p16_ft-64xb64_in1k-384.py \
       https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth \
       --vit-like \
       --target-layers 'backbone.layers[-1].ln1'
   ```

3. Visualize CAM for `T2T-ViT`:

   ```shell
   python tools/visualizations/vis_cam.py \
       demo/bird.JPEG \
       configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py \
       https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_3rdparty_8xb64_in1k_20210928-b7c09b62.pth \
       --vit-like \
       --target-layers 'backbone.encoder[-1].ln1'
   ```

| Image | ResNet50 | ViT | Swin | T2T-ViT |
| ----- | -------- | --- | ---- | ------- |
| <div align=center><img src='https://user-images.githubusercontent.com/18586273/144429496-628d3fb3-1f6e-41ff-aa5c-1b08c60c32a9.JPEG' height="auto" width="165" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144431491-a2e19fe3-5c12-4404-b2af-a9552f5a95d9.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144436218-245a11de-6234-4852-9c08-ff5069f6a739.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144436168-01b0e565-442c-4e1e-910c-17c62cff7cd3.jpg' height="auto" width="150" ></div> | <div align=center><img src='https://user-images.githubusercontent.com/18586273/144436198-51dbfbda-c48d-48cc-ae06-1a923d19b6f6.jpg' height="auto" width="150" ></div> |

## FAQs

- None
mmcls/models/backbones/swin_transformer.py

@@ -324,6 +324,7 @@ class SwinTransformer(BaseBackbone):
         self.use_abs_pos_embed = use_abs_pos_embed
         self.auto_pad = auto_pad
         self.frozen_stages = frozen_stages
+        self.num_extra_tokens = 0

         _patch_cfg = {
             'img_size': img_size,
mmcls/models/backbones/t2t_vit.py

@@ -282,6 +282,7 @@ class T2T_ViT(BaseBackbone):
         # Class token
         self.output_cls_token = output_cls_token
         self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dims))
+        self.num_extra_tokens = 1

         # Position Embedding
         sinusoid_table = get_sinusoid_encoding(num_patches + 1, embed_dims)
tools/visualizations/vis_cam.py

@@ -0,0 +1,341 @@
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import copy
import math
import re
from pathlib import Path

import mmcv
import numpy as np
from mmcv import Config, DictAction
from mmcv.utils import to_2tuple
from torch.nn import BatchNorm1d, BatchNorm2d, GroupNorm, LayerNorm

from mmcls.apis import init_model
from mmcls.datasets.pipelines import Compose

try:
    from pytorch_grad_cam import (EigenCAM, EigenGradCAM, GradCAM,
                                  GradCAMPlusPlus, LayerCAM, XGradCAM)
    from pytorch_grad_cam.activations_and_gradients import (
        ActivationsAndGradients)
    from pytorch_grad_cam.utils.image import show_cam_on_image
except ImportError:
    raise ImportError('Please run `pip install "grad-cam>=1.3.6"` to install '
                      '3rd party package pytorch_grad_cam.')

# Set of transforms which only change the data format, not the picture itself.
FORMAT_TRANSFORMS_SET = {'ToTensor', 'Normalize', 'ImageToTensor', 'Collect'}

# Supported grad-cam type map
METHOD_MAP = {
    'gradcam': GradCAM,
    'gradcam++': GradCAMPlusPlus,
    'xgradcam': XGradCAM,
    'eigencam': EigenCAM,
    'eigengradcam': EigenGradCAM,
    'layercam': LayerCAM,
}


def parse_args():
    parser = argparse.ArgumentParser(description='Visualize CAM')
    parser.add_argument('img', help='Image file')
    parser.add_argument('config', help='Config file')
    parser.add_argument('checkpoint', help='Checkpoint file')
    parser.add_argument(
        '--target-layers',
        default=[],
        nargs='+',
        type=str,
        help='The target layers to get CAM, if not set, the tool will '
        'specify the norm layer in the last block. Backbones '
        'implemented by users are recommended to manually specify'
        ' target layers in the command statement.')
    parser.add_argument(
        '--preview-model',
        default=False,
        action='store_true',
        help='To preview all the model layers')
    parser.add_argument(
        '--method',
        default='GradCAM',
        help='Type of method to use, supports '
        f'{", ".join(list(METHOD_MAP.keys()))}.')
    parser.add_argument(
        '--target-category',
        default=None,
        type=int,
        help='The target category to get CAM, default to use the result '
        'from the given model.')
    parser.add_argument(
        '--eigen-smooth',
        default=False,
        action='store_true',
        help='Reduce noise by taking the first principal component of '
        '``cam_weights*activations``')
    parser.add_argument(
        '--aug-smooth',
        default=False,
        action='store_true',
        help='Whether to use test time augmentation, default not to use')
    parser.add_argument(
        '--save-path',
        type=Path,
        help='The path to save the visualized CAM image, default not to save.')
    parser.add_argument('--device', default='cpu', help='Device to use')
    parser.add_argument(
        '--vit-like',
        action='store_true',
        help='Whether the network is a ViT-like network.')
    parser.add_argument(
        '--num-extra-tokens',
        type=int,
        help='The number of extra tokens in ViT-like backbones. Defaults to'
        ' use num_extra_tokens of the backbone.')
    parser.add_argument(
        '--cfg-options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
        'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
        'Note that the quotation marks are necessary and that no white space '
        'is allowed.')
    args = parser.parse_args()
    if args.method.lower() not in METHOD_MAP.keys():
        raise ValueError(f'invalid CAM type {args.method},'
                         f' supports {", ".join(list(METHOD_MAP.keys()))}.')

    return args


def build_reshape_transform(model, args):
    """Build reshape_transform for `cam.activations_and_grads`, which is
    necessary for ViT-like networks."""
    # ViT-based Transformers have an additional cls token in the features.
    if not args.vit_like:

        def check_shape(tensor):
            assert len(tensor.size()) != 3, \
                (f"The input feature's shape is {tensor.size()}, and it seems "
                 'to have been flattened or from a vit-like network. '
                 "Please use `--vit-like` if it's from a vit-like network.")
            return tensor

        return check_shape

    if args.num_extra_tokens is not None:
        num_extra_tokens = args.num_extra_tokens
    elif hasattr(model.backbone, 'num_extra_tokens'):
        num_extra_tokens = model.backbone.num_extra_tokens
    else:
        num_extra_tokens = 1

    def _reshape_transform(tensor):
        """reshape_transform helper."""
        assert len(tensor.size()) == 3, \
            (f"The input feature's shape is {tensor.size()}, "
             'and the feature seems not from a vit-like network?')
        tensor = tensor[:, num_extra_tokens:, :]
        # get heat_map_height and heat_map_width; the preset input is a square
        heat_map_area = tensor.size()[1]
        height, width = to_2tuple(int(math.sqrt(heat_map_area)))
        assert height * width == heat_map_area, \
            (f"The input feature's length ({heat_map_area + num_extra_tokens})"
             f' minus num-extra-tokens ({num_extra_tokens}) is'
             f' {heat_map_area}, which is not a perfect square number. Please'
             ' check if you used the wrong num-extra-tokens.')
        result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))

        # Bring the channels to the first dimension, like in CNNs.
        result = result.transpose(2, 3).transpose(1, 2)
        return result

    return _reshape_transform


def apply_transforms(img_path, pipeline_cfg):
    """Apply the transforms pipeline and get both the formatted data and the
    image without formatting."""
    data = dict(img_info=dict(filename=img_path), img_prefix=None)

    def split_pipeline_cfg(pipeline_cfg):
        """Split the transforms into image_transforms and
        format_transforms."""
        image_transforms_cfg, format_transforms_cfg = [], []
        if pipeline_cfg[0]['type'] != 'LoadImageFromFile':
            pipeline_cfg.insert(0, dict(type='LoadImageFromFile'))
        for transform in pipeline_cfg:
            if transform['type'] in FORMAT_TRANSFORMS_SET:
                format_transforms_cfg.append(transform)
            else:
                image_transforms_cfg.append(transform)
        return image_transforms_cfg, format_transforms_cfg

    image_transforms, format_transforms = split_pipeline_cfg(pipeline_cfg)
    image_transforms = Compose(image_transforms)
    format_transforms = Compose(format_transforms)

    intermediate_data = image_transforms(data)
    inference_img = copy.deepcopy(intermediate_data['img'])
    format_data = format_transforms(intermediate_data)

    return format_data, inference_img


class MMActivationsAndGradients(ActivationsAndGradients):
    """Activations and gradients manager for mmcls models."""

    def __call__(self, x):
        self.gradients = []
        self.activations = []
        # Ask the classifier for raw logits (no softmax or post-processing)
        # so that grad-cam can backpropagate through the scores.
        return self.model(
            x, return_loss=False, softmax=False, post_process=False)


def init_cam(method, model, target_layers, use_cuda, reshape_transform):
    """Construct the CAM object once. In order to be compatible with mmcls,
    here we modify the ActivationsAndGradients object."""

    GradCAM_Class = METHOD_MAP[method.lower()]
    cam = GradCAM_Class(
        model=model, target_layers=target_layers, use_cuda=use_cuda)
    # Release the original hooks in ActivationsAndGradients to use
    # MMActivationsAndGradients.
    cam.activations_and_grads.release()
    cam.activations_and_grads = MMActivationsAndGradients(
        cam.model, cam.target_layers, reshape_transform)

    return cam


def get_layer(layer_str, model):
    """Get the model layer from the given string."""
    cur_layer = model
    layer_names = layer_str.strip().split('.')

    def get_children_by_name(model, name):
        try:
            return getattr(model, name)
        except AttributeError as e:
            raise AttributeError(
                e.args[0] +
                '. Please use `--preview-model` to check keys at first.')

    def get_children_by_eval(model, name):
        try:
            return eval(f'model{name}', {}, {'model': model})
        except (AttributeError, IndexError) as e:
            raise AttributeError(
                e.args[0] +
                '. Please use `--preview-model` to check keys at first.')

    for layer_name in layer_names:
        match_res = re.match('(?P<name>.+?)(?P<indices>(\\[.+\\])+)',
                             layer_name)
        if match_res:
            layer_name = match_res.groupdict()['name']
            indices = match_res.groupdict()['indices']
            cur_layer = get_children_by_name(cur_layer, layer_name)
            cur_layer = get_children_by_eval(cur_layer, indices)
        else:
            cur_layer = get_children_by_name(cur_layer, layer_name)

    return cur_layer


def show_cam_grad(grayscale_cam, src_img, title, out_path=None):
    """Fuse src_img and grayscale_cam, then show or save the result."""
    grayscale_cam = grayscale_cam[0, :]
    src_img = np.float32(src_img) / 255
    visualization_img = show_cam_on_image(
        src_img, grayscale_cam, use_rgb=False)

    if out_path:
        mmcv.imwrite(visualization_img, str(out_path))
    else:
        mmcv.imshow(visualization_img, win_name=title)


def get_default_target_layers(model, args):
    """Get the default target layers from the given model; here we choose a
    norm-type layer as the default target layer."""
    norm_layers = []
    for m in model.backbone.modules():
        if isinstance(m, (BatchNorm2d, LayerNorm, GroupNorm, BatchNorm1d)):
            norm_layers.append(m)
    if len(norm_layers) == 0:
        raise ValueError(
            '`--target-layers` is empty. Please use `--preview-model`'
            ' to check keys at first and then specify `target-layers`.')
    # If the model is a CNN model or a Swin model, just use the last norm
    # layer as the target layer. If the model is a ViT model, the final
    # classification is done on the class token computed in the last
    # attention block, so the output will not be affected by the 14x14
    # spatial channels in the last layer, and the gradient of the output
    # with respect to them will be 0! Here we use the third-to-last norm
    # layer, i.e. the first norm of the last block.
    if args.vit_like:
        if args.num_extra_tokens:
            num_extra_tokens = args.num_extra_tokens
        elif hasattr(model.backbone, 'num_extra_tokens'):
            num_extra_tokens = model.backbone.num_extra_tokens
        else:
            raise AttributeError('Please set num_extra_tokens in backbone'
                                 " or use '--num-extra-tokens'")

        # If a vit-like backbone's num_extra_tokens is bigger than 0, view
        # it as a VisionTransformer backbone, e.g. DeiT, T2T-ViT.
        if num_extra_tokens >= 1:
            print('Automatically choose the last norm layer before the '
                  'final attention block as target_layer.')
            return [norm_layers[-3]]
    print('Automatically choose the last norm layer as target_layer.')
    target_layers = [norm_layers[-1]]
    return target_layers


def main():
    args = parse_args()
    cfg = Config.fromfile(args.config)
    if args.cfg_options is not None:
        cfg.merge_from_dict(args.cfg_options)

    # build the model from a config file and a checkpoint file
    model = init_model(cfg, args.checkpoint, device=args.device)
    if args.preview_model:
        print(model)
        print('\n Please remove `--preview-model` to get the CAM.')
        return

    # apply transforms and prepare the data
    data, src_img = apply_transforms(args.img, cfg.data.test.pipeline)

    # build the target layers
    if args.target_layers:
        target_layers = [
            get_layer(layer, model) for layer in args.target_layers
        ]
    else:
        target_layers = get_default_target_layers(model, args)

    # init a cam grad calculator
    use_cuda = ('cuda' in args.device)
    reshape_transform = build_reshape_transform(model, args)
    cam = init_cam(args.method, model, target_layers, use_cuda,
                   reshape_transform)

    # calculate cam grads and show|save the visualization image
    grayscale_cam = cam(
        input_tensor=data['img'].unsqueeze(0),
        target_category=args.target_category,
        eigen_smooth=args.eigen_smooth,
        aug_smooth=args.aug_smooth)
    show_cam_grad(
        grayscale_cam, src_img, title=args.method, out_path=args.save_path)


if __name__ == '__main__':
    main()