{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "jU9T31gbQmvs"
},
"source": [
"# MMOCR Tutorial\n",
"\n",
"Welcome to MMOCR! This is the official colab tutorial for using MMOCR. In this tutorial, you will learn how to\n",
"\n",
"- Install MMOCR from source\n",
"- Perform inference with\n",
" - a pretrained text recognizer\n",
" - a pretrained text detector\n",
" - pretrained recognizer and detector\n",
" - pretrained Key Information Extraction (KIE) model\n",
"- Evaluate a text detection model on an acadmic dataset\n",
"- Train a text recognizer with a toy dataset\n",
"\n",
"Let's start!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sfvz1sywQ9_4"
},
"source": [
"## Install MMOCR from source\n",
"\n",
"Installing MMOCR is straightforward. We recommend users to install MMOCR from source as any local code changes on MMOCR can take effect immediately, which is needed for research & developement purpose. Refer to [documentation](https://mmocr.readthedocs.io/en/dev-1.x/get_started/install.html) for more information."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tw7u_baQpEUs"
},
"source": [
"### Install Dependencies using MIM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DwDY3puNNmhe",
"tags": [
"outputPrepend"
]
},
"outputs": [],
"source": [
"!pip install -U openmim\n",
"!mim install mmengine\n",
"!mim install 'mmcv>=2.0.0rc1'\n",
"!mim install 'mmdet>=3.0.0rc0'"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RQ0hd0HP_wje"
},
"source": [
"### Install MMOCR"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Izfq4Xm-_v88"
},
"outputs": [],
"source": [
"!git clone https://github.com/open-mmlab/mmocr.git\n",
"%cd mmocr\n",
"!pip install -v -e .\n",
"# \"-v\" increases pip's verbosity.\n",
"# \"-e\" means installing the project in editable mode,\n",
"# That is, any local modifications on the code will take effect immediately."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YCLL7zlu5Hm1"
},
"source": [
"## Inference\n",
"\n",
"MMOCR has made inference easy by providing a variety of `Inferencer`s. In this section, we will focus on the usage of `MMOCRInferencer`. However, if you want to learn more about other `Inferencer`s, you can refer to the [documentation](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/inference.html) which provides detailed descriptions."
]
},
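{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides `MMOCRInferencer`, MMOCR also ships task-specific `Inferencer`s such as `TextDetInferencer` and `TextRecInferencer`. As a minimal sketch (assuming they follow the same calling convention as `MMOCRInferencer`; see the documentation linked above for authoritative usage), a standalone recognizer can be created like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of a task-specific Inferencer. We assume it resolves\n",
"# model names and downloads weights the same way MMOCRInferencer does.\n",
"from mmocr.apis import TextRecInferencer\n",
"rec_infer = TextRecInferencer('svtr-small')\n",
"print(rec_infer('demo/demo_text_recog.jpg')['predictions'])"
]
},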
{
"cell_type": "markdown",
"metadata": {
"id": "59gHy8Y4pEUv"
},
"source": [
"### Perform Inference with a Pretrained Text Recognizer \n",
"\n",
"We now demonstrate how to inference on a demo text recognition image with a pretrained text recognizer. SVTR text recognizer is used for this demo, whose checkpoint can be found in the [official documentation](https://mmocr.readthedocs.io/en/dev-1.x/textrecog_models.html#svtr). But you don't need to download it manually -- Our Inferencer script handles these cumbersome setup steps for you! \n",
"\n",
"Run the following command and you will get the inference result from return value as well as files, which will be visualized in the end."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iQQIVH9ApEUv"
},
"outputs": [],
"source": [
"from mmocr.apis import MMOCRInferencer\n",
"infer = MMOCRInferencer(rec='svtr-small')\n",
"result = infer('demo/demo_text_recog.jpg', save_vis=True, return_vis=True)\n",
"print(result['predictions'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dvtCNl8qKkKJ"
},
"outputs": [],
"source": [
"# Visualize the return value\n",
"import matplotlib.pyplot as plt\n",
"plt.imshow(result['visualization'][0])\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0ab_YJnXpEUw"
},
"outputs": [],
"source": [
"# Visualize the saved image\n",
"import mmcv\n",
"predicted_img = mmcv.imread('results/vis/demo_text_recog.jpg')\n",
"plt.imshow(mmcv.bgr2rgb(predicted_img))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NgoH6qEcC9CL"
},
"source": [
"### Perform Inference with a Pretrained Text Detector \n",
"\n",
"Next, we perform inference with a pretrained DBNet++ text detector and visualize the bounding box results for the demo text detection image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u0YyG9y0TzL4"
},
"outputs": [],
"source": [
"from mmocr.apis import MMOCRInferencer\n",
"infer = MMOCRInferencer(det='dbnetpp')\n",
"result = infer('demo/demo_text_det.jpg', return_vis=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2-UHsqkZJFND"
},
"outputs": [],
"source": [
"# Visualize the results\n",
"import matplotlib.pyplot as plt\n",
"plt.figure(figsize=(9, 16))\n",
"plt.imshow(result['visualization'][0])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x-uRAtLa63sz"
},
"source": [
"### Perform end-to-end OCR with pretrained recognizer and detector\n",
"\n",
"We can any text detector and recognizer into a pipeline that forms a standard OCR pipeline. Now we build our own OCR pipeline with DBNet++ and SVTR and apply it to `demo/demo_text_ocr.jpg`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xu68YizP8qu6"
},
"outputs": [],
"source": [
"from mmocr.apis import MMOCRInferencer\n",
"infer = MMOCRInferencer(det='dbnetpp', rec='svtr-small')\n",
"result = infer('demo/demo_text_ocr.jpg', return_vis=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2AZqwCt09XqR"
},
"outputs": [],
"source": [
"# Visualize the results\n",
"import matplotlib.pyplot as plt\n",
"plt.figure(figsize=(9, 16))\n",
"plt.imshow(result['visualization'][0])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WQ9zzYMa9p9Y"
},
"source": [
"### Combine OCR with Downstream Tasks\n",
"\n",
"MMOCR also supports downstream tasks of OCR, such as key information extraction (KIE). We can even add a KIE model, SDMG-R, to the pipeline applied to `demo/demo_kie.jpeg` and visualize its prediction.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2KPRTdHVAGfF"
},
"outputs": [],
"source": [
"from mmocr.apis import MMOCRInferencer\n",
"infer = MMOCRInferencer(det='dbnetpp', rec='svtr-small', kie='SDMGR')\n",
"result = infer('demo/demo_kie.jpeg', save_vis=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "96hqfaovAGhl"
},
"outputs": [],
"source": [
"# Visualize the results\n",
"import mmcv\n",
"import matplotlib.pyplot as plt\n",
"predicted_img = mmcv.imread('results/vis/demo_kie.jpg')\n",
"plt.figure(figsize=(18, 32))\n",
"plt.imshow(mmcv.bgr2rgb(predicted_img))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nYon41X7RTOT"
},
"source": [
"## Training SAR on a Toy Dataset\n",
"\n",
"We now demonstrate how to train a recognizer on a provided dataset in a Python interpreter. Another common practice is to train a model from CLI (command line interface), as illustrated [here](https://mmocr.readthedocs.io/en/dev-1.x/get_started/quick_run.html#training).\n",
"\n",
"Since training a full academic dataset is time consuming (usually takes about several hours or even days), we will train on the toy dataset for the SAR text recognition model and visualize the predictions. Text detection and other downstream tasks such as KIE follow similar procedures.\n",
"\n",
"Training a model usually consists of the following steps:\n",
"1. Convert the dataset into [formats supported by MMOCR](https://mmocr.readthedocs.io/en/dev-1.x/basic_concepts/datasets.html). It should never be a concern if the dataset is obtained from Dataset Preparer. Otherwise, you will need to manually download and prepare the dataset following the [guide](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/data_prepare/recog.html), or even have to write a custom conversion script if your dataset is not on the list.\n",
"2. Modify the config for training. \n",
"3. Train the model. \n",
"\n",
"In this example, we will use an off-the-shelf toy dataset to train SAR, and the first step will be skipped. The full demonstration of the first step can be found at the next section: Evaluating SAR on academic testsets."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FElJSp1vpEUz"
},
"source": [
"### Visualize the Toy Dataset\n",
"\n",
"We first get a sense of what the toy dataset looks like by visualizing one of the images and labels. The toy dataset consisits of ten images as well as annotation files in both json and lmdb format, and we only use json annotations in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hZfd2pnqN5-Q"
},
"outputs": [],
"source": [
"import mmcv\n",
"import matplotlib.pyplot as plt \n",
"\n",
"img = mmcv.imread('tests/data/rec_toy_dataset/imgs/1058891.jpg')\n",
"plt.imshow(mmcv.bgr2rgb(img))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "F5M_FVVRN6Fw"
},
"outputs": [],
"source": [
"# Inspect the labels of the annootation file\n",
"!cat tests/data/rec_toy_dataset/labels.json"
]
},
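{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the annotation structure easier to read than the raw `cat` output, we can also load it with Python. This sketch assumes the standard MMOCR 1.x annotation layout with `metainfo` and `data_list` keys; if the file uses a different layout, the prints below simply show `None`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Load the annotation file and peek at its structure\n",
"with open('tests/data/rec_toy_dataset/labels.json') as f:\n",
"    labels = json.load(f)\n",
"# 'metainfo' and 'data_list' are the assumed top-level keys of the\n",
"# standard MMOCR 1.x annotation format\n",
"print(labels.get('metainfo'))\n",
"print(labels.get('data_list', [None])[0])"
]
},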
{
"cell_type": "markdown",
"metadata": {
"id": "i-GrV0xSkAc3"
},
"source": [
"### Load Config\n",
"\n",
"First we will load the toy config for SAR.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uFFH3yUgPEFj"
},
"outputs": [],
"source": [
"from mmengine import Config\n",
"# Load the config\n",
"cfg = Config.fromfile('configs/textrecog/sar/sar_resnet31_parallel-decoder_5e_toy.py')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CW2AYtBUA2yb"
},
"source": [
"In order to train SAR to its best state on toy dataset, we need to modify some hyperparameters in the config to accomodate some of the settings of colab.\n",
"For more explanation about the config and its fields, please refer to [documentation](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/config.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "67OJ6oAvN6NA"
},
"outputs": [],
"source": [
"# Specify the work dir\n",
"cfg.work_dir = 'work_dirs/sar_resnet31_parallel-decoder_5e_toy/'\n",
"# Configure the batch size, learning rate, and maximum epochs\n",
"cfg.optim_wrapper.optimizer.lr = 1e-3\n",
"cfg.train_dataloader.batch_size = 5\n",
"cfg.train_cfg.max_epochs = 100\n",
"# Save checkpoint every 10 epochs\n",
"cfg.default_hooks.checkpoint.interval = 10\n",
"\n",
"# We don't need any learning rate scheduler for a toy dataset\n",
"# thus clear parameter scheduler here\n",
"cfg.param_scheduler = None\n",
"\n",
"# Set seed thus the results are more reproducible\n",
"cfg.randomness = dict(seed=0)\n",
"\n",
"# We can initialize the logger for training and have a look\n",
"# at the final config used for training\n",
"print(f'Config:\\n{cfg.pretty_text}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TZj5vyqEmulE"
},
"source": [
"### Train the SAR Text Recognizer \n",
"Let's train the SAR text recognizer on the toy dataset for 10 epochs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mDVkK6yjpEU1"
},
"outputs": [],
"source": [
"from mmengine.runner import Runner\n",
"import time\n",
"\n",
"# Optionally, give visualizer a unique name to avoid dupliate instance being\n",
"# created in multiple runs\n",
"cfg.visualizer.name = f'{time.localtime()}'\n",
"\n",
"runner = Runner.from_cfg(cfg)\n",
"runner.train()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sklydRNXnfJk"
},
"source": [
"### Perform inference and Visualize the Predictions\n",
"\n",
"We can test the model through [Infernecer](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/inference.html), then print out and visualize its return values. Inferencer can also accepts many more types of inputs, just feel free to play around with it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-HbXY7uUpEU1"
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from mmocr.apis import TextRecInferencer\n",
"\n",
"img = 'tests/data/rec_toy_dataset/imgs/1036169.jpg'\n",
"checkpoint = \"work_dirs/sar_resnet31_parallel-decoder_5e_toy/epoch_100.pth\"\n",
"cfg_file = \"configs/textrecog/sar/sar_resnet31_parallel-decoder_5e_toy.py\"\n",
"\n",
"infer = TextRecInferencer(cfg_file, checkpoint)\n",
"result = infer(img, return_vis=True)\n",
"\n",
"print(f'result: {result[\"predictions\"]}' )\n",
"\n",
"plt.figure(figsize=(9, 16))\n",
"plt.imshow(result['visualization'][0])\n",
"plt.show()"
]
},
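{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of those \"many more types of inputs\", we can pass a list of image paths to run batch inference (list inputs are assumed to be supported as described in the Inferencer documentation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Batch inference with a list of image paths, reusing the Inferencer\n",
"# created in the previous cell\n",
"imgs = [\n",
"    'tests/data/rec_toy_dataset/imgs/1036169.jpg',\n",
"    'tests/data/rec_toy_dataset/imgs/1058891.jpg',\n",
"]\n",
"results = infer(imgs)\n",
"for pred in results['predictions']:\n",
"    print(pred)"
]
},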
{
"cell_type": "markdown",
"metadata": {
"id": "PTWMzvd3E_h8"
},
"source": [
"## Evaluating SAR\n",
"\n",
"This section provides guidance on how to evaluate a model using with pretrained weights in a Python interpreter. Apart from such a practice, another common practice is to test a model from CLI (command line interface), as illustrated [here](https://mmocr.readthedocs.io/en/dev-1.x/get_started/quick_run.html#testing).\n",
"\n",
"Typically, the evaluation process involves several steps:\n",
"\n",
"1. Convert the dataset into [formats supported by MMOCR](https://mmocr.readthedocs.io/en/dev-1.x/basic_concepts/datasets.html). It should not be a concern if the dataset is obtained from [Dataset Preparer](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/data_prepare/dataset_preparer.html), which can download, extract and convert the dataset into a MMOCR-ready form with a single line of command. Otherwise, you will need to manually download and prepare the dataset following the [guide](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/data_prepare/det.html), or even have to write a custom conversion script if your dataset is not on the list.\n",
"2. Modify the config for testing. \n",
"3. Test the model. \n",
"\n",
"Now we will demonstrate how to test a model on different datasets.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xxehFhP7N_xf"
},
"source": [
"### Toy Dataset\n",
"\n",
"With the checkpoint we obtained from the last section, we can evaluate it on the toy dataset again. Some more explanataions about the evaulation metrics are available [here](https://mmocr.readthedocs.io/en/dev-1.x/basic_concepts/evaluation.html). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PHneq5LxRT6z"
},
"outputs": [],
"source": [
"from mmengine.runner import Runner\n",
"import time\n",
"\n",
"# The location of pretrained weight\n",
"cfg['load_from'] = 'work_dirs/sar_resnet31_parallel-decoder_5e_toy/epoch_100.pth'\n",
"\n",
"# Optionally, give visualizer a unique name to avoid dupliate instance being\n",
"# created in multiple runs\n",
"cfg.visualizer.name = f'{time.localtime()}'\n",
"\n",
"runner = Runner.from_cfg(cfg)\n",
"runner.test()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TXmgGRcjOba2"
},
"source": [
"It's also possible to evaluate with a stronger and more generalized pretrained weight, which were trained on larger datasets and achieved quite competitve acadmical performance, though it may not defeat the previous checkpoint overfitted to the toy dataset. ([readme](https://mmocr.readthedocs.io/en/dev-1.x/textrecog_models.html#sar))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Uj8qFQjYRCxK"
},
"outputs": [],
"source": [
"# The location of pretrained weight\n",
"cfg['load_from'] = 'https://download.openmmlab.com/mmocr/textrecog/sar/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real_20220915_171910-04eb4e75.pth'\n",
"cfg.visualizer.name = f'{time.localtime()}'\n",
"runner = Runner.from_cfg(cfg)\n",
"runner.test()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WctaeVsYR4W_"
},
"source": [
"### SVTP Dataset\n",
"\n",
"SVTP dataset is one of the six commonly used academic test sets that systematically reflects a text recognizer's performance. Now we will evaluate SAR on this dataset, and we are going to use [Dataset Preparer](https://mmocr.readthedocs.io/en/dev-1.x/user_guides/data_prepare/dataset_preparer.html) to get it prepared first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3VEW3PQrFZ0g"
},
"outputs": [],
"source": [
"!python tools/dataset_converters/prepare_dataset.py svtp --task textrecog"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "taW5YraiBVNx"
},
"source": [
"SVTP is now available in `data/svtp`, and the dataset config is available at `configs/textrecog/_base_/datasets/svtp.py`. Now we first point the `test_dataloader` to SVTP, then perform testing with the overfitted checkpoint. As this checkpoint is just overfitted to such a small dataset, it's not surprising that it performs well on the toy dataset and bad on SVTP."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "p0MHNwybo0iI"
},
"outputs": [],
"source": [
"from mmengine import Config\n",
"\n",
"svtp_cfg = Config.fromfile('configs/textrecog/_base_/datasets/svtp.py')\n",
"svtp_cfg.svtp_textrecog_test.pipeline = cfg.test_pipeline\n",
"cfg.test_dataloader.dataset = svtp_cfg.svtp_textrecog_test\n",
"\n",
"# The location of pretrained weight\n",
"cfg['load_from'] = 'work_dirs/sar_resnet31_parallel-decoder_5e_toy/epoch_100.pth'\n",
"\n",
"# Optionally, give visualizer a unique name to avoid dupliate instance being\n",
"# created in multiple runs\n",
"cfg.visualizer.name = f'{time.localtime()}'\n",
"\n",
"runner = Runner.from_cfg(cfg)\n",
"runner.test()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cLCHDGCbD3HG"
},
"source": [
"Let's evaluate the pretrained one for comparision."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7S50EImkCvc7"
},
"outputs": [],
"source": [
"# The location of pretrained weight\n",
"cfg['load_from'] = 'https://download.openmmlab.com/mmocr/textrecog/sar/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real_20220915_171910-04eb4e75.pth'\n",
"cfg.visualizer.name = f'{time.localtime()}'\n",
"runner = Runner.from_cfg(cfg)\n",
"runner.test()"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"provenance": [],
"toc_visible": true
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}