ABINet
Abstract
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicitly language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous suggests to block gradient flow between vision and language models to enforce explicitly language modeling. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for language model which can effectively alleviate the impact of noise input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet has superiority on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Besides, the ABINet trained with ensemble self-training shows promising improvement in realizing human-level recognition.
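The iterative correction idea in the abstract can be illustrated with a small toy. The sketch below is not the ABINet implementation: the real BCN is a transformer, whereas here a lexicon lookup stands in for the language model, and all names (`LEXICON`, `cloze_predict`, `fuse`, `iterative_correction`) are hypothetical. It only shows the control flow: each character is predicted from the context on both sides (a cloze task), the result is fused with the vision prediction, and the fused prediction is fed back to the language model.

```python
from typing import Dict, List

Dist = Dict[str, float]  # character -> probability at one position

# Hypothetical toy vocabulary standing in for a learned language model.
LEXICON = ["house", "horse", "mouse"]


def normalize(dist: Dist) -> Dist:
    total = sum(dist.values())
    return {ch: p / total for ch, p in dist.items()}


def cloze_predict(probs: List[Dist], lexicon: List[str]) -> List[Dist]:
    """Predict each position from all OTHER positions (bidirectional cloze)."""
    out = []
    for i in range(len(probs)):
        dist: Dist = {}
        for word in lexicon:
            if len(word) != len(probs):
                continue
            # Score the word using left and right context, masking position i.
            score = 1.0
            for j, ch in enumerate(word):
                if j != i:
                    score *= probs[j].get(ch, 1e-6)
            dist[word[i]] = dist.get(word[i], 0.0) + score
        # Fall back to the input if no lexicon word has a matching length.
        out.append(normalize(dist) if dist else dict(probs[i]))
    return out


def fuse(vision: List[Dist], language: List[Dist], weight: float = 0.5) -> List[Dist]:
    """Blend vision and language predictions position by position."""
    fused = []
    for v, l in zip(vision, language):
        chars = set(v) | set(l)
        fused.append(normalize(
            {ch: weight * v.get(ch, 0.0) + (1 - weight) * l.get(ch, 0.0) for ch in chars}
        ))
    return fused


def decode(probs: List[Dist]) -> str:
    return "".join(max(d, key=d.get) for d in probs)


def iterative_correction(vision: List[Dist], lexicon: List[str], num_iters: int = 3) -> List[Dist]:
    current = vision
    for _ in range(num_iters):
        language = cloze_predict(current, lexicon)  # LM sees the latest estimate
        current = fuse(vision, language)            # vision features stay fixed
    return current


# A noisy vision prediction: the third character is misread as "n".
vision = [
    {"h": 0.9, "m": 0.1},
    {"o": 1.0},
    {"n": 0.6, "u": 0.4},
    {"s": 1.0},
    {"e": 1.0},
]

if __name__ == "__main__":
    print(decode(vision))                                  # vision only
    print(decode(iterative_correction(vision, LEXICON)))   # after correction
```

Because the language model is re-run on the fused output rather than on the raw vision prediction, each pass starts from a less noisy input, which is the motivation the abstract gives for iterating.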

Dataset
Train Dataset
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
SynthText | 7239272 | 1 | alphanumeric |
Test Dataset
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models
methods | pretrained | IIIT5K (regular) | SVT (regular) | IC13-1015 (regular) | IC15-2077 (irregular) | SVTP (irregular) | CT80 (irregular) | download |
---|---|---|---|---|---|---|---|---|
ABINet-Vision | - | 0.9523 | 0.9196 | 0.9369 | 0.7896 | 0.8403 | 0.8437 | model \| log |
ABINet-Vision-TTA | - | 0.9523 | 0.9196 | 0.9360 | 0.8175 | 0.8450 | 0.8542 | |
ABINet | Pretrained | 0.9603 | 0.9397 | 0.9557 | 0.8146 | 0.8868 | 0.8785 | model \| log |
ABINet-TTA | Pretrained | 0.9597 | 0.9397 | 0.9527 | 0.8426 | 0.8930 | 0.8854 | |
1. ABINet's encoder can be run and trained without the decoder and fuser. It is designed to recognize text as a stand-alone model, so it can work as an independent text recognizer. We release it as ABINet-Vision.
2. Facts about the pretrained model: MMOCR does not yet have a systematic pipeline for pretraining the language model (LM), so the LM weights are converted from [the official pretrained model](https://github.com/FangShancheng/ABINet). The weights of ABINet-Vision are used directly as the vision model of ABINet.
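The results above are reported per benchmark. When a single summary figure is wanted, one common choice is an average weighted by the number of test instances. The following is a minimal sketch under that assumption; the helper name `weighted_accuracy` and the dictionary layout are illustrative and not part of MMOCR. The counts are copied from the Test Dataset table and the accuracies from the results table (non-TTA rows).

```python
from typing import Dict

# Instance counts copied from the Test Dataset table.
TEST_SET_SIZES: Dict[str, int] = {
    "IIIT5K": 3000,
    "SVT": 647,
    "IC13": 1015,
    "IC15": 2077,
    "SVTP": 645,
    "CT80": 288,
}

# Per-benchmark accuracy copied from the results table (non-TTA rows).
ACCURACY: Dict[str, Dict[str, float]] = {
    "ABINet-Vision": {
        "IIIT5K": 0.9523, "SVT": 0.9196, "IC13": 0.9369,
        "IC15": 0.7896, "SVTP": 0.8403, "CT80": 0.8437,
    },
    "ABINet": {
        "IIIT5K": 0.9603, "SVT": 0.9397, "IC13": 0.9557,
        "IC15": 0.8146, "SVTP": 0.8868, "CT80": 0.8785,
    },
}


def weighted_accuracy(per_set: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Average accuracy over benchmarks, weighted by instance count."""
    total = sum(sizes[name] for name in per_set)
    correct = sum(per_set[name] * sizes[name] for name in per_set)
    return correct / total


if __name__ == "__main__":
    for method, per_set in ACCURACY.items():
        print(f"{method}: {weighted_accuracy(per_set, TEST_SET_SIZES):.4f}")
```

An unweighted mean over the six benchmarks would give the small sets (for example CT80 with 288 images) the same influence as IIIT5K with 3000, which is why the weighted form is used in this sketch.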
We also provide ABINet trained on Union14M
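- Evaluated on six common benchmarks

methods | pretrained | IIIT5K (regular) | SVT (regular) | IC13-1015 (regular) | IC15-2077 (irregular) | SVTP (irregular) | CT80 (irregular) | download |
---|---|---|---|---|---|---|---|---|
ABINet-Vision | - | 0.9730 | 0.9645 | 0.9552 | 0.8536 | 0.8977 | 0.9479 | model \| - |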
- Evaluated on Union14M-Benchmark

Methods | Curve (unsolved) | Multi-Oriented (unsolved) | Artistic (unsolved) | Contextless (unsolved) | Salient (additional) | Multi-Words (additional) | Incomplete (additional) | General | download |
---|---|---|---|---|---|---|---|---|---|
ABINet-Vision | 0.750 | 0.615 | 0.653 | 0.711 | 0.729 | 0.591 | 0.026 | 0.794 | model |
Citation
@inproceedings{fang2021read,
title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2021}
}