WenmuZhou fcafba64c0 add en dataset, test=document_fix

2022-04-26 22:30:42 +08:00

OCR datasets

Here is a list of public datasets commonly used in OCR. The list is updated continuously, and contributions of new datasets are welcome.

| Dataset | Image download link | PPOCR format annotation download link |
| --- | --- | --- |
| ICDAR 2015 | https://rrc.cvc.uab.es/?ch=4&com=downloads | train / test |
| ctw1500 | https://paddleocr.bj.bcebos.com/dataset/ctw1500.zip | Included in the downloaded image zip |
| total text | https://paddleocr.bj.bcebos.com/dataset/total_text.tar | Included in the downloaded image zip |
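
For readers unfamiliar with the PPOCR annotation format referenced in the table, the sketch below parses one line of a text detection label file. It assumes each line holds an image path, a tab, and a JSON list of boxes with `transcription` and `points` fields. The function name `parse_det_label_line` and the sample line are hypothetical and only serve as illustration.

```python
import json
from typing import Dict, List, Tuple


def parse_det_label_line(line: str) -> Tuple[str, List[Dict]]:
    """Split one annotation line into (image_path, list of text boxes)."""
    # Assumed layout: "<image path>\t<JSON list>", one image per line.
    image_path, boxes_json = line.rstrip("\n").split("\t", 1)
    boxes = json.loads(boxes_json)
    for box in boxes:
        # Each box is assumed to carry a text string and a polygon of [x, y] points.
        if "transcription" not in box or "points" not in box:
            raise ValueError(f"malformed box in {image_path}: {box}")
    return image_path, boxes


# Hypothetical sample line, made up for illustration.
sample = (
    'ch4_test_images/img_61.jpg\t'
    '[{"transcription": "MASA", '
    '"points": [[310, 104], [416, 141], [418, 216], [312, 179]]}]'
)
path, boxes = parse_det_label_line(sample)
print(path, len(boxes), boxes[0]["transcription"])
```

Running the parser over every line of a downloaded label file is a cheap way to catch malformed lines before training starts.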

| Dataset | Image download link | PPOCR format annotation download link |
| --- | --- | --- |
| en benchmark (MJ, SJ, IIIT, SVT, IC03, IC13, IC15, SVTP, and CUTE) | DTRB | LMDB format, which can be loaded directly with lmdb_dataset.py |
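
To inspect such an LMDB archive outside of training, a minimal sketch follows. It assumes the key layout used by the DTRB dataset-creation script: a `num-samples` entry plus 1-based `image-%09d` and `label-%09d` entries. That layout is an assumption here, not something this page states. The names `iter_samples` and `DictTxn` are hypothetical. The reader accepts any object with a `get(key)` method, so the example runs without the `lmdb` package.

```python
from typing import Iterator, Optional, Protocol, Tuple


class ReadTxn(Protocol):
    """Anything with get(key) -> bytes or None, e.g. an lmdb read transaction."""

    def get(self, key: bytes) -> Optional[bytes]: ...


def iter_samples(txn: ReadTxn) -> Iterator[Tuple[bytes, str]]:
    """Yield (encoded image bytes, label) pairs in index order."""
    raw_count = txn.get(b"num-samples")
    if raw_count is None:
        raise KeyError("num-samples key not found")
    for index in range(1, int(raw_count) + 1):  # keys are assumed to be 1-based
        image = txn.get(b"image-%09d" % index)
        label = txn.get(b"label-%09d" % index)
        if image is None or label is None:
            continue  # skip incomplete records
        yield image, label.decode("utf-8")


class DictTxn:
    """Hypothetical in-memory stand-in for a read transaction."""

    def __init__(self, data):
        self._data = data

    def get(self, key):
        return self._data.get(key)


fake_txn = DictTxn({
    b"num-samples": b"2",
    b"image-000000001": b"<jpeg bytes>",
    b"label-000000001": b"hello",
    b"image-000000002": b"<jpeg bytes>",
    b"label-000000002": b"world",
})
labels = [label for _, label in iter_samples(fake_txn)]
print(labels)
```

With the `lmdb` package installed, a read transaction obtained from `lmdb.open(path, readonly=True, lock=False).begin()` should be usable in place of `DictTxn`.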