BLIP

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
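The caption bootstrapping described in the abstract can be illustrated with a toy sketch. Everything here is hypothetical: `bootstrap_captions`, the lookup-table "captioner", and the keyword "filter" are stand-ins for illustration, not the BLIP models or any mmpretrain API. The sketch also assumes the filter is applied to both the original web caption and the synthetic caption.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image id, caption text)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_clean: Callable[[str, str], bool],
) -> List[Pair]:
    """Return the image-text pairs that survive filtering."""
    kept: List[Pair] = []
    for image_id, web_caption in web_pairs:
        # The captioner proposes one synthetic caption per image.
        synthetic_caption = captioner(image_id)
        # The filter judges the web caption and the synthetic caption alike.
        for caption in (web_caption, synthetic_caption):
            if is_clean(image_id, caption):
                kept.append((image_id, caption))
    return kept


# Toy stand-ins (hypothetical): a lookup-table captioner and a keyword filter.
toy_captions = {"img1": "a dog on grass", "img2": "a red car"}
toy_keywords = {"img1": "dog", "img2": "car"}

pairs = bootstrap_captions(
    web_pairs=[("img1", "buy cheap shoes online"), ("img2", "my car last summer")],
    captioner=lambda image_id: toy_captions[image_id],
    is_clean=lambda image_id, caption: toy_keywords[image_id] in caption,
)
print(pairs)
```

In this toy run the unrelated web caption for `img1` is dropped, while its synthetic caption and both captions for `img2` are kept.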

How to use it?

Use the model

from mmpretrain import inference_model

result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
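
The result above is a dict with a `pred_caption` key. As a minimal sketch, captions for several images could be collected as shown below. `caption_images` and `stub_infer` are hypothetical helpers introduced only for illustration; the inference function is passed in so the sketch runs without downloading a checkpoint.

```python
from typing import Callable, Dict, Iterable


def caption_images(
    image_paths: Iterable[str],
    infer: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Map each image path to the 'pred_caption' string returned by `infer`."""
    return {path: infer(path)["pred_caption"] for path in image_paths}


# With mmpretrain installed, `infer` could wrap the call shown above:
#   from mmpretrain import inference_model
#   infer = lambda path: inference_model('blip-base_3rdparty_caption', path)
# A stub (hypothetical) keeps this sketch runnable on its own.
def stub_infer(path: str) -> Dict[str, str]:
    return {"pred_caption": f"caption for {path}"}


captions = caption_images(["demo/cat-dog.png"], stub_infer)
print(captions)
```

Wrapping `inference_model` this way may rebuild the model on every call, so for large batches it is worth checking the mmpretrain docs for a reusable inferencer.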

Test Command

Prepare your dataset according to the docs.

Test:

python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
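
The config name in the command above encodes the batch setting. Under the usual OpenMMLab naming convention (an assumption, since this README does not spell it out), `8xb32` means 8 GPUs with 32 samples per GPU. The hypothetical helper below extracts those numbers from a config file name.

```python
import re
from typing import Optional, Tuple


def parse_batch_setting(config_name: str) -> Optional[Tuple[int, int, int]]:
    """Return (num_gpus, samples_per_gpu, total_batch_size), or None if absent."""
    # Looks for a segment such as "_8xb32_" between underscores.
    match = re.search(r"_(\d+)xb(\d+)_", config_name)
    if match is None:
        return None
    num_gpus, samples_per_gpu = int(match.group(1)), int(match.group(2))
    return num_gpus, samples_per_gpu, num_gpus * samples_per_gpu


print(parse_batch_setting("blip-base_8xb32_caption.py"))  # (8, 32, 256)
print(parse_batch_setting("blip-base_8xb16_refcoco.py"))  # (8, 16, 128)
```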

Models and results

Image Caption on COCO

Model	Params (M)	BLEU-4	CIDEr	Config	Download
`blip-base_3rdparty_caption`*	223.97	40.12	132.82	config	model

Image Caption on NoCaps

Model	Params (M)	SPICE	CIDEr	Config	Download
`blip-base_3rdparty_caption`*	223.97	14.69	109.12	config	model

Image Caption on Flickr30k

Model	Params (M)	SPICE	CIDEr	Config	Download
`blip-base_3rdparty_caption`*	223.97	15.58	68.89	config	model

Visual Grounding on RefCOCO

Model	Params (M)	Accuracy (testA)	Accuracy (testB)	Config	Download
`blip-base_8xb16_refcoco`	498.49	86.14	77.33	config	model | log
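
Visual grounding accuracy is commonly computed by counting a predicted box as correct when its IoU with the ground-truth box reaches a threshold. The sketch below assumes a threshold of 0.5 and `(x1, y1, x2, y2)` boxes; the function names are hypothetical and this is not necessarily the exact metric code behind the numbers above.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], gts: Sequence[Box], iou_threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(grounding_accuracy(preds, gts))  # 50.0
```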

Visual Question Answering on VQAv2

Model	Params (M)	Accuracy	Config	Download
`blip-base_3rdparty_vqa`*	361.48	78.20	config	model
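
For reference, the accuracy commonly used for VQAv2 gives a prediction full credit when at least three human annotators gave the same answer, averaged over leave-one-out subsets of the annotations. The sketch below assumes that definition and skips answer normalization; `vqa_accuracy` is a hypothetical helper, not the evaluation code used for the table.

```python
from typing import Sequence


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Leave-one-out VQA accuracy of one prediction, in the range [0, 1]."""
    answers = list(human_answers)
    scores = []
    for i in range(len(answers)):
        # Compare against all annotations except the i-th one.
        others = answers[:i] + answers[i + 1:]
        matches = sum(answer == prediction for answer in others)
        scores.append(min(1.0, matches / 3))
    return sum(scores) / len(scores)


answers = ["cat"] * 2 + ["dog"] * 8
print(vqa_accuracy("dog", answers))
print(vqa_accuracy("cat", answers))
```

With ten annotations, an answer given by only two annotators receives partial credit, while an answer given by eight receives full credit.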

Visual Question Answering on OK-VQA

Model	Params (M)	Accuracy	Config	Download
`blip-base_3rdparty_vqa`*	361.48	40.59#	config	model

Visual Question Answering on OCR-VQA

Model	Params (M)	Accuracy	Config	Download
`blip-base_3rdparty_vqa`*	361.48	28.30#	config	model

Image-To-Text Retrieval on COCO

Model	Params (M)	Recall@1	Recall@5	Config	Download
`blip-base_3rdparty_retrieval`*	447.49	82.52	95.34	config	model

Text-To-Image Retrieval on COCO

Model	Params (M)	Recall@1	Recall@5	Config	Download
`blip-base_3rdparty_retrieval`*	447.49	64.82	86.28	config	model

Image-To-Text Retrieval on Flickr30k

Model	Params (M)	Recall@1	Recall@5	Config	Download
`blip-base_3rdparty_retrieval`*	447.49	95.10#	99.60#	config	model

Text-To-Image Retrieval on Flickr30k

Model	Params (M)	Recall@1	Recall@5	Config	Download
`blip-base_3rdparty_retrieval`*	447.49	85.26#	96.58#	config	model
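
The Recall@1 and Recall@5 columns in the retrieval tables report how often a correct match appears among the top K retrieved items. A minimal sketch, assuming a query counts as a hit when any of its ground-truth items is in the top K (`recall_at_k` and the toy data are hypothetical):

```python
from typing import Sequence, Set


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    positives: Sequence[Set[str]],
    k: int,
) -> float:
    """Percentage of queries with at least one ground-truth item in the top k."""
    hits = sum(
        1 for ranked, gt in zip(rankings, positives) if gt & set(ranked[:k])
    )
    return 100.0 * hits / len(rankings)


# Toy data: three queries, each with a ranked candidate list and its ground truth.
rankings = [["t3", "t1", "t9"], ["t7", "t2", "t4"], ["t5", "t6", "t8"]]
positives = [{"t1", "t2"}, {"t4"}, {"t5"}]

for k in (1, 2, 3):
    print(k, round(recall_at_k(rankings, positives, k), 2))
```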

NLVR on NLVR2

Model	Params (M)	Top-1 (%)	Config	Download
`blip-base_3rdparty_nlvr`*	259.37	82.33	config	model

Models with * are converted from the official repo. The config files of these models are only for inference. We haven't reproduced the training results.

Results marked with # are zero-shot: the corresponding model has not been fine-tuned on that dataset.

Citation

@inproceedings{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      booktitle={ICML},
}