Compare commits

...

62 Commits

Author SHA1 Message Date
Ren Tianhe 856dde20ae
Grounded SAM 2 Release 2024-08-12 16:52:02 +08:00
Piotr Skalski 5a890bd867
Merge pull request #342 from ethanlee928/main
fix Supervision depreciation of BoxAnnotator
2024-07-24 08:59:41 +02:00
Piotr Skalski e49e881edd
Merge branch 'main' into main 2024-07-24 08:58:19 +02:00
ethanlee928 8b6a55f612 replaced BoundingBoxAnnotator with BoxAnnotator, updated Supervision version 2024-07-23 23:19:52 +08:00
Piotr Skalski e27a646ca0
Update requirements.txt
`supervision==0.22.0` is deprecating `BoxAnnotator`. I'm freezing the `supervision` version to prevent any problems.
2024-07-12 12:27:24 +02:00
ethanlee928 d75c95daf6 fix Supervision depreciation of BoxAnnotator 2024-06-29 01:10:48 +08:00
Ren Tianhe df5b48a3ef
Update README.md 2024-05-23 20:10:37 +08:00
Ren Tianhe 4330960fa7
Grounding DINO 1.5 Release 2024-05-18 13:36:18 +08:00
JunX 16e0ccdb7d
Update gradio_app.py (#318) 2024-04-22 16:01:29 +08:00
Ikko Eltociear Ashimine 3a2b344737
Update README.md (#322)
Performancce -> Performance
2024-04-14 13:57:39 +08:00
Rohan Manzoor c023468faf
Added Dockerfile along with a file to test Docker (#307) 2024-03-11 16:41:39 +08:00
ASHWIN UNNIKRISHNAN d13643262e
Update inference.py (#298) 2024-02-23 15:10:00 +08:00
Mehmet Deniz Birlikci 2b62f419c2
Update setup.py (#269) 2023-12-31 09:22:45 +08:00
Hardik Dava 27024e42da
Update requirements.txt (#265) 2023-12-19 23:33:25 +08:00
Songming Liu 16e6b4bfcf
Fix an incorrect link in README (#254) 2023-11-25 22:07:07 +08:00
sdy623 03198a2a79
Add the environment.yaml for Anaconda3 (#229) 2023-11-13 13:53:24 +08:00
Kazuto Murase fbb2532bb0
eval empty token_spans properly (#191) 2023-11-13 13:53:12 +08:00
Shiyu eeba084341
[fix] replace ema_model with model in demo/test_ap_on_coco (#242) 2023-11-13 13:52:37 +08:00
jishnujp-vp 60d796825e
decoupled image processing from the main flow (#160) 2023-07-22 22:08:59 -07:00
Tony Wang 5bb6543346
Readme: Add more Installation details (#177)
* test functionality

* add more steps for installation, so the CUDA_HOME can be set correctly
2023-07-22 22:08:41 -07:00
Shilong Liu b520c15790
Update README.md
add semantic sam
2023-07-18 14:20:51 -07:00
Ren Tianhe 6c27bc76b9
Update README.md 2023-06-29 17:03:15 +08:00
SlongLiu c4c2d69fb4 fix readme for phrase grounding mode 2023-06-29 14:14:28 +08:00
SlongLiu 2452fa38d5 Merge branch 'main' of https://github.com/IDEA-Research/GroundingDINO into main 2023-06-29 14:11:54 +08:00
SlongLiu a0cc07e12f support phrase grounding mode 2023-06-29 14:11:35 +08:00
Ren Tianhe 4605649b77
Update README.md 2023-06-28 00:18:24 +08:00
Ren Tianhe beeb4c29cb
Update README.md 2023-06-20 11:56:10 +08:00
Mohamad Al Mdfaa 9389fa492b
fix: improve phrases2classes implementation (#143)
This commit improves the phrases2classes implementation by using a regular expression to match sub-phrases in the phrases list. This makes the implementation more accurate and efficient.
2023-06-17 02:36:16 -07:00
Shilong Liu 16292e162d
support coco evaluation (#149) 2023-06-17 02:31:07 -07:00
Ren Tianhe 4e6f23d35c
Add logo for Grounding-DINO (#144) 2023-06-13 22:00:19 -07:00
Ren Tianhe 6225f464da
Add logo file 2023-06-14 12:27:55 +08:00
HaoRan-hash 9a96ef055c
Solve combined categories (#125)
* Update inference.py
2023-06-07 11:48:08 -07:00
Piotr Skalski 31aa788a3c
🛠️ Fixing typos in README.md 2023-05-23 20:25:50 +02:00
Liu, Hao 427aebd59a
<Feat>: use local transformer model (#110)
<Detail>:

<Footer>:
2023-05-22 15:10:04 +08:00
Karim Umar 39b1472457
minor typo in README (#99)
Co-authored-by: root <root@vmi1286032.contaboserver.net>
2023-05-12 21:43:47 +08:00
Ren Tianhe 654f5e8bf9
Highlight DetGPT 2023-05-10 11:03:10 +08:00
Ren Tianhe 67bb0b634a
Refine README (#89)
* refine readme

* refine
2023-05-06 16:40:39 +08:00
Ren Tianhe 88a8cd6258
Update Citation 2023-05-06 15:56:00 +08:00
Ren Tianhe db4e6d9680
Merge pull request #87 from darshats/main
create "." separated caption
2023-05-04 01:56:40 +08:00
Darshat Shah 168d65d5c4 create "." separated caption 2023-05-03 23:23:32 +05:30
rentainhe a4dcf5d411 fix bug 2023-05-02 19:41:34 +08:00
Ren Tianhe 0dc5ece5a2
Merge pull request #40 from eltociear/patch-1
Update README.md
2023-05-02 17:40:10 +08:00
Ren Tianhe 55d5f31b70
Merge pull request #77 from darshats/main
use model.device when calling legacy predict
2023-05-02 17:36:30 +08:00
Ren Tianhe 562643e178
Merge pull request #79 from pooya-mohammadi/main
Move GroundingDINO_SwinB.cfg.py to GroundingDINO_SwinB_cfg.py
2023-05-02 17:33:06 +08:00
pooya-mohammadi 92766784b0 Move GroundingDINO_SwinB.cfg.py to GroundingDINO_SwinB_cfg.py 2023-04-27 23:11:42 +04:30
Darshat Shah ff94310921 use model.device when calling legacy predict 2023-04-27 12:15:11 +05:30
Ren Tianhe 498048b1b2
Merge pull request #76 from ahmedosman2001/main
Updated README.md
2023-04-26 22:36:10 +08:00
ahmedosman2001 d851b00ed0
Merge pull request #1 from ahmedosman2001/ahmedosman2001-patch-1
Updated README.md
2023-04-26 13:58:37 +01:00
ahmedosman2001 b091a5bb20
Updated README.md
Improved installation and usage instructions.
2023-04-26 13:53:00 +01:00
Piotr Skalski da9f1c0751
Bump `supervision` version to `0.6.0`. 2023-04-21 18:37:17 +02:00
Piotr Skalski 95e0123a14
Add link to Accelerate Image Annotation with SAM and Grounding DINO | Python Tutorial 2023-04-20 21:10:18 +02:00
Dowon 57535c5a79
fix: setup.py TORCH_CUDA_ARCH_LIST (#62) 2023-04-19 11:36:28 +08:00
SlongLiu c43cdb3a95 update cvinw readings 2023-04-15 22:31:21 +08:00
SlongLiu bd61f50091 update tips 2023-04-12 18:40:11 +08:00
SlongLiu dbe0ad8f21 add readme for explainations 2023-04-12 18:11:40 +08:00
Zekun Zhang 049566bdc9
Fix argument parsing bug (#43)
text_threshold was wrongly set by args.box_threshold
2023-04-12 17:18:47 +08:00
Luca Medeiros 19e699c635
add init to datasets (#42) 2023-04-12 13:05:27 +08:00
Ikko Eltociear Ashimine 428ef7fab4
Update README.md
Github -> GitHub
2023-04-12 01:09:25 +09:00
Shilong Liu 9dac4c605b
fix windows bugs (#30) 2023-04-09 22:08:36 +08:00
SlongLiu 3bb2c86c9a update readme with gd-swinb hf links 2023-04-08 16:52:18 +08:00
SlongLiu d3bc35fdea update gligen 2023-04-08 16:38:19 +08:00
SlongLiu 15ade007a8 add grounding dino - B 2023-04-07 17:37:00 +08:00
26 changed files with 2058 additions and 101 deletions

BIN
.asset/cat_dog.jpeg 100644 (binary image added; 120 KiB)

Three further binary image files added (354 KiB, 472 KiB, and 456 KiB); their names are not shown in this view.

35
Dockerfile 100644

@ -0,0 +1,35 @@
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
ARG DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda \
TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX" \
SETUPTOOLS_USE_DISTUTILS=stdlib
RUN conda update conda -y
# Install libraries in the brand new image.
RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
build-essential \
git \
python3-opencv \
ca-certificates && \
rm -rf /var/lib/apt/lists/*
# Set the working directory for all the subsequent Dockerfile instructions.
WORKDIR /opt/program
RUN git clone https://github.com/IDEA-Research/GroundingDINO.git
RUN mkdir weights ; cd weights ; wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth ; cd ..
RUN conda install -c "nvidia/label/cuda-12.1.1" cuda -y
ENV CUDA_HOME=$CONDA_PREFIX
ENV PATH=/usr/local/cuda/bin:$PATH
RUN cd GroundingDINO/ && python -m pip install .
COPY docker_test.py docker_test.py
CMD [ "python", "docker_test.py" ]

LICENSE

@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2020 - present, Facebook, Inc
Copyright 2023 - present, IDEA Research.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

287
README.md

@ -1,41 +1,85 @@
# Grounding DINO
<div align="center">
<img src="./.asset/grounding_dino_logo.png" width="30%">
</div>
---
# :sauropod: Grounding DINO
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-mscoco)](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-odinw)](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=grounding-dino-marrying-dino-with-grounded)
Grounding DINO Methods | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO)
**[IDEA-CVR, IDEA-Research](https://github.com/IDEA-Research)**
[Shilong Liu](http://www.lsl.zone/), [Zhaoyang Zeng](https://scholar.google.com/citations?user=U_cvvUwAAAAJ&hl=zh-CN&oi=ao), [Tianhe Ren](https://rentainhe.github.io/), [Feng Li](https://scholar.google.com/citations?user=ybRe9GcAAAAJ&hl=zh-CN), [Hao Zhang](https://scholar.google.com/citations?user=B8hPxMQAAAAJ&hl=zh-CN), [Jie Yang](https://github.com/yangjie-cv), [Chunyuan Li](https://scholar.google.com/citations?user=Zd7WmXUAAAAJ&hl=zh-CN&oi=ao), [Jianwei Yang](https://jwyang.github.io/), [Hang Su](https://scholar.google.com/citations?hl=en&user=dxN1_X0AAAAJ&view_op=list_works&sortby=pubdate), [Jun Zhu](https://scholar.google.com/citations?hl=en&user=axsP38wAAAAJ), [Lei Zhang](https://www.leizhang.org/)<sup>:email:</sup>.
[[`Paper`](https://arxiv.org/abs/2303.05499)] [[`Demo`](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)] [[`BibTex`](#black_nib-citation)]
PyTorch implementation and pretrained models for Grounding DINO. For details, see the paper **[Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)**.
- 🔥 **[Grounded SAM 2](https://github.com/IDEA-Research/Grounded-SAM-2)** is released now, which combines Grounding DINO with [SAM 2](https://github.com/facebookresearch/segment-anything-2) for tracking any object in open-world scenarios.
- 🔥 **[Grounding DINO 1.5](https://github.com/IDEA-Research/Grounding-DINO-1.5-API)** is released now, which is IDEA Research's **Most Capable** Open-World Object Detection Model!
- 🔥 **[Grounding DINO](https://arxiv.org/abs/2303.05499)** and **[Grounded SAM](https://arxiv.org/abs/2401.14159)** are now supported in Hugging Face Transformers. For more convenient use, you can refer to [this documentation](https://huggingface.co/docs/transformers/model_doc/grounding-dino).
## :sun_with_face: Helpful Tutorial
- :grapes: [[Read our arXiv Paper](https://arxiv.org/abs/2303.05499)]
- :apple: [[Watch our simple introduction video on YouTube](https://youtu.be/wxWDt5UiwY8)]
- :blossom: &nbsp;[[Try the Colab Demo](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)]
- :sunflower: [[Try our Official Huggingface Demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)]
- :maple_leaf: [[Watch the Step by Step Tutorial about GroundingDINO by Roboflow AI](https://youtu.be/cMa77r3YrDk)]
- :mushroom: [[GroundingDINO: Automated Dataset Annotation and Evaluation by Roboflow AI](https://youtu.be/C4NqaRBz_Kw)]
- :hibiscus: [[Accelerate Image Annotation with SAM and GroundingDINO by Roboflow AI](https://youtu.be/oEQYStnF2l8)]
- :white_flower: [[Autodistill: Train YOLOv8 with ZERO Annotations based on Grounding-DINO and Grounded-SAM by Roboflow AI](https://github.com/autodistill/autodistill)]
<!-- Grounding DINO Methods |
[![arXiv](https://img.shields.io/badge/arXiv-2303.05499-b31b1b.svg)](https://arxiv.org/abs/2303.05499)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/wxWDt5UiwY8)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/wxWDt5UiwY8) -->
Grounding DINO Demos |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk)
<!-- Grounding DINO Demos |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) -->
<!-- [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/cMa77r3YrDk)
[![HuggingFace space](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/C4NqaRBz_Kw)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/oEQYStnF2l8)
[![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/C4NqaRBz_Kw) -->
Extensions | [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb);
[Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
## :sparkles: Highlight Projects
- [Semantic-SAM: a universal image segmentation model to enable segment and recognize anything at any desired granularity.](https://github.com/UX-Decoder/Semantic-SAM),
- [DetGPT: Detect What You Need via Reasoning](https://github.com/OptimalScale/DetGPT)
- [Grounded-SAM: Marrying Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
- [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb)
- [Grounding DINO with GLIGEN for Controllable Image Editing](demo/image_editing_with_groundingdino_gligen.ipynb)
- [OpenSeeD: A Simple and Strong Openset Segmentation Model](https://github.com/IDEA-Research/OpenSeeD)
- [SEEM: Segment Everything Everywhere All at Once](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
- [X-GPT: Conversational Visual Agent supported by X-Decoder](https://github.com/microsoft/X-Decoder/tree/xgpt)
- [GLIGEN: Open-Set Grounded Text-to-Image Generation](https://github.com/gligen/GLIGEN)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
<!-- Extensions | [Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything); [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb); [Grounding DINO with GLIGEN](demo/image_editing_with_groundingdino_gligen.ipynb) -->
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-mscoco)](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/zero-shot-object-detection-on-odinw)](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) \
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/grounding-dino-marrying-dino-with-grounded/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=grounding-dino-marrying-dino-with-grounded)
<!-- Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now! -->
Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now!
## Highlight
## :bulb: Highlight
- **Open-Set Detection.** Detect **everything** with language!
- **High Performancce.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
- **High Performance.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
- **Flexible.** Collaboration with Stable Diffusion for Image Editing.
## News
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything) named [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) aims to support segmentation in GroundingDINO.
## :fire: News
- **`2023/07/18`**: We release [Semantic-SAM](https://github.com/UX-Decoder/Semantic-SAM), a universal image segmentation model to enable segment and recognize anything at any desired granularity. **Code** and **checkpoint** are available!
- **`2023/06/17`**: We provide an example to evaluate Grounding DINO on COCO zero-shot performance.
- **`2023/04/15`**: Refer to [CV in the Wild Readings](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings) for those who are interested in open-set recognition!
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN) for more controllable image editing.
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editing.
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything), named **[Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)**, which aims to support segmentation in GroundingDINO.
- **`2023/03/28`**: A YouTube [video](https://youtu.be/cMa77r3YrDk) about Grounding DINO and basic object detection prompt engineering. [[SkalskiP](https://github.com/SkalskiP)]
- **`2023/03/28`**: Add a [demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo) on Hugging Face Space!
- **`2023/03/27`**: Support CPU-only mode. Now the model can run on machines without GPUs.
@ -46,44 +90,184 @@ Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.0
<summary><font size="4">
Description
</font></summary>
<a href="https://arxiv.org/abs/2303.05499">Paper</a> introduction.
<img src=".asset/hero_figure.png" alt="ODinW" width="100%">
Marrying <a href="https://github.com/IDEA-Research/GroundingDINO">Grounding DINO</a> and <a href="https://github.com/gligen/GLIGEN">GLIGEN</a>
<img src="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/GD_GLIGEN.png" alt="gd_gligen" width="100%">
</details>
## :star: Explanations/Tips for Grounding DINO Inputs and Outputs
- Grounding DINO accepts an `(image, text)` pair as inputs.
- It outputs `900` (by default) object boxes. Each box has similarity scores across all input words (as shown in the figures below).
- By default, we keep the boxes whose highest similarity exceeds the `box_threshold`.
- We then extract the words whose similarities are higher than the `text_threshold` as the predicted labels (see the code sketch below the figures).
- If you want to obtain objects for specific phrases, like the `dogs` in the sentence `two dogs with a stick.`, you can select the boxes with the highest text similarity to `dogs` as the final outputs.
- Note that each word can be split into **more than one** token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
- We suggest separating different category names with `.` for Grounding DINO.
![model_explain1](.asset/model_explan1.PNG)
![model_explain2](.asset/model_explan2.PNG)
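A minimal sketch of this box/word selection, using random placeholder tensors in place of real model outputs (the variable names here are illustrative, not the library's):

```python
import torch

# 900 candidate boxes, each scored against (up to) 256 text tokens
logits = torch.rand(900, 256)   # placeholder word-similarity scores in [0, 1]
boxes = torch.rand(900, 4)      # placeholder normalized cxcywh boxes

box_threshold, text_threshold = 0.35, 0.25

# keep boxes whose best word similarity exceeds box_threshold
keep = logits.max(dim=1).values > box_threshold
kept_logits, kept_boxes = logits[keep], boxes[keep]

# for each kept box, the predicted label is built from tokens above text_threshold
for logit, box in zip(kept_logits, kept_boxes):
    token_mask = logit > text_threshold
    print(box.tolist(), int(token_mask.sum()), "tokens above text_threshold")
```

In the real code the kept token positions are mapped back to words with the tokenizer (see `get_phrases_from_posmap` in `demo/inference_on_a_image.py` below), which is why word and token counts can differ.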
## TODO
## :label: TODO
- [x] Release inference code and demo.
- [x] Release checkpoints.
- [ ] Grounding DINO with Stable Diffusion and GLIGEN demos.
- [x] Grounding DINO with Stable Diffusion and GLIGEN demos.
- [ ] Release training codes.
## Install
## :hammer_and_wrench: Install
If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. It will be compiled in CPU-only mode if CUDA is not available.
**Note:**
0. If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. It will be compiled in CPU-only mode if CUDA is not available.
Please make sure to follow the installation steps strictly; otherwise the program may produce:
```bash
NameError: name '_C' is not defined
```
If this happens, please reinstall GroundingDINO by re-cloning the repository and running all the installation steps again.
#### How to check CUDA:
```bash
echo $CUDA_HOME
```
If it prints nothing, the path has not been set up yet.
Run the following so the environment variable is set in the current shell:
```bash
export CUDA_HOME=/path/to/cuda-11.3
```
Note that the CUDA version should match your CUDA runtime, since multiple CUDA versions may be installed on the same machine.
If you want to set `CUDA_HOME` permanently, store it using:
```bash
echo 'export CUDA_HOME=/path/to/cuda' >> ~/.bashrc
```
After that, source the bashrc file and check `CUDA_HOME`:
```bash
source ~/.bashrc
echo $CUDA_HOME
```
In this example, `/path/to/cuda-11.3` should be replaced with the path where your CUDA toolkit is installed. You can find it by typing `which nvcc` in your terminal.
For instance, if the output is `/usr/local/cuda/bin/nvcc`, then:
```bash
export CUDA_HOME=/usr/local/cuda
```
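As an additional sanity check (assuming PyTorch is already installed), you can confirm that PyTorch sees CUDA and which toolkit version it was built against before compiling the extension:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
nvcc --version   # should report a CUDA version compatible with the one printed above
```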
**Installation:**
1. Clone the GroundingDINO repository from GitHub.
```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
```
2. Change the current directory to the GroundingDINO folder.
```bash
cd GroundingDINO/
```
3. Install the required dependencies in the current directory.
```bash
pip install -e .
```
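After installation, a quick importability check for the compiled CUDA extension; the module path `groundingdino._C` is inferred from the `_C` error message above, so treat this as a sketch rather than an official verification step:

```bash
python -c "from groundingdino import _C; print('CUDA extension loaded')"
```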
## Demo
4. Download pre-trained model weights.
```bash
CUDA_VISIBLE_DEVICES=6 python demo/inference_on_a_image.py \
-c /path/to/config \
-p /path/to/checkpoint \
-i .asset/cats.png \
-o "outputs/0" \
-t "cat ear." \
[--cpu-only] # open it for cpu mode
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
```
## :arrow_forward: Demo
Check your GPU ID (only if you're using a GPU)
```bash
nvidia-smi
```
Replace `{GPU ID}`, `image_you_want_to_detect.jpg`, and `"dir you want to save the output"` with appropriate values in the following command
```bash
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p weights/groundingdino_swint_ogc.pth \
-i image_you_want_to_detect.jpg \
-o "dir you want to save the output" \
-t "chair"
[--cpu-only] # add this flag for CPU-only mode
```
If you would like to specify the phrases to detect, here is a demo:
```bash
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p ./groundingdino_swint_ogc.pth \
-i .asset/cat_dog.jpeg \
-o logs/1111 \
-t "There is a cat and a dog in the image ." \
--token_spans "[[[9, 10], [11, 14]], [[19, 20], [21, 24]]]"
[--cpu-only] # add this flag for CPU-only mode
```
The `token_spans` specify the start and end character positions of phrases. For example, the first phrase is `[[9, 10], [11, 14]]`: `"There is a cat and a dog in the image ."[9:10] = 'a'` and `"There is a cat and a dog in the image ."[11:14] = 'cat'`, so it refers to the phrase `a cat`. Similarly, `[[19, 20], [21, 24]]` refers to the phrase `a dog`.
See the `demo/inference_on_a_image.py` for more details.
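A quick standalone check of how these character spans index into the caption (plain Python string slicing, mirroring the example above):

```python
caption = "There is a cat and a dog in the image ."
token_spans = [[[9, 10], [11, 14]], [[19, 20], [21, 24]]]

# join the sub-strings of each span back into a phrase
phrases = [" ".join(caption[s:e] for s, e in span) for span in token_spans]
print(phrases)  # ['a cat', 'a dog']
```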
**Running with Python:**
```python
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
IMAGE_PATH = "weights/dog-3.jpeg"
TEXT_PROMPT = "chair . person . dog ."
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25
image_source, image = load_image(IMAGE_PATH)
boxes, logits, phrases = predict(
model=model,
image=image,
caption=TEXT_PROMPT,
box_threshold=BOX_TRESHOLD,
text_threshold=TEXT_TRESHOLD
)
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_image.jpg", annotated_frame)
```
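If you need pixel-space coordinates rather than the annotated image, here is a short follow-up sketch; it assumes the variables from the block above and that `predict` returns normalized `cxcywh` boxes, and reuses `box_cxcywh_to_xyxy`, the helper also used in `demo/test_ap_on_coco.py` further down:

```python
import torch
from groundingdino.util.box_ops import box_cxcywh_to_xyxy

h, w, _ = image_source.shape  # image_source comes from load_image above
xyxy = box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])  # absolute xyxy pixel boxes
print(list(zip(phrases, xyxy.tolist())))
```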
**Web UI**
We also provide demo code to integrate Grounding DINO with a Gradio Web UI. See the file `demo/gradio_app.py` for more details.
## Checkpoints
**Notebooks**
- We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN) for more controllable image editing.
- We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editing.
## COCO Zero-shot Evaluations
We provide an example of evaluating Grounding DINO's zero-shot performance on COCO. The result should be **48.5**.
```bash
CUDA_VISIBLE_DEVICES=0 \
python demo/test_ap_on_coco.py \
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p weights/groundingdino_swint_ogc.pth \
 --anno_path /path/to/annotations/instances_val2017.json \
 --image_dir /path/to/images/val2017
```
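For reference, the script prompts the model with all COCO category names joined into one caption (mirroring `" . ".join(cat_list) + ' .'` in `demo/test_ap_on_coco.py` shown later); a tiny illustration:

```python
cat_list = ["person", "bicycle", "car"]   # first few COCO category names
caption = " . ".join(cat_list) + " ."
print(caption)                            # person . bicycle . car .
```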
## :luggage: Checkpoints
<!-- insert a table -->
<table>
@ -105,13 +289,22 @@ We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See
<td>Swin-T</td>
<td>O365,GoldG,Cap4M</td>
<td>48.4 (zero-shot) / 57.2 (fine-tune)</td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">Github link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">GitHub link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinT_OGC.py">link</a></td>
</tr>
<tr>
<th>2</th>
<td>GroundingDINO-B</td>
<td>Swin-B</td>
<td>COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO</td>
<td>56.7 </td>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth">GitHub link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swinb_cogcoor.pth">HF link</a>
<td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinB_cfg.py">link</a></td>
</tr>
</tbody>
</table>
## Results
## :medal_military: Results
<details open>
<summary><font size="4">
@ -131,26 +324,27 @@ ODinW Object Detection Results
<summary><font size="4">
Marrying Grounding DINO with <a href="https://github.com/Stability-AI/StableDiffusion">Stable Diffusion</a> for Image Editing
</font></summary>
See our example: demo/image_editing_with_groundingdino_stablediffusion.ipynb .
See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_stablediffusion.ipynb">notebook</a> for more details.
<img src=".asset/GD_SD.png" alt="GD_SD" width="100%">
</details>
<details open>
<summary><font size="4">
Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing
Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing.
</font></summary>
See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_gligen.ipynb">notebook</a> for more details.
<img src=".asset/GD_GLIGEN.png" alt="GD_GLIGEN" width="100%">
</details>
## Model
## :sauropod: Model: Grounding DINO
The model consists of a text backbone, an image backbone, a feature enhancer, a language-guided query selection module, and a cross-modality decoder.
![arch](.asset/arch.png)
## Acknowledgement
## :hearts: Acknowledgement
Our model is related to [DINO](https://github.com/IDEA-Research/DINO) and [GLIP](https://github.com/microsoft/GLIP). Thanks for their great work!
@ -159,14 +353,15 @@ We also thank great previous work including DETR, Deformable DETR, SMCA, Conditi
Thanks [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) and [GLIGEN](https://github.com/gligen/GLIGEN) for their awesome models.
## Citation
## :black_nib: Citation
If you find our work helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@inproceedings{ShilongLiu2023GroundingDM,
title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
@article{liu2023grounding,
title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
journal={arXiv preprint arXiv:2303.05499},
year={2023}
}
```

demo/gradio_app.py

@ -16,7 +16,7 @@ import torch
# prepare the environment
os.system("python setup.py build develop --user")
os.system("pip install packaging==21.3")
os.system("pip install gradio")
os.system("pip install gradio==3.50.2")
warnings.filterwarnings("ignore")

File diff suppressed because one or more lines are too long

demo/inference_on_a_image.py

@ -11,6 +11,7 @@ from groundingdino.models import build_model
from groundingdino.util import box_ops
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
from groundingdino.util.vl_utils import create_positive_map_from_span
def plot_boxes_to_image(image_pil, tgt):
@ -80,7 +81,8 @@ def load_model(model_config_path, model_checkpoint_path, cpu_only=False):
return model
def get_grounding_output(model, image, caption, box_threshold, text_threshold, with_logits=True, cpu_only=False):
def get_grounding_output(model, image, caption, box_threshold, text_threshold=None, with_logits=True, cpu_only=False, token_spans=None):
assert text_threshold is not None or token_spans is not None, "text_threshould and token_spans should not be None at the same time!"
caption = caption.lower()
caption = caption.strip()
if not caption.endswith("."):
@ -90,29 +92,56 @@ def get_grounding_output(model, image, caption, box_threshold, text_threshold, w
image = image.to(device)
with torch.no_grad():
outputs = model(image[None], captions=[caption])
logits = outputs["pred_logits"].cpu().sigmoid()[0] # (nq, 256)
boxes = outputs["pred_boxes"].cpu()[0] # (nq, 4)
logits.shape[0]
logits = outputs["pred_logits"].sigmoid()[0] # (nq, 256)
boxes = outputs["pred_boxes"][0] # (nq, 4)
# filter output
logits_filt = logits.clone()
boxes_filt = boxes.clone()
filt_mask = logits_filt.max(dim=1)[0] > box_threshold
logits_filt = logits_filt[filt_mask] # num_filt, 256
boxes_filt = boxes_filt[filt_mask] # num_filt, 4
logits_filt.shape[0]
if token_spans is None:
logits_filt = logits.cpu().clone()
boxes_filt = boxes.cpu().clone()
filt_mask = logits_filt.max(dim=1)[0] > box_threshold
logits_filt = logits_filt[filt_mask] # num_filt, 256
boxes_filt = boxes_filt[filt_mask] # num_filt, 4
# get phrase
tokenlizer = model.tokenizer
tokenized = tokenlizer(caption)
# build pred
pred_phrases = []
for logit, box in zip(logits_filt, boxes_filt):
pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
if with_logits:
pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
else:
pred_phrases.append(pred_phrase)
else:
# given-phrase mode
positive_maps = create_positive_map_from_span(
model.tokenizer(text_prompt),
token_span=token_spans
).to(image.device) # n_phrase, 256
logits_for_phrases = positive_maps @ logits.T # n_phrase, nq
all_logits = []
all_phrases = []
all_boxes = []
for (token_span, logit_phr) in zip(token_spans, logits_for_phrases):
# get phrase
phrase = ' '.join([caption[_s:_e] for (_s, _e) in token_span])
# get mask
filt_mask = logit_phr > box_threshold
# filt box
all_boxes.append(boxes[filt_mask])
# filt logits
all_logits.append(logit_phr[filt_mask])
if with_logits:
logit_phr_num = logit_phr[filt_mask]
all_phrases.extend([phrase + f"({str(logit.item())[:4]})" for logit in logit_phr_num])
else:
all_phrases.extend([phrase for _ in range(len(filt_mask))])
boxes_filt = torch.cat(all_boxes, dim=0).cpu()
pred_phrases = all_phrases
# get phrase
tokenlizer = model.tokenizer
tokenized = tokenlizer(caption)
# build pred
pred_phrases = []
for logit, box in zip(logits_filt, boxes_filt):
pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
if with_logits:
pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
else:
pred_phrases.append(pred_phrase)
return boxes_filt, pred_phrases
@ -132,6 +161,12 @@ if __name__ == "__main__":
parser.add_argument("--box_threshold", type=float, default=0.3, help="box threshold")
parser.add_argument("--text_threshold", type=float, default=0.25, help="text threshold")
parser.add_argument("--token_spans", type=str, default=None, help=
"The positions of start and end positions of phrases of interest. \
For example, a caption is 'a cat and a dog', \
if you would like to detect 'cat', the token_spans should be '[[[2, 5]], ]', since 'a cat and a dog'[2:5] is 'cat'. \
if you would like to detect 'a cat', the token_spans should be '[[[0, 1], [2, 5]], ]', since 'a cat and a dog'[0:1] is 'a', and 'a cat and a dog'[2:5] is 'cat'. \
")
parser.add_argument("--cpu-only", action="store_true", help="running on cpu only!, default=False")
args = parser.parse_args()
@ -143,7 +178,8 @@ if __name__ == "__main__":
text_prompt = args.text_prompt
output_dir = args.output_dir
box_threshold = args.box_threshold
text_threshold = args.box_threshold
text_threshold = args.text_threshold
token_spans = args.token_spans
# make dir
os.makedirs(output_dir, exist_ok=True)
@ -155,9 +191,15 @@ if __name__ == "__main__":
# visualize raw image
image_pil.save(os.path.join(output_dir, "raw_image.jpg"))
# set the text_threshold to None if token_spans is set.
if token_spans is not None:
text_threshold = None
print("Using token_spans. Set the text_threshold to None.")
# run model
boxes_filt, pred_phrases = get_grounding_output(
model, image, text_prompt, box_threshold, text_threshold, cpu_only=args.cpu_only
model, image, text_prompt, box_threshold, text_threshold, cpu_only=args.cpu_only, token_spans=eval(f"{token_spans}")
)
# visualize pred

demo/test_ap_on_coco.py 100644

@ -0,0 +1,233 @@
import argparse
import os
import sys
import time
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, DistributedSampler
from groundingdino.models import build_model
import groundingdino.datasets.transforms as T
from groundingdino.util import box_ops, get_tokenlizer
from groundingdino.util.misc import clean_state_dict, collate_fn
from groundingdino.util.slconfig import SLConfig
# from torchvision.datasets import CocoDetection
import torchvision
from groundingdino.util.vl_utils import build_captions_and_token_span, create_positive_map_from_span
from groundingdino.datasets.cocogrounding_eval import CocoGroundingEvaluator
def load_model(model_config_path: str, model_checkpoint_path: str, device: str = "cuda"):
args = SLConfig.fromfile(model_config_path)
args.device = device
model = build_model(args)
checkpoint = torch.load(model_checkpoint_path, map_location="cpu")
model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
model.eval()
return model
class CocoDetection(torchvision.datasets.CocoDetection):
def __init__(self, img_folder, ann_file, transforms):
super().__init__(img_folder, ann_file)
self._transforms = transforms
def __getitem__(self, idx):
img, target = super().__getitem__(idx) # target: list
# import ipdb; ipdb.set_trace()
w, h = img.size
boxes = [obj["bbox"] for obj in target]
boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
boxes[:, 2:] += boxes[:, :2] # xywh -> xyxy
boxes[:, 0::2].clamp_(min=0, max=w)
boxes[:, 1::2].clamp_(min=0, max=h)
# filt invalid boxes/masks/keypoints
keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
boxes = boxes[keep]
target_new = {}
image_id = self.ids[idx]
target_new["image_id"] = image_id
target_new["boxes"] = boxes
target_new["orig_size"] = torch.as_tensor([int(h), int(w)])
if self._transforms is not None:
img, target = self._transforms(img, target_new)
return img, target
class PostProcessCocoGrounding(nn.Module):
""" This module converts the model's output into the format expected by the coco api"""
def __init__(self, num_select=300, coco_api=None, tokenlizer=None) -> None:
super().__init__()
self.num_select = num_select
assert coco_api is not None
category_dict = coco_api.dataset['categories']
cat_list = [item['name'] for item in category_dict]
captions, cat2tokenspan = build_captions_and_token_span(cat_list, True)
tokenspanlist = [cat2tokenspan[cat] for cat in cat_list]
positive_map = create_positive_map_from_span(
tokenlizer(captions), tokenspanlist) # 80, 256. normed
id_map = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16, 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31, 27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43, 39: 44, 40: 46,
41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56, 51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72, 63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85, 75: 86, 76: 87, 77: 88, 78: 89, 79: 90}
# build a mapping from label_id to pos_map
new_pos_map = torch.zeros((91, 256))
for k, v in id_map.items():
new_pos_map[v] = positive_map[k]
self.positive_map = new_pos_map
@torch.no_grad()
def forward(self, outputs, target_sizes, not_to_xyxy=False):
""" Perform the computation
Parameters:
outputs: raw outputs of the model
target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
For evaluation, this must be the original image size (before any data augmentation)
For visualization, this should be the image size after data augment, but before padding
"""
num_select = self.num_select
out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']
# pos map to logit
prob_to_token = out_logits.sigmoid() # bs, 100, 256
pos_maps = self.positive_map.to(prob_to_token.device)
# (bs, 100, 256) @ (91, 256).T -> (bs, 100, 91)
prob_to_label = prob_to_token @ pos_maps.T
# if os.environ.get('IPDB_SHILONG_DEBUG', None) == 'INFO':
# import ipdb; ipdb.set_trace()
assert len(out_logits) == len(target_sizes)
assert target_sizes.shape[1] == 2
prob = prob_to_label
topk_values, topk_indexes = torch.topk(
prob.view(out_logits.shape[0], -1), num_select, dim=1)
scores = topk_values
topk_boxes = topk_indexes // prob.shape[2]
labels = topk_indexes % prob.shape[2]
if not_to_xyxy:
boxes = out_bbox
else:
boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
boxes = torch.gather(
boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
# and from relative [0, 1] to absolute [0, height] coordinates
img_h, img_w = target_sizes.unbind(1)
scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
boxes = boxes * scale_fct[:, None, :]
results = [{'scores': s, 'labels': l, 'boxes': b}
for s, l, b in zip(scores, labels, boxes)]
return results
def main(args):
# config
cfg = SLConfig.fromfile(args.config_file)
# build model
model = load_model(args.config_file, args.checkpoint_path)
model = model.to(args.device)
model = model.eval()
# build dataloader
transform = T.Compose(
[
T.RandomResize([800], max_size=1333),
T.ToTensor(),
T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
]
)
dataset = CocoDetection(
args.image_dir, args.anno_path, transforms=transform)
data_loader = DataLoader(
dataset, batch_size=1, shuffle=False, num_workers=args.num_workers, collate_fn=collate_fn)
# build post processor
tokenlizer = get_tokenlizer.get_tokenlizer(cfg.text_encoder_type)
postprocessor = PostProcessCocoGrounding(
coco_api=dataset.coco, tokenlizer=tokenlizer)
# build evaluator
evaluator = CocoGroundingEvaluator(
dataset.coco, iou_types=("bbox",), useCats=True)
# build captions
category_dict = dataset.coco.dataset['categories']
cat_list = [item['name'] for item in category_dict]
caption = " . ".join(cat_list) + ' .'
print("Input text prompt:", caption)
# run inference
start = time.time()
for i, (images, targets) in enumerate(data_loader):
# get images and captions
images = images.tensors.to(args.device)
bs = images.shape[0]
input_captions = [caption] * bs
# feed to the model
outputs = model(images, captions=input_captions)
orig_target_sizes = torch.stack(
[t["orig_size"] for t in targets], dim=0).to(images.device)
results = postprocessor(outputs, orig_target_sizes)
cocogrounding_res = {
target["image_id"]: output for target, output in zip(targets, results)}
evaluator.update(cocogrounding_res)
if (i+1) % 30 == 0:
used_time = time.time() - start
eta = len(data_loader) / (i+1e-5) * used_time - used_time
print(
f"processed {i}/{len(data_loader)} images. time: {used_time:.2f}s, ETA: {eta:.2f}s")
evaluator.synchronize_between_processes()
evaluator.accumulate()
evaluator.summarize()
print("Final results:", evaluator.coco_eval["bbox"].stats.tolist())
if __name__ == "__main__":
parser = argparse.ArgumentParser(
"Grounding DINO eval on COCO", add_help=True)
# load model
parser.add_argument("--config_file", "-c", type=str,
required=True, help="path to config file")
parser.add_argument(
"--checkpoint_path", "-p", type=str, required=True, help="path to checkpoint file"
)
parser.add_argument("--device", type=str, default="cuda",
help="running device (default: cuda)")
# post processing
parser.add_argument("--num_select", type=int, default=300,
help="number of topk to select")
# coco info
parser.add_argument("--anno_path", type=str,
required=True, help="coco root")
parser.add_argument("--image_dir", type=str,
required=True, help="coco image dir")
parser.add_argument("--num_workers", type=int, default=4,
help="number of workers for dataloader")
args = parser.parse_args()
main(args)

8
docker_test.py 100644

@ -0,0 +1,8 @@
from groundingdino.util.inference import load_model, load_image, predict, annotate
import torch
import cv2
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.pyy", "weights/groundingdino_swint_ogc.pth")
model = model.to('cuda:0')
print(torch.cuda.is_available())
print('DONE!')

248
environment.yaml 100644

@ -0,0 +1,248 @@
name: dino
channels:
- pytorch
- nvidia
- conda-forge
- defaults
dependencies:
- addict=2.4.0=pyhd8ed1ab_2
- aiohttp=3.8.5=py39ha55989b_0
- aiosignal=1.3.1=pyhd8ed1ab_0
- asttokens=2.0.5=pyhd3eb1b0_0
- async-timeout=4.0.3=pyhd8ed1ab_0
- attrs=23.1.0=pyh71513ae_1
- aws-c-auth=0.7.0=h6f3c987_2
- aws-c-cal=0.6.0=h6ba3258_0
- aws-c-common=0.8.23=hcfcfb64_0
- aws-c-compression=0.2.17=h420beca_1
- aws-c-event-stream=0.3.1=had47b81_1
- aws-c-http=0.7.11=h72ba615_0
- aws-c-io=0.13.28=ha35c040_0
- aws-c-mqtt=0.8.14=h4941efa_2
- aws-c-s3=0.3.13=he04eaa7_2
- aws-c-sdkutils=0.1.11=h420beca_1
- aws-checksums=0.1.16=h420beca_1
- aws-crt-cpp=0.20.3=h247a981_4
- aws-sdk-cpp=1.10.57=h1a0519f_17
- backcall=0.2.0=pyhd3eb1b0_0
- blas=2.118=mkl
- blas-devel=3.9.0=18_win64_mkl
- brotli=1.0.9=hcfcfb64_9
- brotli-bin=1.0.9=hcfcfb64_9
- brotli-python=1.0.9=py39h99910a6_9
- bzip2=1.0.8=h8ffe710_4
- c-ares=1.19.1=hcfcfb64_0
- ca-certificates=2023.08.22=haa95532_0
- certifi=2023.7.22=py39haa95532_0
- charset-normalizer=3.2.0=pyhd8ed1ab_0
- click=8.1.7=win_pyh7428d3b_0
- colorama=0.4.6=pyhd8ed1ab_0
- comm=0.1.2=py39haa95532_0
- contourpy=1.1.1=py39h1f6ef14_1
- cuda-cccl=12.2.140=0
- cuda-cudart=11.8.89=0
- cuda-cudart-dev=11.8.89=0
- cuda-cupti=11.8.87=0
- cuda-libraries=11.8.0=0
- cuda-libraries-dev=11.8.0=0
- cuda-nvrtc=11.8.89=0
- cuda-nvrtc-dev=11.8.89=0
- cuda-nvtx=11.8.86=0
- cuda-profiler-api=12.2.140=0
- cuda-runtime=11.8.0=0
- cycler=0.11.0=pyhd8ed1ab_0
- cython=3.0.0=py39h2bbff1b_0
- dataclasses=0.8=pyhc8e2a94_3
- datasets=2.14.5=pyhd8ed1ab_0
- debugpy=1.6.7=py39hd77b12b_0
- decorator=5.1.1=pyhd3eb1b0_0
- dill=0.3.7=pyhd8ed1ab_0
- exceptiongroup=1.0.4=py39haa95532_0
- executing=0.8.3=pyhd3eb1b0_0
- filelock=3.12.4=pyhd8ed1ab_0
- fonttools=4.42.1=py39ha55989b_0
- freeglut=3.2.2=h63175ca_2
- freetype=2.12.1=hdaf720e_2
- frozenlist=1.4.0=py39ha55989b_1
- fsspec=2023.6.0=pyh1a96a4e_0
- gettext=0.21.1=h5728263_0
- glib=2.78.0=h12be248_0
- glib-tools=2.78.0=h12be248_0
- gst-plugins-base=1.22.6=h001b923_1
- gstreamer=1.22.6=hb4038d2_1
- huggingface_hub=0.17.3=pyhd8ed1ab_0
- icu=70.1=h0e60522_0
- idna=3.4=pyhd8ed1ab_0
- importlib-metadata=6.8.0=pyha770c72_0
- importlib-resources=6.1.0=pyhd8ed1ab_0
- importlib_metadata=6.8.0=hd8ed1ab_0
- importlib_resources=6.1.0=pyhd8ed1ab_0
- intel-openmp=2023.2.0=h57928b3_49503
- ipykernel=6.25.0=py39h9909e9c_0
- ipython=8.15.0=py39haa95532_0
- jasper=2.0.33=hc2e4405_1
- jedi=0.18.1=py39haa95532_1
- jinja2=3.1.2=pyhd8ed1ab_1
- joblib=1.3.2=pyhd8ed1ab_0
- jpeg=9e=hcfcfb64_3
- jupyter_client=8.1.0=py39haa95532_0
- jupyter_core=5.3.0=py39haa95532_0
- kiwisolver=1.4.5=py39h1f6ef14_1
- krb5=1.20.1=heb0366b_0
- lcms2=2.14=h90d422f_0
- lerc=4.0.0=h63175ca_0
- libabseil=20230125.3=cxx17_h63175ca_0
- libarrow=12.0.1=h12e5d06_5_cpu
- libblas=3.9.0=18_win64_mkl
- libbrotlicommon=1.0.9=hcfcfb64_9
- libbrotlidec=1.0.9=hcfcfb64_9
- libbrotlienc=1.0.9=hcfcfb64_9
- libcblas=3.9.0=18_win64_mkl
- libclang=15.0.7=default_h77d9078_3
- libclang13=15.0.7=default_h77d9078_3
- libcrc32c=1.1.2=h0e60522_0
- libcublas=11.11.3.6=0
- libcublas-dev=11.11.3.6=0
- libcufft=10.9.0.58=0
- libcufft-dev=10.9.0.58=0
- libcurand=10.3.3.141=0
- libcurand-dev=10.3.3.141=0
- libcurl=8.1.2=h68f0423_0
- libcusolver=11.4.1.48=0
- libcusolver-dev=11.4.1.48=0
- libcusparse=11.7.5.86=0
- libcusparse-dev=11.7.5.86=0
- libdeflate=1.14=hcfcfb64_0
- libevent=2.1.12=h3671451_1
- libffi=3.4.2=h8ffe710_5
- libglib=2.78.0=he8f3873_0
- libgoogle-cloud=2.12.0=h00b2bdc_1
- libgrpc=1.54.3=ha177ca7_0
- libhwloc=2.9.3=default_haede6df_1009
- libiconv=1.17=h8ffe710_0
- liblapack=3.9.0=18_win64_mkl
- liblapacke=3.9.0=18_win64_mkl
- libnpp=11.8.0.86=0
- libnpp-dev=11.8.0.86=0
- libnvjpeg=11.9.0.86=0
- libnvjpeg-dev=11.9.0.86=0
- libogg=1.3.4=h8ffe710_1
- libopencv=4.5.3=py39h488c12c_8
- libpng=1.6.39=h19919ed_0
- libprotobuf=3.21.12=h12be248_2
- libsodium=1.0.18=h62dcd97_0
- libsqlite=3.43.0=hcfcfb64_0
- libssh2=1.11.0=h7dfc565_0
- libthrift=0.18.1=h06f6336_2
- libtiff=4.4.0=hc4f729c_5
- libutf8proc=2.8.0=h82a8f57_0
- libuv=1.44.2=hcfcfb64_1
- libvorbis=1.3.7=h0e60522_0
- libwebp-base=1.3.2=hcfcfb64_0
- libxcb=1.13=hcd874cb_1004
- libxml2=2.11.5=hc3477c8_1
- libzlib=1.2.13=hcfcfb64_5
- lz4-c=1.9.4=hcfcfb64_0
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=2.1.3=py39ha55989b_1
- matplotlib-base=3.8.0=py39hf19769e_1
- matplotlib-inline=0.1.6=py39haa95532_0
- mkl=2022.1.0=h6a75c08_874
- mkl-devel=2022.1.0=h57928b3_875
- mkl-include=2022.1.0=h6a75c08_874
- mpmath=1.3.0=pyhd8ed1ab_0
- msys2-conda-epoch=20160418=1
- multidict=6.0.4=py39ha55989b_0
- multiprocess=0.70.15=py39ha55989b_1
- munkres=1.1.4=pyh9f0ad1d_0
- nest-asyncio=1.5.6=py39haa95532_0
- networkx=3.1=pyhd8ed1ab_0
- numpy=1.26.0=py39hddb5d58_0
- opencv=4.5.3=py39hcbf5309_8
- openjpeg=2.5.0=hc9384bd_1
- openssl=3.1.3=hcfcfb64_0
- orc=1.9.0=hada7b9e_1
- packaging=23.1=pyhd8ed1ab_0
- pandas=2.1.1=py39h32e6231_0
- parso=0.8.3=pyhd3eb1b0_0
- pcre2=10.40=h17e33f8_0
- pickleshare=0.7.5=pyhd3eb1b0_1003
- pillow=9.2.0=py39h595c93f_3
- pip=23.2.1=pyhd8ed1ab_0
- platformdirs=3.10.0=pyhd8ed1ab_0
- prompt-toolkit=3.0.36=py39haa95532_0
- psutil=5.9.0=py39h2bbff1b_0
- pthread-stubs=0.4=hcd874cb_1001
- pthreads-win32=2.9.1=hfa6e2cd_3
- pure_eval=0.2.2=pyhd3eb1b0_0
- py-opencv=4.5.3=py39h00e5391_8
- pyarrow=12.0.1=py39hca4e8af_5_cpu
- pycocotools=2.0.6=py39hc266a54_1
- pygments=2.15.1=py39haa95532_1
- pyparsing=3.1.1=pyhd8ed1ab_0
- pysocks=1.7.1=pyh0701188_6
- python=3.9.18=h4de0772_0_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-tzdata=2023.3=pyhd8ed1ab_0
- python-xxhash=3.3.0=py39ha55989b_1
- python_abi=3.9=4_cp39
- pytorch=2.0.1=py3.9_cuda11.8_cudnn8_0
- pytorch-cuda=11.8=h24eeafa_5
- pytorch-mutex=1.0=cuda
- pytz=2023.3.post1=pyhd8ed1ab_0
- pywin32=305=py39h2bbff1b_0
- pyyaml=6.0.1=py39ha55989b_1
- pyzmq=25.1.0=py39hd77b12b_0
- qt-main=5.15.8=h720456b_6
- re2=2023.03.02=hd4eee63_0
- regex=2023.8.8=py39ha55989b_1
- requests=2.31.0=pyhd8ed1ab_0
- sacremoses=0.0.53=pyhd8ed1ab_0
- safetensors=0.3.3=py39hf21820d_1
- setuptools=68.2.2=pyhd8ed1ab_0
- six=1.16.0=pyh6c4a22f_0
- snappy=1.1.10=hfb803bf_0
- stack_data=0.2.0=pyhd3eb1b0_0
- sympy=1.12=pyh04b8f61_3
- tbb=2021.10.0=h91493d7_1
- timm=0.9.7=pyhd8ed1ab_0
- tk=8.6.13=hcfcfb64_0
- tokenizers=0.13.3=py39hca44cb7_0
- tomli=2.0.1=pyhd8ed1ab_0
- tornado=6.3.2=py39h2bbff1b_0
- tqdm=4.66.1=pyhd8ed1ab_0
- traitlets=5.7.1=py39haa95532_0
- transformers=4.33.2=pyhd8ed1ab_0
- typing-extensions=4.8.0=hd8ed1ab_0
- typing_extensions=4.8.0=pyha770c72_0
- tzdata=2023c=h71feb2d_0
- ucrt=10.0.22621.0=h57928b3_0
- unicodedata2=15.0.0=py39ha55989b_1
- urllib3=2.0.5=pyhd8ed1ab_0
- vc=14.3=h64f974e_17
- vc14_runtime=14.36.32532=hdcecf7f_17
- vs2015_runtime=14.36.32532=h05e6639_17
- wcwidth=0.2.5=pyhd3eb1b0_0
- wheel=0.41.2=pyhd8ed1ab_0
- win_inet_pton=1.1.0=pyhd8ed1ab_6
- xorg-libxau=1.0.11=hcd874cb_0
- xorg-libxdmcp=1.1.3=hcd874cb_0
- xxhash=0.8.2=hcfcfb64_0
- xz=5.2.6=h8d14728_0
- yaml=0.2.5=h8ffe710_2
- yapf=0.40.1=pyhd8ed1ab_0
- yarl=1.9.2=py39ha55989b_0
- zeromq=4.3.4=hd77b12b_0
- zipp=3.17.0=pyhd8ed1ab_0
- zlib=1.2.13=hcfcfb64_5
- zstd=1.5.5=h12be248_0
- pip:
- opencv-python==4.8.0.76
- supervision==0.6.0
- torchaudio==2.0.2
- torchvision==0.15.2
prefix: C:\Users\Makoto\miniconda3\envs\dino

groundingdino/config/GroundingDINO_SwinB_cfg.py 100644

@ -0,0 +1,43 @@
batch_size = 1
modelname = "groundingdino"
backbone = "swin_B_384_22k"
position_embedding = "sine"
pe_temperatureH = 20
pe_temperatureW = 20
return_interm_indices = [1, 2, 3]
backbone_freeze_keywords = None
enc_layers = 6
dec_layers = 6
pre_norm = False
dim_feedforward = 2048
hidden_dim = 256
dropout = 0.0
nheads = 8
num_queries = 900
query_dim = 4
num_patterns = 0
num_feature_levels = 4
enc_n_points = 4
dec_n_points = 4
two_stage_type = "standard"
two_stage_bbox_embed_share = False
two_stage_class_embed_share = False
transformer_activation = "relu"
dec_pred_bbox_embed_share = True
dn_box_noise_scale = 1.0
dn_label_noise_ratio = 0.5
dn_label_coef = 1.0
dn_bbox_coef = 1.0
embed_init_tgt = True
dn_labelbook_size = 2000
max_text_len = 256
text_encoder_type = "bert-base-uncased"
use_text_enhancer = True
use_fusion_layer = True
use_checkpoint = True
use_transformer_ckpt = True
use_text_cross_attention = True
text_dropout = 0.0
fusion_dropout = 0.0
fusion_droppath = 0.1
sub_sentence_present = True


groundingdino/datasets/cocogrounding_eval.py 100644

@ -0,0 +1,269 @@
# ------------------------------------------------------------------------
# Grounding DINO. Modified by Shilong Liu.
# url: https://github.com/IDEA-Research/GroundingDINO
# Copyright (c) 2023 IDEA. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
# ------------------------------------------------------------------------
# Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
COCO evaluator that works in distributed mode.
Mostly copy-paste from https://github.com/pytorch/vision/blob/edfd5a7/references/detection/coco_eval.py
The difference is that there is less copy-pasting from pycocotools
in the end of the file, as python3 can suppress prints with contextlib
"""
import contextlib
import copy
import os
import numpy as np
import pycocotools.mask as mask_util
import torch
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
from groundingdino.util.misc import all_gather
class CocoGroundingEvaluator(object):
    def __init__(self, coco_gt, iou_types, useCats=True):
        assert isinstance(iou_types, (list, tuple))
        coco_gt = copy.deepcopy(coco_gt)
        self.coco_gt = coco_gt
        self.iou_types = iou_types
        self.coco_eval = {}
        for iou_type in iou_types:
            self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
            self.coco_eval[iou_type].useCats = useCats
        self.img_ids = []
        self.eval_imgs = {k: [] for k in iou_types}
        self.useCats = useCats

    def update(self, predictions):
        img_ids = list(np.unique(list(predictions.keys())))
        self.img_ids.extend(img_ids)

        for iou_type in self.iou_types:
            results = self.prepare(predictions, iou_type)

            # suppress pycocotools prints
            with open(os.devnull, "w") as devnull:
                with contextlib.redirect_stdout(devnull):
                    coco_dt = COCO.loadRes(self.coco_gt, results) if results else COCO()

            coco_eval = self.coco_eval[iou_type]
            coco_eval.cocoDt = coco_dt
            coco_eval.params.imgIds = list(img_ids)
            coco_eval.params.useCats = self.useCats
            img_ids, eval_imgs = evaluate(coco_eval)

            self.eval_imgs[iou_type].append(eval_imgs)

    def synchronize_between_processes(self):
        for iou_type in self.iou_types:
            self.eval_imgs[iou_type] = np.concatenate(self.eval_imgs[iou_type], 2)
            create_common_coco_eval(self.coco_eval[iou_type], self.img_ids, self.eval_imgs[iou_type])

    def accumulate(self):
        for coco_eval in self.coco_eval.values():
            coco_eval.accumulate()

    def summarize(self):
        for iou_type, coco_eval in self.coco_eval.items():
            print("IoU metric: {}".format(iou_type))
            coco_eval.summarize()

    def prepare(self, predictions, iou_type):
        if iou_type == "bbox":
            return self.prepare_for_coco_detection(predictions)
        elif iou_type == "segm":
            return self.prepare_for_coco_segmentation(predictions)
        elif iou_type == "keypoints":
            return self.prepare_for_coco_keypoint(predictions)
        else:
            raise ValueError("Unknown iou type {}".format(iou_type))

    def prepare_for_coco_detection(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            boxes = prediction["boxes"]
            boxes = convert_to_xywh(boxes).tolist()
            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "bbox": box,
                        "score": scores[k],
                    }
                    for k, box in enumerate(boxes)
                ]
            )
        return coco_results

    def prepare_for_coco_segmentation(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            scores = prediction["scores"]
            labels = prediction["labels"]
            masks = prediction["masks"]

            masks = masks > 0.5

            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()

            rles = [
                mask_util.encode(np.array(mask[0, :, :, np.newaxis], dtype=np.uint8, order="F"))[0]
                for mask in masks
            ]
            for rle in rles:
                rle["counts"] = rle["counts"].decode("utf-8")

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "segmentation": rle,
                        "score": scores[k],
                    }
                    for k, rle in enumerate(rles)
                ]
            )
        return coco_results

    def prepare_for_coco_keypoint(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            boxes = prediction["boxes"]
            boxes = convert_to_xywh(boxes).tolist()
            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()
            keypoints = prediction["keypoints"]
            keypoints = keypoints.flatten(start_dim=1).tolist()

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "keypoints": keypoint,
                        "score": scores[k],
                    }
                    for k, keypoint in enumerate(keypoints)
                ]
            )
        return coco_results


def convert_to_xywh(boxes):
    xmin, ymin, xmax, ymax = boxes.unbind(1)
    return torch.stack((xmin, ymin, xmax - xmin, ymax - ymin), dim=1)


def merge(img_ids, eval_imgs):
    all_img_ids = all_gather(img_ids)
    all_eval_imgs = all_gather(eval_imgs)

    merged_img_ids = []
    for p in all_img_ids:
        merged_img_ids.extend(p)

    merged_eval_imgs = []
    for p in all_eval_imgs:
        merged_eval_imgs.append(p)

    merged_img_ids = np.array(merged_img_ids)
    merged_eval_imgs = np.concatenate(merged_eval_imgs, 2)

    # keep only unique (and in sorted order) images
    merged_img_ids, idx = np.unique(merged_img_ids, return_index=True)
    merged_eval_imgs = merged_eval_imgs[..., idx]

    return merged_img_ids, merged_eval_imgs


def create_common_coco_eval(coco_eval, img_ids, eval_imgs):
    img_ids, eval_imgs = merge(img_ids, eval_imgs)
    img_ids = list(img_ids)
    eval_imgs = list(eval_imgs.flatten())

    coco_eval.evalImgs = eval_imgs
    coco_eval.params.imgIds = img_ids
    coco_eval._paramsEval = copy.deepcopy(coco_eval.params)


#################################################################
# From pycocotools, just removed the prints and fixed
# a Python3 bug about unicode not defined
#################################################################


def evaluate(self):
    """
    Run per image evaluation on given images and store results (a list of dict) in self.evalImgs
    :return: None
    """
    # tic = time.time()
    # print('Running per image evaluation...')
    p = self.params
    # add backward compatibility if useSegm is specified in params
    if p.useSegm is not None:
        p.iouType = "segm" if p.useSegm == 1 else "bbox"
        print("useSegm (deprecated) is not None. Running {} evaluation".format(p.iouType))
    # print('Evaluate annotation type *{}*'.format(p.iouType))
    p.imgIds = list(np.unique(p.imgIds))
    if p.useCats:
        p.catIds = list(np.unique(p.catIds))
    p.maxDets = sorted(p.maxDets)
    self.params = p

    self._prepare()
    # loop through images, area range, max detection number
    catIds = p.catIds if p.useCats else [-1]

    if p.iouType == "segm" or p.iouType == "bbox":
        computeIoU = self.computeIoU
    elif p.iouType == "keypoints":
        computeIoU = self.computeOks
    self.ious = {
        (imgId, catId): computeIoU(imgId, catId)
        for imgId in p.imgIds
        for catId in catIds}

    evaluateImg = self.evaluateImg
    maxDet = p.maxDets[-1]
    evalImgs = [
        evaluateImg(imgId, catId, areaRng, maxDet)
        for catId in catIds
        for areaRng in p.areaRng
        for imgId in p.imgIds
    ]
    # this is NOT in the pycocotools code, but could be done outside
    evalImgs = np.asarray(evalImgs).reshape(len(catIds), len(p.areaRng), len(p.imgIds))
    self._paramsEval = copy.deepcopy(self.params)
    # toc = time.time()
    # print('DONE (t={:0.2f}s).'.format(toc-tic))
    return p.imgIds, evalImgs


#################################################################
# end of straight copy from pycocotools, just removing the prints
#################################################################
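A minimal usage sketch of the evaluator above. The enclosing class is defined earlier in this file; it is referred to here as CocoGroundingEvaluator on assumption, and the annotation path and `predictions` dict are placeholders (`predictions` maps each COCO image id to a dict with xyxy "boxes", "scores" and "labels" tensors):

from pycocotools.coco import COCO

coco_gt = COCO("annotations/instances_val2017.json")              # placeholder annotation file
evaluator = CocoGroundingEvaluator(coco_gt, iou_types=["bbox"])   # assumed name of the class above

evaluator.update(predictions)               # predictions: {image_id: {"boxes", "scores", "labels"}}
evaluator.synchronize_between_processes()   # gathers per-image results across workers via all_gather
evaluator.accumulate()
evaluator.summarize()                       # prints the standard COCO AP/AR table per IoU type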

View File

@@ -206,6 +206,21 @@ class GroundingDINO(nn.Module):
nn.init.xavier_uniform_(proj[0].weight, gain=1)
nn.init.constant_(proj[0].bias, 0)
def set_image_tensor(self, samples: NestedTensor):
if isinstance(samples, (list, torch.Tensor)):
samples = nested_tensor_from_tensor_list(samples)
self.features, self.poss = self.backbone(samples)
def unset_image_tensor(self):
if hasattr(self, 'features'):
del self.features
if hasattr(self,'poss'):
del self.poss
def set_image_features(self, features , poss):
self.features = features
self.poss = poss
def init_ref_points(self, use_num_queries):
self.refpoint_embed = nn.Embedding(use_num_queries, self.query_dim)
@@ -228,7 +243,6 @@ class GroundingDINO(nn.Module):
captions = kw["captions"]
else:
captions = [t["caption"] for t in targets]
len(captions)
# encoder texts
tokenized = self.tokenizer(captions, padding="longest", return_tensors="pt").to(
@@ -283,14 +297,14 @@ class GroundingDINO(nn.Module):
}
# import ipdb; ipdb.set_trace()
if isinstance(samples, (list, torch.Tensor)):
samples = nested_tensor_from_tensor_list(samples)
features, poss = self.backbone(samples)
if not hasattr(self, 'features') or not hasattr(self, 'poss'):
self.set_image_tensor(samples)
srcs = []
masks = []
for l, feat in enumerate(features):
for l, feat in enumerate(self.features):
src, mask = feat.decompose()
srcs.append(self.input_proj[l](src))
masks.append(mask)
@@ -299,7 +313,7 @@ class GroundingDINO(nn.Module):
_len_srcs = len(srcs)
for l in range(_len_srcs, self.num_feature_levels):
if l == _len_srcs:
src = self.input_proj[l](features[-1].tensors)
src = self.input_proj[l](self.features[-1].tensors)
else:
src = self.input_proj[l](srcs[-1])
m = samples.mask
@@ -307,11 +321,11 @@ class GroundingDINO(nn.Module):
pos_l = self.backbone[1](NestedTensor(src, mask)).to(src.dtype)
srcs.append(src)
masks.append(mask)
poss.append(pos_l)
self.poss.append(pos_l)
input_query_bbox = input_query_label = attn_mask = dn_meta = None
hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
srcs, masks, input_query_bbox, poss, input_query_label, attn_mask, text_dict
srcs, masks, input_query_bbox, self.poss, input_query_label, attn_mask, text_dict
)
# deformable-detr-like anchor update
@@ -345,7 +359,9 @@ class GroundingDINO(nn.Module):
# interm_class = self.transformer.enc_out_class_embed(hs_enc[-1], text_dict)
# out['interm_outputs'] = {'pred_logits': interm_class, 'pred_boxes': interm_coord}
# out['interm_outputs_for_matching_pre'] = {'pred_logits': interm_class, 'pred_boxes': init_box_proposal}
unset_image_tensor = kw.get('unset_image_tensor', True)
if unset_image_tensor:
self.unset_image_tensor() ## If necessary
return out
@torch.jit.unused
@@ -393,3 +409,4 @@ def build_groundingdino(args):
)
return model
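The set_image_tensor / unset_image_tensor hooks added above let the image backbone run once while several text prompts are scored against the cached features. A minimal sketch under assumptions: the config and checkpoint paths are placeholders, load_model is assumed to accept a device argument, and the keyword names (captions, unset_image_tensor) follow the forward code shown above.

import torch
from groundingdino.util.inference import load_model
from groundingdino.util.misc import nested_tensor_from_tensor_list

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth", device="cpu")   # placeholder paths
samples = nested_tensor_from_tensor_list([torch.randn(3, 800, 800)])      # stand-in image batch

model.set_image_tensor(samples)                        # backbone runs once, features are cached
with torch.no_grad():
    for caption in ["a dog .", "a person ."]:
        out = model(samples, captions=[caption], unset_image_tensor=False)
model.unset_image_tensor()                             # release the cached features when done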

View File

@@ -1,5 +1,5 @@
from transformers import AutoTokenizer, BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast
import os
def get_tokenlizer(text_encoder_type):
if not isinstance(text_encoder_type, str):
@@ -8,6 +8,8 @@ def get_tokenlizer(text_encoder_type):
text_encoder_type = text_encoder_type.text_encoder_type
elif text_encoder_type.get("text_encoder_type", False):
text_encoder_type = text_encoder_type.get("text_encoder_type")
elif os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type):
pass
else:
raise ValueError(
"Unknown type of text_encoder_type: {}".format(type(text_encoder_type))
@@ -19,8 +21,9 @@ def get_tokenlizer(text_encoder_type):
def get_pretrained_language_model(text_encoder_type):
if text_encoder_type == "bert-base-uncased":
if text_encoder_type == "bert-base-uncased" or (os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type)):
return BertModel.from_pretrained(text_encoder_type)
if text_encoder_type == "roberta-base":
return RobertaModel.from_pretrained(text_encoder_type)
raise ValueError("Unknown text_encoder_type {}".format(text_encoder_type))

View File

@@ -6,6 +6,7 @@ import supervision as sv
import torch
from PIL import Image
from torchvision.ops import box_convert
import bisect
import groundingdino.datasets.transforms as T
from groundingdino.models import build_model
@@ -55,7 +56,8 @@ def predict(
caption: str,
box_threshold: float,
text_threshold: float,
device: str = "cuda"
device: str = "cuda",
remove_combined: bool = False
) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
caption = preprocess_caption(caption=caption)
@@ -74,17 +76,40 @@ def predict(
tokenizer = model.tokenizer
tokenized = tokenizer(caption)
phrases = [
get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
for logit
in logits
]
if remove_combined:
sep_idx = [i for i in range(len(tokenized['input_ids'])) if tokenized['input_ids'][i] in [101, 102, 1012]]
phrases = []
for logit in logits:
max_idx = logit.argmax()
insert_idx = bisect.bisect_left(sep_idx, max_idx)
right_idx = sep_idx[insert_idx]
left_idx = sep_idx[insert_idx - 1]
phrases.append(get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer, left_idx, right_idx).replace('.', ''))
else:
phrases = [
get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
for logit
in logits
]
return boxes, logits.max(dim=1)[0], phrases
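The remove_combined branch above clips each positive map to the prompt phrase that contains its highest-scoring token, using BERT's special token ids (101 = [CLS], 102 = [SEP], 1012 = '.') as phrase boundaries. A small sketch of that boundary lookup with hypothetical token ids for the caption "chair . person . dog .":

import bisect

input_ids = [101, 4521, 1012, 2711, 1012, 3899, 1012, 102]                # hypothetical tokenization
sep_idx = [i for i, t in enumerate(input_ids) if t in (101, 102, 1012)]   # -> [0, 2, 4, 6, 7]
max_idx = 5                                    # position of the logit's highest-scoring token ("dog")
insert_idx = bisect.bisect_left(sep_idx, max_idx)
left_idx, right_idx = sep_idx[insert_idx - 1], sep_idx[insert_idx]
print(left_idx, right_idx)                     # -> 4 6, so only the "dog" span survives the clamp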
def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor, phrases: List[str]) -> np.ndarray:
"""
This function annotates an image with bounding boxes and labels.
Parameters:
image_source (np.ndarray): The source image to be annotated.
boxes (torch.Tensor): A tensor containing bounding box coordinates.
logits (torch.Tensor): A tensor containing confidence scores for each bounding box.
phrases (List[str]): A list of labels for each bounding box.
Returns:
np.ndarray: The annotated image.
"""
h, w, _ = image_source.shape
boxes = boxes * torch.Tensor([w, h, w, h])
xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
@@ -96,9 +121,11 @@ def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor
in zip(phrases, logits)
]
box_annotator = sv.BoxAnnotator()
bbox_annotator = sv.BoxAnnotator(color_lookup=sv.ColorLookup.INDEX)
label_annotator = sv.LabelAnnotator(color_lookup=sv.ColorLookup.INDEX)
annotated_frame = cv2.cvtColor(image_source, cv2.COLOR_RGB2BGR)
annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
annotated_frame = bbox_annotator.annotate(scene=annotated_frame, detections=detections)
annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
return annotated_frame
@@ -153,7 +180,8 @@ class Model:
image=processed_image,
caption=caption,
box_threshold=box_threshold,
text_threshold=text_threshold)
text_threshold=text_threshold,
device=self.device)
source_h, source_w, _ = image.shape
detections = Model.post_process_result(
source_h=source_h,
@@ -188,14 +216,15 @@ class Model:
box_annotator = sv.BoxAnnotator()
annotated_image = box_annotator.annotate(scene=image, detections=detections)
"""
caption = ", ".join(classes)
caption = ". ".join(classes)
processed_image = Model.preprocess_image(image_bgr=image).to(self.device)
boxes, logits, phrases = predict(
model=self.model,
image=processed_image,
caption=caption,
box_threshold=box_threshold,
text_threshold=text_threshold)
text_threshold=text_threshold,
device=self.device)
source_h, source_w, _ = image.shape
detections = Model.post_process_result(
source_h=source_h,
@@ -235,8 +264,10 @@ class Model:
def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
class_ids = []
for phrase in phrases:
try:
class_ids.append(classes.index(phrase))
except ValueError:
for class_ in classes:
if class_ in phrase:
class_ids.append(classes.index(class_))
break
else:
class_ids.append(None)
return np.array(class_ids)
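A short example of the matching logic above: an exact class name resolves first, otherwise the first class that appears as a substring of the predicted phrase is used, and a phrase matching nothing maps to None. Inputs below are hypothetical; Model.phrases2classes is the static method shown above.

from groundingdino.util.inference import Model

classes = ["person", "dog"]                      # hypothetical prompt classes
phrases = ["dog", "black dog", "chair"]          # hypothetical predicted phrases
class_ids = Model.phrases2classes(phrases=phrases, classes=classes)
print(class_ids)                                 # -> [1 1 None]: exact hit, substring hit, no match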

View File

@@ -2,6 +2,7 @@
# Modified from mmcv
# ==========================================================
import ast
import os
import os.path as osp
import shutil
import sys
@@ -80,6 +81,8 @@ class SLConfig(object):
with tempfile.TemporaryDirectory() as temp_config_dir:
temp_config_file = tempfile.NamedTemporaryFile(dir=temp_config_dir, suffix=".py")
temp_config_name = osp.basename(temp_config_file.name)
if os.name == 'nt':
temp_config_file.close()
shutil.copyfile(filename, osp.join(temp_config_dir, temp_config_name))
temp_module_name = osp.splitext(temp_config_name)[0]
sys.path.insert(0, temp_config_dir)
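The os.name == 'nt' guard above exists because Windows keeps an exclusive handle on a NamedTemporaryFile, so shutil.copyfile cannot reopen the same path until the handle is closed (on Unix it can). A simplified standalone illustration of the same pattern, using delete=False rather than the default and a placeholder source file:

import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as temp_dir:
    tmp = tempfile.NamedTemporaryFile(dir=temp_dir, suffix=".py", delete=False)
    if os.name == "nt":
        tmp.close()                               # Windows: release the handle before reusing the path
    shutil.copyfile("my_config.py", tmp.name)     # placeholder config file being staged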

View File

@@ -597,10 +597,12 @@ def targets_to(targets: List[Dict[str, Any]], device):
def get_phrases_from_posmap(
posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer
posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer, left_idx: int = 0, right_idx: int = 255
):
assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
if posmap.dim() == 1:
posmap[0: left_idx + 1] = False
posmap[right_idx:] = False
non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
token_ids = [tokenized["input_ids"][i] for i in non_zero_idx]
return tokenizer.decode(token_ids)
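The new left_idx/right_idx parameters clamp the positive map before decoding, so tokens that light up outside the selected phrase window are dropped. A tiny illustration with a hand-made boolean posmap (values are hypothetical):

import torch

posmap = torch.tensor([False, True, False, True, False, False])   # hits in two different phrases
left_idx, right_idx = 2, 4                                         # window of the phrase being kept
posmap[: left_idx + 1] = False
posmap[right_idx:] = False
print(posmap.nonzero(as_tuple=True)[0].tolist())                   # -> [3]; only the in-window token remains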

View File

@@ -1 +0,0 @@
__version__ = "0.1.0"

View File

@@ -6,5 +6,5 @@ yapf
timm
numpy
opencv-python
supervision==0.4.0
pycocotools
supervision>=0.22.0
pycocotools

View File

@@ -24,6 +24,18 @@ import glob
import os
import subprocess
import subprocess
import sys
def install_torch():
try:
import torch
except ImportError:
subprocess.check_call([sys.executable, "-m", "pip", "install", "torch"])
# Call the function to ensure torch is installed
install_torch()
import torch
from setuptools import find_packages, setup
from torch.utils.cpp_extension import CUDA_HOME, CppExtension, CUDAExtension
@@ -70,7 +82,7 @@ def get_extensions():
extra_compile_args = {"cxx": []}
define_macros = []
if torch.cuda.is_available() and CUDA_HOME is not None:
if CUDA_HOME is not None and (torch.cuda.is_available() or "TORCH_CUDA_ARCH_LIST" in os.environ):
print("Compiling with CUDA")
extension = CUDAExtension
sources += source_cuda
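With the relaxed condition above, the CUDA extension can be compiled on a machine where torch.cuda.is_available() returns False at build time (for example inside docker build), as long as CUDA_HOME points at a toolkit and the target architectures are declared. A hedged sketch; the architecture list is a placeholder for your GPUs:

import os
import subprocess
import sys

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;8.6"     # placeholder compute capabilities
# CUDA_HOME must also be set, e.g. os.environ["CUDA_HOME"] = "/usr/local/cuda"
subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "."])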

test.ipynb 100644 (new file, +114 lines)
View File

@@ -0,0 +1,114 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"final text_encoder_type: bert-base-uncased\n"
]
},
{
"data": {
"application/json": {
"ascii": false,
"bar_format": null,
"colour": null,
"elapsed": 0.014210224151611328,
"initial": 0,
"n": 0,
"ncols": null,
"nrows": null,
"postfix": null,
"prefix": "Downloading model.safetensors",
"rate": null,
"total": 440449768,
"unit": "B",
"unit_divisor": 1000,
"unit_scale": true
},
"application/vnd.jupyter.widget-view+json": {
"model_id": "5922f34578364d36afa13de9f01254bd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading model.safetensors: 0%| | 0.00/440M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/root/miniconda3/lib/python3.8/site-packages/transformers/modeling_utils.py:881: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.\n",
" warnings.warn(\n",
"/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None\n",
" warnings.warn(\"None of the inputs have requires_grad=True. Gradients will be None\")\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from groundingdino.util.inference import load_model, load_image, predict, annotate\n",
"import cv2\n",
"\n",
"model = load_model(\"groundingdino/config/GroundingDINO_SwinT_OGC.py\", \"../04-06-segment-anything/weights/groundingdino_swint_ogc.pth\")\n",
"IMAGE_PATH = \".asset/cat_dog.jpeg\"\n",
"TEXT_PROMPT = \"chair . person . dog .\"\n",
"BOX_TRESHOLD = 0.35\n",
"TEXT_TRESHOLD = 0.25\n",
"\n",
"image_source, image = load_image(IMAGE_PATH)\n",
"\n",
"boxes, logits, phrases = predict(\n",
" model=model,\n",
" image=image,\n",
" caption=TEXT_PROMPT,\n",
" box_threshold=BOX_TRESHOLD,\n",
" text_threshold=TEXT_TRESHOLD\n",
")\n",
"\n",
"annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)\n",
"cv2.imwrite(\"annotated_image.jpg\", annotated_frame)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}