Compare commits: v0.1.0-alpha...main (62 commits)

Commits included in this comparison (SHA1):
856dde20ae, 5a890bd867, e49e881edd, 8b6a55f612, e27a646ca0, d75c95daf6, df5b48a3ef, 4330960fa7, 16e0ccdb7d, 3a2b344737, c023468faf, d13643262e, 2b62f419c2, 27024e42da, 16e6b4bfcf, 03198a2a79, fbb2532bb0, eeba084341, 60d796825e, 5bb6543346, b520c15790, 6c27bc76b9, c4c2d69fb4, 2452fa38d5, a0cc07e12f, 4605649b77, beeb4c29cb, 9389fa492b, 16292e162d, 4e6f23d35c, 6225f464da, 9a96ef055c, 31aa788a3c, 427aebd59a, 39b1472457, 654f5e8bf9, 67bb0b634a, 88a8cd6258, db4e6d9680, 168d65d5c4, a4dcf5d411, 0dc5ece5a2, 55d5f31b70, 562643e178, 92766784b0, ff94310921, 498048b1b2, d851b00ed0, b091a5bb20, da9f1c0751, 95e0123a14, 57535c5a79, c43cdb3a95, bd61f50091, dbe0ad8f21, 049566bdc9, 19e699c635, 428ef7fab4, 9dac4c605b, 3bb2c86c9a, d3bc35fdea, 15ade007a8
Four binary image files added (not shown in the diff): 120 KiB, 354 KiB, 472 KiB, and 456 KiB.
@@ -0,0 +1,35 @@ (new file: a Dockerfile)
FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime
ARG DEBIAN_FRONTEND=noninteractive

ENV CUDA_HOME=/usr/local/cuda \
    TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX" \
    SETUPTOOLS_USE_DISTUTILS=stdlib

RUN conda update conda -y

# Install libraries in the brand new image.
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    build-essential \
    git \
    python3-opencv \
    ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory for all the subsequent Dockerfile instructions.
WORKDIR /opt/program

RUN git clone https://github.com/IDEA-Research/GroundingDINO.git

RUN mkdir weights ; cd weights ; wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth ; cd ..

RUN conda install -c "nvidia/label/cuda-12.1.1" cuda -y
ENV CUDA_HOME=$CONDA_PREFIX

ENV PATH=/usr/local/cuda/bin:$PATH

RUN cd GroundingDINO/ && python -m pip install .

COPY docker_test.py docker_test.py

CMD [ "python", "docker_test.py" ]
LICENSE (2 changed lines)

@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.

-   Copyright 2020 - present, Facebook, Inc
+   Copyright 2023 - present, IDEA Research.

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
README.md (287 changed lines)
@ -1,41 +1,85 @@
|
|||
# Grounding DINO
|
||||
<div align="center">
|
||||
<img src="./.asset/grounding_dino_logo.png" width="30%">
|
||||
</div>
|
||||
|
||||
---
|
||||
# :sauropod: Grounding DINO
|
||||
|
||||
[](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) [](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
|
||||
[](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) [](https://paperswithcode.com/sota/object-detection-on-coco?p=grounding-dino-marrying-dino-with-grounded)
|
||||
|
||||
|
||||
Grounding DINO Methods | [](https://github.com/IDEA-Research/GroundingDINO)
|
||||
**[IDEA-CVR, IDEA-Research](https://github.com/IDEA-Research)**
|
||||
|
||||
[Shilong Liu](http://www.lsl.zone/), [Zhaoyang Zeng](https://scholar.google.com/citations?user=U_cvvUwAAAAJ&hl=zh-CN&oi=ao), [Tianhe Ren](https://rentainhe.github.io/), [Feng Li](https://scholar.google.com/citations?user=ybRe9GcAAAAJ&hl=zh-CN), [Hao Zhang](https://scholar.google.com/citations?user=B8hPxMQAAAAJ&hl=zh-CN), [Jie Yang](https://github.com/yangjie-cv), [Chunyuan Li](https://scholar.google.com/citations?user=Zd7WmXUAAAAJ&hl=zh-CN&oi=ao), [Jianwei Yang](https://jwyang.github.io/), [Hang Su](https://scholar.google.com/citations?hl=en&user=dxN1_X0AAAAJ&view_op=list_works&sortby=pubdate), [Jun Zhu](https://scholar.google.com/citations?hl=en&user=axsP38wAAAAJ), [Lei Zhang](https://www.leizhang.org/)<sup>:email:</sup>.
|
||||
|
||||
|
||||
[[`Paper`](https://arxiv.org/abs/2303.05499)] [[`Demo`](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)] [[`BibTex`](#black_nib-citation)]
|
||||
|
||||
|
||||
PyTorch implementation and pretrained models for Grounding DINO. For details, see the paper **[Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)**.
|
||||
|
||||
- 🔥 **[Grounded SAM 2](https://github.com/IDEA-Research/Grounded-SAM-2)** is released now, which combines Grounding DINO with [SAM 2](https://github.com/facebookresearch/segment-anything-2) for any object tracking in open-world scenarios.
|
||||
- 🔥 **[Grounding DINO 1.5](https://github.com/IDEA-Research/Grounding-DINO-1.5-API)** is released now, which is IDEA Research's **Most Capable** Open-World Object Detection Model!
|
||||
- 🔥 **[Grounding DINO](https://arxiv.org/abs/2303.05499)** and **[Grounded SAM](https://arxiv.org/abs/2401.14159)** are now supported in Huggingface. For more convenient use, you can refer to [this documentation](https://huggingface.co/docs/transformers/model_doc/grounding-dino)
|
||||
|
||||
## :sun_with_face: Helpful Tutorial
|
||||
|
||||
- :grapes: [[Read our arXiv Paper](https://arxiv.org/abs/2303.05499)]
|
||||
- :apple: [[Watch our simple introduction video on YouTube](https://youtu.be/wxWDt5UiwY8)]
|
||||
- :blossom: [[Try the Colab Demo](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)]
|
||||
- :sunflower: [[Try our Official Huggingface Demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)]
|
||||
- :maple_leaf: [[Watch the Step by Step Tutorial about GroundingDINO by Roboflow AI](https://youtu.be/cMa77r3YrDk)]
|
||||
- :mushroom: [[GroundingDINO: Automated Dataset Annotation and Evaluation by Roboflow AI](https://youtu.be/C4NqaRBz_Kw)]
|
||||
- :hibiscus: [[Accelerate Image Annotation with SAM and GroundingDINO by Roboflow AI](https://youtu.be/oEQYStnF2l8)]
|
||||
- :white_flower: [[Autodistill: Train YOLOv8 with ZERO Annotations based on Grounding-DINO and Grounded-SAM by Roboflow AI](https://github.com/autodistill/autodistill)]
|
||||
|
||||
<!-- Grounding DINO Methods |
|
||||
[](https://arxiv.org/abs/2303.05499)
|
||||
[](https://youtu.be/wxWDt5UiwY8)
|
||||
[](https://youtu.be/wxWDt5UiwY8) -->
|
||||
|
||||
Grounding DINO Demos |
|
||||
[](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)
|
||||
[](https://youtu.be/cMa77r3YrDk)
|
||||
<!-- Grounding DINO Demos |
|
||||
[](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb) -->
|
||||
<!-- [](https://youtu.be/cMa77r3YrDk)
|
||||
[](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)
|
||||
[](https://youtu.be/C4NqaRBz_Kw)
|
||||
[](https://youtu.be/oEQYStnF2l8)
|
||||
[](https://youtu.be/C4NqaRBz_Kw) -->
|
||||
|
||||
Extensions | [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb);
|
||||
[Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
|
||||
## :sparkles: Highlight Projects
|
||||
|
||||
- [Semantic-SAM: a universal image segmentation model to enable segment and recognize anything at any desired granularity.](https://github.com/UX-Decoder/Semantic-SAM),
|
||||
- [DetGPT: Detect What You Need via Reasoning](https://github.com/OptimalScale/DetGPT)
|
||||
- [Grounded-SAM: Marrying Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
|
||||
- [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb)
|
||||
- [Grounding DINO with GLIGEN for Controllable Image Editing](demo/image_editing_with_groundingdino_gligen.ipynb)
|
||||
- [OpenSeeD: A Simple and Strong Openset Segmentation Model](https://github.com/IDEA-Research/OpenSeeD)
|
||||
- [SEEM: Segment Everything Everywhere All at Once](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
|
||||
- [X-GPT: Conversational Visual Agent supported by X-Decoder](https://github.com/microsoft/X-Decoder/tree/xgpt)
|
||||
- [GLIGEN: Open-Set Grounded Text-to-Image Generation](https://github.com/gligen/GLIGEN)
|
||||
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
|
||||
|
||||
<!-- Extensions | [Grounding DINO with Segment Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything); [Grounding DINO with Stable Diffusion](demo/image_editing_with_groundingdino_stablediffusion.ipynb); [Grounding DINO with GLIGEN](demo/image_editing_with_groundingdino_gligen.ipynb) -->
|
||||
|
||||
|
||||
|
||||
[](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=grounding-dino-marrying-dino-with-grounded) \
|
||||
[](https://paperswithcode.com/sota/zero-shot-object-detection-on-odinw?p=grounding-dino-marrying-dino-with-grounded) \
|
||||
[](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=grounding-dino-marrying-dino-with-grounded) \
|
||||
[](https://paperswithcode.com/sota/object-detection-on-coco?p=grounding-dino-marrying-dino-with-grounded)
|
||||
<!-- Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now! -->
|
||||
|
||||
|
||||
|
||||
Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.05499), a stronger open-set object detector. Code is available now!
|
||||
|
||||
|
||||
## Highlight
|
||||
## :bulb: Highlight
|
||||
|
||||
- **Open-Set Detection.** Detect **everything** with language!
|
||||
- **High Performancce.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
|
||||
- **High Performance.** COCO zero-shot **52.5 AP** (training without COCO data!). COCO fine-tune **63.0 AP**.
|
||||
- **Flexible.** Collaboration with Stable Diffusion for Image Editing.
|
||||
|
||||
## News
|
||||
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything) named [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) aims to support segmentation in GroundingDINO.
|
||||
|
||||
|
||||
|
||||
## :fire: News
|
||||
- **`2023/07/18`**: We release [Semantic-SAM](https://github.com/UX-Decoder/Semantic-SAM), a universal image segmentation model to enable segment and recognize anything at any desired granularity. **Code** and **checkpoint** are available!
|
||||
- **`2023/06/17`**: We provide an example to evaluate Grounding DINO on COCO zero-shot performance.
|
||||
- **`2023/04/15`**: Refer to [CV in the Wild Readings](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings) for those who are interested in open-set recognition!
|
||||
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN) for more controllable image editing.
- **`2023/04/08`**: We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editing.
|
||||
- **`2023/04/06`**: We build a new demo by marrying GroundingDINO with [Segment-Anything](https://github.com/facebookresearch/segment-anything), named **[Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)**, which aims to support segmentation in GroundingDINO.
|
||||
- **`2023/03/28`**: A YouTube [video](https://youtu.be/cMa77r3YrDk) about Grounding DINO and basic object detection prompt engineering. [[SkalskiP](https://github.com/SkalskiP)]
|
||||
- **`2023/03/28`**: Add a [demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo) on Hugging Face Space!
|
||||
- **`2023/03/27`**: Support CPU-only mode. Now the model can run on machines without GPUs.
|
||||
|
@ -46,44 +90,184 @@ Official PyTorch implementation of [Grounding DINO](https://arxiv.org/abs/2303.0
|
|||
<summary><font size="4">
|
||||
Description
|
||||
</font></summary>
|
||||
<a href="https://arxiv.org/abs/2303.05499">Paper</a> introduction.
|
||||
<img src=".asset/hero_figure.png" alt="ODinW" width="100%">
|
||||
Marrying <a href="https://github.com/IDEA-Research/GroundingDINO">Grounding DINO</a> and <a href="https://github.com/gligen/GLIGEN">GLIGEN</a>
|
||||
<img src="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/GD_GLIGEN.png" alt="gd_gligen" width="100%">
|
||||
</details>
|
||||
|
||||
## :star: Explanations/Tips for Grounding DINO Inputs and Outputs
|
||||
- Grounding DINO accepts an `(image, text)` pair as inputs.
|
||||
- It outputs `900` (by default) object boxes. Each box has similarity scores across all input words (as shown in the figures below).
|
||||
- By default, we choose the boxes whose highest similarities are higher than a `box_threshold`.
|
||||
- We extract the words whose similarities are higher than the `text_threshold` as predicted labels.
|
||||
- If you want to obtain objects for specific phrases, like the `dogs` in the sentence `two dogs with a stick.`, you can select the boxes with the highest text similarity to `dogs` as the final outputs (see the sketch after the figures below).
|
||||
- Note that each word can be split into **more than one** token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
|
||||
- We suggest separating different category names with `.` for Grounding DINO.
|
||||

|
||||

|
||||
|
||||
|
||||
## TODO
|
||||
## :label: TODO
|
||||
|
||||
- [x] Release inference code and demo.
|
||||
- [x] Release checkpoints.
|
||||
- [ ] Grounding DINO with Stable Diffusion and GLIGEN demos.
|
||||
- [x] Grounding DINO with Stable Diffusion and GLIGEN demos.
|
||||
- [ ] Release training codes.
|
||||
|
||||
## Install
|
||||
## :hammer_and_wrench: Install
|
||||
|
||||
If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. It will be compiled under CPU-only mode if no CUDA available.
|
||||
**Note:**
|
||||
|
||||
0. If you have a CUDA environment, please make sure the environment variable `CUDA_HOME` is set. It will be compiled in CPU-only mode if CUDA is not available.
|
||||
|
||||
Please make sure you follow the installation steps strictly; otherwise the program may produce:
|
||||
```bash
|
||||
NameError: name '_C' is not defined
|
||||
```
|
||||
|
||||
If this happens, please reinstall GroundingDINO: re-clone the repository and run all the installation steps again.
|
||||
|
||||
#### How to check CUDA:
|
||||
```bash
|
||||
echo $CUDA_HOME
|
||||
```
|
||||
If it prints nothing, it means you haven't set up the path.
|
||||
|
||||
Run the following so the environment variable is set in the current shell.
|
||||
```bash
|
||||
export CUDA_HOME=/path/to/cuda-11.3
|
||||
```
|
||||
|
||||
Note that the CUDA version should match your CUDA runtime, since multiple CUDA versions may coexist on the same machine.
|
||||
|
||||
If you want to set the CUDA_HOME permanently, store it using:
|
||||
|
||||
```bash
|
||||
echo 'export CUDA_HOME=/path/to/cuda' >> ~/.bashrc
|
||||
```
|
||||
After that, source the bashrc file and check `CUDA_HOME`:
|
||||
```bash
|
||||
source ~/.bashrc
|
||||
echo $CUDA_HOME
|
||||
```
|
||||
|
||||
In this example, `/path/to/cuda-11.3` should be replaced with the path where your CUDA toolkit is installed. You can find this by typing `which nvcc` in your terminal:
|
||||
|
||||
For instance, if the output is `/usr/local/cuda/bin/nvcc`, then:
|
||||
```bash
|
||||
export CUDA_HOME=/usr/local/cuda
|
||||
```
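As an optional extra check (not in the original instructions), you can also confirm from Python that PyTorch itself sees a usable CUDA setup:

```python
# Optional sanity check: does PyTorch see a GPU, and which CUDA version was it built with?
import torch

print(torch.cuda.is_available())  # True if a usable GPU and CUDA runtime are present
print(torch.version.cuda)         # CUDA version PyTorch was compiled against (None for CPU-only builds)
```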
|
||||
**Installation:**
|
||||
|
||||
1. Clone the GroundingDINO repository from GitHub.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/IDEA-Research/GroundingDINO.git
|
||||
```
|
||||
|
||||
2. Change the current directory to the GroundingDINO folder.
|
||||
|
||||
```bash
|
||||
cd GroundingDINO/
|
||||
```
|
||||
|
||||
3. Install the required dependencies in the current directory.
|
||||
|
||||
```bash
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
## Demo
|
||||
4. Download pre-trained model weights.
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=6 python demo/inference_on_a_image.py \
|
||||
-c /path/to/config \
|
||||
-p /path/to/checkpoint \
|
||||
-i .asset/cats.png \
|
||||
-o "outputs/0" \
|
||||
-t "cat ear." \
|
||||
[--cpu-only] # open it for cpu mode
|
||||
mkdir weights
|
||||
cd weights
|
||||
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
|
||||
cd ..
|
||||
```
|
||||
|
||||
## :arrow_forward: Demo
|
||||
Check your GPU ID (only if you're using a GPU)
|
||||
|
||||
```bash
|
||||
nvidia-smi
|
||||
```
|
||||
Replace `{GPU ID}`, `image_you_want_to_detect.jpg`, and `"dir you want to save the output"` with appropriate values in the following command
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
|
||||
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
|
||||
-p weights/groundingdino_swint_ogc.pth \
|
||||
-i image_you_want_to_detect.jpg \
|
||||
-o "dir you want to save the output" \
|
||||
-t "chair"
|
||||
[--cpu-only] # add this flag to run in CPU-only mode
|
||||
```
|
||||
|
||||
If you would like to specify the phrases to detect, here is a demo:
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
|
||||
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
|
||||
-p ./groundingdino_swint_ogc.pth \
|
||||
-i .asset/cat_dog.jpeg \
|
||||
-o logs/1111 \
|
||||
-t "There is a cat and a dog in the image ." \
|
||||
--token_spans "[[[9, 10], [11, 14]], [[19, 20], [21, 24]]]"
|
||||
[--cpu-only] # add this flag to run in CPU-only mode
|
||||
```
|
||||
The `token_spans` specify the start and end positions of phrases. For example, the first phrase is `[[9, 10], [11, 14]]`: `"There is a cat and a dog in the image ."[9:10]` is `'a'` and `"There is a cat and a dog in the image ."[11:14]` is `'cat'`, so it refers to the phrase `a cat`. Similarly, `[[19, 20], [21, 24]]` refers to the phrase `a dog`.
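As a quick illustration (not part of the original README), you can verify the spans directly in Python; this mirrors how `demo/inference_on_a_image.py` joins the character slices into a phrase:

```python
caption = "There is a cat and a dog in the image ."

for span in [[[9, 10], [11, 14]], [[19, 20], [21, 24]]]:
    phrase = " ".join(caption[start:end] for start, end in span)
    print(span, "->", phrase)  # prints "a cat", then "a dog"
```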
|
||||
|
||||
See the `demo/inference_on_a_image.py` for more details.
|
||||
|
||||
**Running with Python:**
|
||||
|
||||
```python
|
||||
from groundingdino.util.inference import load_model, load_image, predict, annotate
|
||||
import cv2
|
||||
|
||||
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
|
||||
IMAGE_PATH = "weights/dog-3.jpeg"
|
||||
TEXT_PROMPT = "chair . person . dog ."
|
||||
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
|
||||
)
|
||||
|
||||
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
|
||||
cv2.imwrite("annotated_image.jpg", annotated_frame)
|
||||
```
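One small follow-up that is not in the original README: assuming the boxes returned by `predict` follow the model's normalized `(cx, cy, w, h)` convention (the same one the evaluation script converts with `box_ops.box_cxcywh_to_xyxy`), and that `image_source` is the H x W x 3 array returned by `load_image` above, you can obtain absolute xyxy pixel coordinates like this:

```python
import torch
from groundingdino.util import box_ops

h, w, _ = image_source.shape  # image_source from load_image above
boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])
print(boxes_xyxy)
```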
|
||||
**Web UI**
|
||||
|
||||
We also provide demo code to integrate Grounding DINO with a Gradio Web UI. See the file `demo/gradio_app.py` for more details.
|
||||
|
||||
## Checkpoints
|
||||
**Notebooks**
|
||||
|
||||
- We release [demos](demo/image_editing_with_groundingdino_gligen.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [GLIGEN](https://github.com/gligen/GLIGEN) for more controllable image editing.
- We release [demos](demo/image_editing_with_groundingdino_stablediffusion.ipynb) to combine [Grounding DINO](https://arxiv.org/abs/2303.05499) with [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) for image editing.
|
||||
|
||||
## COCO Zero-shot Evaluations
|
||||
|
||||
We provide an example to evaluate Grounding DINO zero-shot performance on COCO. The results should be **48.5**.
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0 \
|
||||
python demo/test_ap_on_coco.py \
|
||||
-c groundingdino/config/GroundingDINO_SwinT_OGC.py \
|
||||
-p weights/groundingdino_swint_ogc.pth \
|
||||
--anno_path /path/to/annotations/instances_val2017.json \
--image_dir /path/to/val2017
|
||||
```
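For reference (an illustration added here, not in the original README), the evaluation command above builds its text prompt by joining all COCO category names with `" . "`, exactly as the `main` function of `demo/test_ap_on_coco.py` does:

```python
# Build the evaluation prompt from category names, as test_ap_on_coco.py does.
cat_list = ["person", "bicycle", "car"]  # a small subset of COCO categories for brevity
caption = " . ".join(cat_list) + " ."
print("Input text prompt:", caption)  # person . bicycle . car .
```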
|
||||
|
||||
|
||||
## :luggage: Checkpoints
|
||||
|
||||
<!-- insert a table -->
|
||||
<table>
|
||||
|
@ -105,13 +289,22 @@ We also provide a demo code to integrate Grounding DINO with Gradio Web UI. See
|
|||
<td>Swin-T</td>
|
||||
<td>O365,GoldG,Cap4M</td>
|
||||
<td>48.4 (zero-shot) / 57.2 (fine-tune)</td>
|
||||
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">Github link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
|
||||
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth">GitHub link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swint_ogc.pth">HF link</a></td>
|
||||
<td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinT_OGC.py">link</a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>2</th>
|
||||
<td>GroundingDINO-B</td>
|
||||
<td>Swin-B</td>
|
||||
<td>COCO,O365,GoldG,Cap4M,OpenImage,ODinW-35,RefCOCO</td>
|
||||
<td>56.7 </td>
|
||||
<td><a href="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth">GitHub link</a> | <a href="https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/groundingdino_swinb_cogcoor.pth">HF link</a>
|
||||
<td><a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/config/GroundingDINO_SwinB_cfg.py">link</a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
## Results
|
||||
## :medal_military: Results
|
||||
|
||||
<details open>
|
||||
<summary><font size="4">
|
||||
|
@ -131,26 +324,27 @@ ODinW Object Detection Results
|
|||
<summary><font size="4">
|
||||
Marrying Grounding DINO with <a href="https://github.com/Stability-AI/StableDiffusion">Stable Diffusion</a> for Image Editing
|
||||
</font></summary>
|
||||
See our example: demo/image_editing_with_groundingdino_stablediffusion.ipynb .
|
||||
See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_stablediffusion.ipynb">notebook</a> for more details.
|
||||
<img src=".asset/GD_SD.png" alt="GD_SD" width="100%">
|
||||
</details>
|
||||
|
||||
|
||||
<details open>
|
||||
<summary><font size="4">
|
||||
Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing
|
||||
Marrying Grounding DINO with <a href="https://github.com/gligen/GLIGEN">GLIGEN</a> for more Detailed Image Editing.
|
||||
</font></summary>
|
||||
See our example <a href="https://github.com/IDEA-Research/GroundingDINO/blob/main/demo/image_editing_with_groundingdino_gligen.ipynb">notebook</a> for more details.
|
||||
<img src=".asset/GD_GLIGEN.png" alt="GD_GLIGEN" width="100%">
|
||||
</details>
|
||||
|
||||
## Model
|
||||
## :sauropod: Model: Grounding DINO
|
||||
|
||||
It includes a text backbone, an image backbone, a feature enhancer, a language-guided query selection module, and a cross-modality decoder.
|
||||
|
||||

|
||||
|
||||
|
||||
## Acknowledgement
|
||||
## :hearts: Acknowledgement
|
||||
|
||||
Our model is related to [DINO](https://github.com/IDEA-Research/DINO) and [GLIP](https://github.com/microsoft/GLIP). Thanks for their great work!
|
||||
|
||||
|
@ -159,14 +353,15 @@ We also thank great previous work including DETR, Deformable DETR, SMCA, Conditi
|
|||
Thanks [Stable Diffusion](https://github.com/Stability-AI/StableDiffusion) and [GLIGEN](https://github.com/gligen/GLIGEN) for their awesome models.
|
||||
|
||||
|
||||
## Citation
|
||||
## :black_nib: Citation
|
||||
|
||||
If you find our work helpful for your research, please consider citing the following BibTeX entry.
|
||||
|
||||
```bibtex
|
||||
@inproceedings{ShilongLiu2023GroundingDM,
|
||||
title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
|
||||
author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
|
||||
@article{liu2023grounding,
|
||||
title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
|
||||
author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
|
||||
journal={arXiv preprint arXiv:2303.05499},
|
||||
year={2023}
|
||||
}
|
||||
```
|
||||
|
|
|
@@ -16,7 +16,7 @@ import torch
 # prepare the environment
 os.system("python setup.py build develop --user")
 os.system("pip install packaging==21.3")
-os.system("pip install gradio")
+os.system("pip install gradio==3.50.2")


 warnings.filterwarnings("ignore")
File diff suppressed because one or more lines are too long
|
@ -11,6 +11,7 @@ from groundingdino.models import build_model
|
|||
from groundingdino.util import box_ops
|
||||
from groundingdino.util.slconfig import SLConfig
|
||||
from groundingdino.util.utils import clean_state_dict, get_phrases_from_posmap
|
||||
from groundingdino.util.vl_utils import create_positive_map_from_span
|
||||
|
||||
|
||||
def plot_boxes_to_image(image_pil, tgt):
|
||||
|
@ -80,7 +81,8 @@ def load_model(model_config_path, model_checkpoint_path, cpu_only=False):
|
|||
return model
|
||||
|
||||
|
||||
def get_grounding_output(model, image, caption, box_threshold, text_threshold, with_logits=True, cpu_only=False):
|
||||
def get_grounding_output(model, image, caption, box_threshold, text_threshold=None, with_logits=True, cpu_only=False, token_spans=None):
|
||||
assert text_threshold is not None or token_spans is not None, "text_threshold and token_spans should not both be None!"
|
||||
caption = caption.lower()
|
||||
caption = caption.strip()
|
||||
if not caption.endswith("."):
|
||||
|
@ -90,29 +92,56 @@ def get_grounding_output(model, image, caption, box_threshold, text_threshold, w
|
|||
image = image.to(device)
|
||||
with torch.no_grad():
|
||||
outputs = model(image[None], captions=[caption])
|
||||
logits = outputs["pred_logits"].cpu().sigmoid()[0] # (nq, 256)
|
||||
boxes = outputs["pred_boxes"].cpu()[0] # (nq, 4)
|
||||
logits.shape[0]
|
||||
logits = outputs["pred_logits"].sigmoid()[0] # (nq, 256)
|
||||
boxes = outputs["pred_boxes"][0] # (nq, 4)
|
||||
|
||||
# filter output
|
||||
logits_filt = logits.clone()
|
||||
boxes_filt = boxes.clone()
|
||||
filt_mask = logits_filt.max(dim=1)[0] > box_threshold
|
||||
logits_filt = logits_filt[filt_mask] # num_filt, 256
|
||||
boxes_filt = boxes_filt[filt_mask] # num_filt, 4
|
||||
logits_filt.shape[0]
|
||||
if token_spans is None:
|
||||
logits_filt = logits.cpu().clone()
|
||||
boxes_filt = boxes.cpu().clone()
|
||||
filt_mask = logits_filt.max(dim=1)[0] > box_threshold
|
||||
logits_filt = logits_filt[filt_mask] # num_filt, 256
|
||||
boxes_filt = boxes_filt[filt_mask] # num_filt, 4
|
||||
|
||||
# get phrase
|
||||
tokenlizer = model.tokenizer
|
||||
tokenized = tokenlizer(caption)
|
||||
# build pred
|
||||
pred_phrases = []
|
||||
for logit, box in zip(logits_filt, boxes_filt):
|
||||
pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
|
||||
if with_logits:
|
||||
pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
|
||||
else:
|
||||
pred_phrases.append(pred_phrase)
|
||||
else:
|
||||
# given-phrase mode
|
||||
positive_maps = create_positive_map_from_span(
|
||||
model.tokenizer(caption),  # use the processed caption; `text_prompt` is only defined in __main__
|
||||
token_span=token_spans
|
||||
).to(image.device) # n_phrase, 256
|
||||
|
||||
logits_for_phrases = positive_maps @ logits.T # n_phrase, nq
|
||||
all_logits = []
|
||||
all_phrases = []
|
||||
all_boxes = []
|
||||
for (token_span, logit_phr) in zip(token_spans, logits_for_phrases):
|
||||
# get phrase
|
||||
phrase = ' '.join([caption[_s:_e] for (_s, _e) in token_span])
|
||||
# get mask
|
||||
filt_mask = logit_phr > box_threshold
|
||||
# filt box
|
||||
all_boxes.append(boxes[filt_mask])
|
||||
# filt logits
|
||||
all_logits.append(logit_phr[filt_mask])
|
||||
if with_logits:
|
||||
logit_phr_num = logit_phr[filt_mask]
|
||||
all_phrases.extend([phrase + f"({str(logit.item())[:4]})" for logit in logit_phr_num])
|
||||
else:
|
||||
all_phrases.extend([phrase for _ in range(len(filt_mask))])
|
||||
boxes_filt = torch.cat(all_boxes, dim=0).cpu()
|
||||
pred_phrases = all_phrases
|
||||
|
||||
# get phrase
|
||||
tokenlizer = model.tokenizer
|
||||
tokenized = tokenlizer(caption)
|
||||
# build pred
|
||||
pred_phrases = []
|
||||
for logit, box in zip(logits_filt, boxes_filt):
|
||||
pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, tokenlizer)
|
||||
if with_logits:
|
||||
pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})")
|
||||
else:
|
||||
pred_phrases.append(pred_phrase)
|
||||
|
||||
return boxes_filt, pred_phrases
|
||||
|
||||
|
@ -132,6 +161,12 @@ if __name__ == "__main__":
|
|||
|
||||
parser.add_argument("--box_threshold", type=float, default=0.3, help="box threshold")
|
||||
parser.add_argument("--text_threshold", type=float, default=0.25, help="text threshold")
|
||||
parser.add_argument("--token_spans", type=str, default=None, help=
|
||||
"The positions of start and end positions of phrases of interest. \
|
||||
For example, a caption is 'a cat and a dog', \
|
||||
if you would like to detect 'cat', the token_spans should be '[[[2, 5]], ]', since 'a cat and a dog'[2:5] is 'cat'. \
|
||||
if you would like to detect 'a cat', the token_spans should be '[[[0, 1], [2, 5]], ]', since 'a cat and a dog'[0:1] is 'a', and 'a cat and a dog'[2:5] is 'cat'. \
|
||||
")
|
||||
|
||||
parser.add_argument("--cpu-only", action="store_true", help="running on cpu only!, default=False")
|
||||
args = parser.parse_args()
|
||||
|
@ -143,7 +178,8 @@ if __name__ == "__main__":
|
|||
text_prompt = args.text_prompt
|
||||
output_dir = args.output_dir
|
||||
box_threshold = args.box_threshold
|
||||
text_threshold = args.box_threshold
|
||||
text_threshold = args.text_threshold
|
||||
token_spans = args.token_spans
|
||||
|
||||
# make dir
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
@ -155,9 +191,15 @@ if __name__ == "__main__":
|
|||
# visualize raw image
|
||||
image_pil.save(os.path.join(output_dir, "raw_image.jpg"))
|
||||
|
||||
# set the text_threshold to None if token_spans is set.
|
||||
if token_spans is not None:
|
||||
text_threshold = None
|
||||
print("Using token_spans. Set the text_threshold to None.")
|
||||
|
||||
|
||||
# run model
|
||||
boxes_filt, pred_phrases = get_grounding_output(
|
||||
model, image, text_prompt, box_threshold, text_threshold, cpu_only=args.cpu_only
|
||||
model, image, text_prompt, box_threshold, text_threshold, cpu_only=args.cpu_only, token_spans=eval(f"{token_spans}")
|
||||
)
|
||||
|
||||
# visualize pred
|
||||
|
|
|
@ -0,0 +1,233 @@
|
|||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import DataLoader, DistributedSampler
|
||||
|
||||
from groundingdino.models import build_model
|
||||
import groundingdino.datasets.transforms as T
|
||||
from groundingdino.util import box_ops, get_tokenlizer
|
||||
from groundingdino.util.misc import clean_state_dict, collate_fn
|
||||
from groundingdino.util.slconfig import SLConfig
|
||||
|
||||
# from torchvision.datasets import CocoDetection
|
||||
import torchvision
|
||||
|
||||
from groundingdino.util.vl_utils import build_captions_and_token_span, create_positive_map_from_span
|
||||
from groundingdino.datasets.cocogrounding_eval import CocoGroundingEvaluator
|
||||
|
||||
|
||||
def load_model(model_config_path: str, model_checkpoint_path: str, device: str = "cuda"):
|
||||
args = SLConfig.fromfile(model_config_path)
|
||||
args.device = device
|
||||
model = build_model(args)
|
||||
checkpoint = torch.load(model_checkpoint_path, map_location="cpu")
|
||||
model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
|
||||
model.eval()
|
||||
return model
|
||||
|
||||
|
||||
class CocoDetection(torchvision.datasets.CocoDetection):
|
||||
def __init__(self, img_folder, ann_file, transforms):
|
||||
super().__init__(img_folder, ann_file)
|
||||
self._transforms = transforms
|
||||
|
||||
def __getitem__(self, idx):
|
||||
img, target = super().__getitem__(idx) # target: list
|
||||
|
||||
# import ipdb; ipdb.set_trace()
|
||||
|
||||
w, h = img.size
|
||||
boxes = [obj["bbox"] for obj in target]
|
||||
boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
|
||||
boxes[:, 2:] += boxes[:, :2] # xywh -> xyxy
|
||||
boxes[:, 0::2].clamp_(min=0, max=w)
|
||||
boxes[:, 1::2].clamp_(min=0, max=h)
|
||||
# filt invalid boxes/masks/keypoints
|
||||
keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
|
||||
boxes = boxes[keep]
|
||||
|
||||
target_new = {}
|
||||
image_id = self.ids[idx]
|
||||
target_new["image_id"] = image_id
|
||||
target_new["boxes"] = boxes
|
||||
target_new["orig_size"] = torch.as_tensor([int(h), int(w)])
|
||||
|
||||
if self._transforms is not None:
|
||||
img, target = self._transforms(img, target_new)
|
||||
|
||||
return img, target
|
||||
|
||||
|
||||
class PostProcessCocoGrounding(nn.Module):
|
||||
""" This module converts the model's output into the format expected by the coco api"""
|
||||
|
||||
def __init__(self, num_select=300, coco_api=None, tokenlizer=None) -> None:
|
||||
super().__init__()
|
||||
self.num_select = num_select
|
||||
|
||||
assert coco_api is not None
|
||||
category_dict = coco_api.dataset['categories']
|
||||
cat_list = [item['name'] for item in category_dict]
|
||||
captions, cat2tokenspan = build_captions_and_token_span(cat_list, True)
|
||||
tokenspanlist = [cat2tokenspan[cat] for cat in cat_list]
|
||||
positive_map = create_positive_map_from_span(
|
||||
tokenlizer(captions), tokenspanlist) # 80, 256. normed
|
||||
|
||||
id_map = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16, 15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31, 27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43, 39: 44, 40: 46,
|
||||
41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56, 51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72, 63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85, 75: 86, 76: 87, 77: 88, 78: 89, 79: 90}
|
||||
|
||||
# build a mapping from label_id to pos_map
|
||||
new_pos_map = torch.zeros((91, 256))
|
||||
for k, v in id_map.items():
|
||||
new_pos_map[v] = positive_map[k]
|
||||
self.positive_map = new_pos_map
|
||||
|
||||
@torch.no_grad()
|
||||
def forward(self, outputs, target_sizes, not_to_xyxy=False):
|
||||
""" Perform the computation
|
||||
Parameters:
|
||||
outputs: raw outputs of the model
|
||||
target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
|
||||
For evaluation, this must be the original image size (before any data augmentation)
|
||||
For visualization, this should be the image size after data augment, but before padding
|
||||
"""
|
||||
num_select = self.num_select
|
||||
out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']
|
||||
|
||||
# pos map to logit
|
||||
prob_to_token = out_logits.sigmoid() # bs, 100, 256
|
||||
pos_maps = self.positive_map.to(prob_to_token.device)
|
||||
# (bs, 100, 256) @ (91, 256).T -> (bs, 100, 91)
|
||||
prob_to_label = prob_to_token @ pos_maps.T
|
||||
|
||||
# if os.environ.get('IPDB_SHILONG_DEBUG', None) == 'INFO':
|
||||
# import ipdb; ipdb.set_trace()
|
||||
|
||||
assert len(out_logits) == len(target_sizes)
|
||||
assert target_sizes.shape[1] == 2
|
||||
|
||||
prob = prob_to_label
|
||||
topk_values, topk_indexes = torch.topk(
|
||||
prob.view(out_logits.shape[0], -1), num_select, dim=1)
|
||||
scores = topk_values
|
||||
topk_boxes = topk_indexes // prob.shape[2]
|
||||
labels = topk_indexes % prob.shape[2]
|
||||
|
||||
if not_to_xyxy:
|
||||
boxes = out_bbox
|
||||
else:
|
||||
boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
|
||||
|
||||
boxes = torch.gather(
|
||||
boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
|
||||
|
||||
# and from relative [0, 1] to absolute [0, height] coordinates
|
||||
img_h, img_w = target_sizes.unbind(1)
|
||||
scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
|
||||
boxes = boxes * scale_fct[:, None, :]
|
||||
|
||||
results = [{'scores': s, 'labels': l, 'boxes': b}
|
||||
for s, l, b in zip(scores, labels, boxes)]
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def main(args):
|
||||
# config
|
||||
cfg = SLConfig.fromfile(args.config_file)
|
||||
|
||||
# build model
|
||||
model = load_model(args.config_file, args.checkpoint_path)
|
||||
model = model.to(args.device)
|
||||
model = model.eval()
|
||||
|
||||
# build dataloader
|
||||
transform = T.Compose(
|
||||
[
|
||||
T.RandomResize([800], max_size=1333),
|
||||
T.ToTensor(),
|
||||
T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
|
||||
]
|
||||
)
|
||||
dataset = CocoDetection(
|
||||
args.image_dir, args.anno_path, transforms=transform)
|
||||
data_loader = DataLoader(
|
||||
dataset, batch_size=1, shuffle=False, num_workers=args.num_workers, collate_fn=collate_fn)
|
||||
|
||||
# build post processor
|
||||
tokenlizer = get_tokenlizer.get_tokenlizer(cfg.text_encoder_type)
|
||||
postprocessor = PostProcessCocoGrounding(
|
||||
coco_api=dataset.coco, tokenlizer=tokenlizer)
|
||||
|
||||
# build evaluator
|
||||
evaluator = CocoGroundingEvaluator(
|
||||
dataset.coco, iou_types=("bbox",), useCats=True)
|
||||
|
||||
# build captions
|
||||
category_dict = dataset.coco.dataset['categories']
|
||||
cat_list = [item['name'] for item in category_dict]
|
||||
caption = " . ".join(cat_list) + ' .'
|
||||
print("Input text prompt:", caption)
|
||||
|
||||
# run inference
|
||||
start = time.time()
|
||||
for i, (images, targets) in enumerate(data_loader):
|
||||
# get images and captions
|
||||
images = images.tensors.to(args.device)
|
||||
bs = images.shape[0]
|
||||
input_captions = [caption] * bs
|
||||
|
||||
# feed to the model
|
||||
outputs = model(images, captions=input_captions)
|
||||
|
||||
orig_target_sizes = torch.stack(
|
||||
[t["orig_size"] for t in targets], dim=0).to(images.device)
|
||||
results = postprocessor(outputs, orig_target_sizes)
|
||||
cocogrounding_res = {
|
||||
target["image_id"]: output for target, output in zip(targets, results)}
|
||||
evaluator.update(cocogrounding_res)
|
||||
|
||||
if (i+1) % 30 == 0:
|
||||
used_time = time.time() - start
|
||||
eta = len(data_loader) / (i+1e-5) * used_time - used_time
|
||||
print(
|
||||
f"processed {i}/{len(data_loader)} images. time: {used_time:.2f}s, ETA: {eta:.2f}s")
|
||||
|
||||
evaluator.synchronize_between_processes()
|
||||
evaluator.accumulate()
|
||||
evaluator.summarize()
|
||||
|
||||
print("Final results:", evaluator.coco_eval["bbox"].stats.tolist())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(
|
||||
"Grounding DINO eval on COCO", add_help=True)
|
||||
# load model
|
||||
parser.add_argument("--config_file", "-c", type=str,
|
||||
required=True, help="path to config file")
|
||||
parser.add_argument(
|
||||
"--checkpoint_path", "-p", type=str, required=True, help="path to checkpoint file"
|
||||
)
|
||||
parser.add_argument("--device", type=str, default="cuda",
|
||||
help="running device (default: cuda)")
|
||||
|
||||
# post processing
|
||||
parser.add_argument("--num_select", type=int, default=300,
|
||||
help="number of topk to select")
|
||||
|
||||
# coco info
|
||||
parser.add_argument("--anno_path", type=str,
|
||||
required=True, help="coco root")
|
||||
parser.add_argument("--image_dir", type=str,
|
||||
required=True, help="coco image dir")
|
||||
parser.add_argument("--num_workers", type=int, default=4,
|
||||
help="number of workers for dataloader")
|
||||
args = parser.parse_args()
|
||||
|
||||
main(args)
|
|
@@ -0,0 +1,8 @@ (new file: docker_test.py, the smoke test copied by the Dockerfile above)
from groundingdino.util.inference import load_model, load_image, predict, annotate
import torch
import cv2

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
model = model.to('cuda:0')
print(torch.cuda.is_available())
print('DONE!')
|
@ -0,0 +1,248 @@
|
|||
name: dino
|
||||
channels:
|
||||
- pytorch
|
||||
- nvidia
|
||||
- conda-forge
|
||||
- defaults
|
||||
dependencies:
|
||||
- addict=2.4.0=pyhd8ed1ab_2
|
||||
- aiohttp=3.8.5=py39ha55989b_0
|
||||
- aiosignal=1.3.1=pyhd8ed1ab_0
|
||||
- asttokens=2.0.5=pyhd3eb1b0_0
|
||||
- async-timeout=4.0.3=pyhd8ed1ab_0
|
||||
- attrs=23.1.0=pyh71513ae_1
|
||||
- aws-c-auth=0.7.0=h6f3c987_2
|
||||
- aws-c-cal=0.6.0=h6ba3258_0
|
||||
- aws-c-common=0.8.23=hcfcfb64_0
|
||||
- aws-c-compression=0.2.17=h420beca_1
|
||||
- aws-c-event-stream=0.3.1=had47b81_1
|
||||
- aws-c-http=0.7.11=h72ba615_0
|
||||
- aws-c-io=0.13.28=ha35c040_0
|
||||
- aws-c-mqtt=0.8.14=h4941efa_2
|
||||
- aws-c-s3=0.3.13=he04eaa7_2
|
||||
- aws-c-sdkutils=0.1.11=h420beca_1
|
||||
- aws-checksums=0.1.16=h420beca_1
|
||||
- aws-crt-cpp=0.20.3=h247a981_4
|
||||
- aws-sdk-cpp=1.10.57=h1a0519f_17
|
||||
- backcall=0.2.0=pyhd3eb1b0_0
|
||||
- blas=2.118=mkl
|
||||
- blas-devel=3.9.0=18_win64_mkl
|
||||
- brotli=1.0.9=hcfcfb64_9
|
||||
- brotli-bin=1.0.9=hcfcfb64_9
|
||||
- brotli-python=1.0.9=py39h99910a6_9
|
||||
- bzip2=1.0.8=h8ffe710_4
|
||||
- c-ares=1.19.1=hcfcfb64_0
|
||||
- ca-certificates=2023.08.22=haa95532_0
|
||||
- certifi=2023.7.22=py39haa95532_0
|
||||
- charset-normalizer=3.2.0=pyhd8ed1ab_0
|
||||
- click=8.1.7=win_pyh7428d3b_0
|
||||
- colorama=0.4.6=pyhd8ed1ab_0
|
||||
- comm=0.1.2=py39haa95532_0
|
||||
- contourpy=1.1.1=py39h1f6ef14_1
|
||||
- cuda-cccl=12.2.140=0
|
||||
- cuda-cudart=11.8.89=0
|
||||
- cuda-cudart-dev=11.8.89=0
|
||||
- cuda-cupti=11.8.87=0
|
||||
- cuda-libraries=11.8.0=0
|
||||
- cuda-libraries-dev=11.8.0=0
|
||||
- cuda-nvrtc=11.8.89=0
|
||||
- cuda-nvrtc-dev=11.8.89=0
|
||||
- cuda-nvtx=11.8.86=0
|
||||
- cuda-profiler-api=12.2.140=0
|
||||
- cuda-runtime=11.8.0=0
|
||||
- cycler=0.11.0=pyhd8ed1ab_0
|
||||
- cython=3.0.0=py39h2bbff1b_0
|
||||
- dataclasses=0.8=pyhc8e2a94_3
|
||||
- datasets=2.14.5=pyhd8ed1ab_0
|
||||
- debugpy=1.6.7=py39hd77b12b_0
|
||||
- decorator=5.1.1=pyhd3eb1b0_0
|
||||
- dill=0.3.7=pyhd8ed1ab_0
|
||||
- exceptiongroup=1.0.4=py39haa95532_0
|
||||
- executing=0.8.3=pyhd3eb1b0_0
|
||||
- filelock=3.12.4=pyhd8ed1ab_0
|
||||
- fonttools=4.42.1=py39ha55989b_0
|
||||
- freeglut=3.2.2=h63175ca_2
|
||||
- freetype=2.12.1=hdaf720e_2
|
||||
- frozenlist=1.4.0=py39ha55989b_1
|
||||
- fsspec=2023.6.0=pyh1a96a4e_0
|
||||
- gettext=0.21.1=h5728263_0
|
||||
- glib=2.78.0=h12be248_0
|
||||
- glib-tools=2.78.0=h12be248_0
|
||||
- gst-plugins-base=1.22.6=h001b923_1
|
||||
- gstreamer=1.22.6=hb4038d2_1
|
||||
- huggingface_hub=0.17.3=pyhd8ed1ab_0
|
||||
- icu=70.1=h0e60522_0
|
||||
- idna=3.4=pyhd8ed1ab_0
|
||||
- importlib-metadata=6.8.0=pyha770c72_0
|
||||
- importlib-resources=6.1.0=pyhd8ed1ab_0
|
||||
- importlib_metadata=6.8.0=hd8ed1ab_0
|
||||
- importlib_resources=6.1.0=pyhd8ed1ab_0
|
||||
- intel-openmp=2023.2.0=h57928b3_49503
|
||||
- ipykernel=6.25.0=py39h9909e9c_0
|
||||
- ipython=8.15.0=py39haa95532_0
|
||||
- jasper=2.0.33=hc2e4405_1
|
||||
- jedi=0.18.1=py39haa95532_1
|
||||
- jinja2=3.1.2=pyhd8ed1ab_1
|
||||
- joblib=1.3.2=pyhd8ed1ab_0
|
||||
- jpeg=9e=hcfcfb64_3
|
||||
- jupyter_client=8.1.0=py39haa95532_0
|
||||
- jupyter_core=5.3.0=py39haa95532_0
|
||||
- kiwisolver=1.4.5=py39h1f6ef14_1
|
||||
- krb5=1.20.1=heb0366b_0
|
||||
- lcms2=2.14=h90d422f_0
|
||||
- lerc=4.0.0=h63175ca_0
|
||||
- libabseil=20230125.3=cxx17_h63175ca_0
|
||||
- libarrow=12.0.1=h12e5d06_5_cpu
|
||||
- libblas=3.9.0=18_win64_mkl
|
||||
- libbrotlicommon=1.0.9=hcfcfb64_9
|
||||
- libbrotlidec=1.0.9=hcfcfb64_9
|
||||
- libbrotlienc=1.0.9=hcfcfb64_9
|
||||
- libcblas=3.9.0=18_win64_mkl
|
||||
- libclang=15.0.7=default_h77d9078_3
|
||||
- libclang13=15.0.7=default_h77d9078_3
|
||||
- libcrc32c=1.1.2=h0e60522_0
|
||||
- libcublas=11.11.3.6=0
|
||||
- libcublas-dev=11.11.3.6=0
|
||||
- libcufft=10.9.0.58=0
|
||||
- libcufft-dev=10.9.0.58=0
|
||||
- libcurand=10.3.3.141=0
|
||||
- libcurand-dev=10.3.3.141=0
|
||||
- libcurl=8.1.2=h68f0423_0
|
||||
- libcusolver=11.4.1.48=0
|
||||
- libcusolver-dev=11.4.1.48=0
|
||||
- libcusparse=11.7.5.86=0
|
||||
- libcusparse-dev=11.7.5.86=0
|
||||
- libdeflate=1.14=hcfcfb64_0
|
||||
- libevent=2.1.12=h3671451_1
|
||||
- libffi=3.4.2=h8ffe710_5
|
||||
- libglib=2.78.0=he8f3873_0
|
||||
- libgoogle-cloud=2.12.0=h00b2bdc_1
|
||||
- libgrpc=1.54.3=ha177ca7_0
|
||||
- libhwloc=2.9.3=default_haede6df_1009
|
||||
- libiconv=1.17=h8ffe710_0
|
||||
- liblapack=3.9.0=18_win64_mkl
|
||||
- liblapacke=3.9.0=18_win64_mkl
|
||||
- libnpp=11.8.0.86=0
|
||||
- libnpp-dev=11.8.0.86=0
|
||||
- libnvjpeg=11.9.0.86=0
|
||||
- libnvjpeg-dev=11.9.0.86=0
|
||||
- libogg=1.3.4=h8ffe710_1
|
||||
- libopencv=4.5.3=py39h488c12c_8
|
||||
- libpng=1.6.39=h19919ed_0
|
||||
- libprotobuf=3.21.12=h12be248_2
|
||||
- libsodium=1.0.18=h62dcd97_0
|
||||
- libsqlite=3.43.0=hcfcfb64_0
|
||||
- libssh2=1.11.0=h7dfc565_0
|
||||
- libthrift=0.18.1=h06f6336_2
|
||||
- libtiff=4.4.0=hc4f729c_5
|
||||
- libutf8proc=2.8.0=h82a8f57_0
|
||||
- libuv=1.44.2=hcfcfb64_1
|
||||
- libvorbis=1.3.7=h0e60522_0
|
||||
- libwebp-base=1.3.2=hcfcfb64_0
|
||||
- libxcb=1.13=hcd874cb_1004
|
||||
- libxml2=2.11.5=hc3477c8_1
|
||||
- libzlib=1.2.13=hcfcfb64_5
|
||||
- lz4-c=1.9.4=hcfcfb64_0
|
||||
- m2w64-gcc-libgfortran=5.3.0=6
|
||||
- m2w64-gcc-libs=5.3.0=7
|
||||
- m2w64-gcc-libs-core=5.3.0=7
|
||||
- m2w64-gmp=6.1.0=2
|
||||
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
|
||||
- markupsafe=2.1.3=py39ha55989b_1
|
||||
- matplotlib-base=3.8.0=py39hf19769e_1
|
||||
- matplotlib-inline=0.1.6=py39haa95532_0
|
||||
- mkl=2022.1.0=h6a75c08_874
|
||||
- mkl-devel=2022.1.0=h57928b3_875
|
||||
- mkl-include=2022.1.0=h6a75c08_874
|
||||
- mpmath=1.3.0=pyhd8ed1ab_0
|
||||
- msys2-conda-epoch=20160418=1
|
||||
- multidict=6.0.4=py39ha55989b_0
|
||||
- multiprocess=0.70.15=py39ha55989b_1
|
||||
- munkres=1.1.4=pyh9f0ad1d_0
|
||||
- nest-asyncio=1.5.6=py39haa95532_0
|
||||
- networkx=3.1=pyhd8ed1ab_0
|
||||
- numpy=1.26.0=py39hddb5d58_0
|
||||
- opencv=4.5.3=py39hcbf5309_8
|
||||
- openjpeg=2.5.0=hc9384bd_1
|
||||
- openssl=3.1.3=hcfcfb64_0
|
||||
- orc=1.9.0=hada7b9e_1
|
||||
- packaging=23.1=pyhd8ed1ab_0
|
||||
- pandas=2.1.1=py39h32e6231_0
|
||||
- parso=0.8.3=pyhd3eb1b0_0
|
||||
- pcre2=10.40=h17e33f8_0
|
||||
- pickleshare=0.7.5=pyhd3eb1b0_1003
|
||||
- pillow=9.2.0=py39h595c93f_3
|
||||
- pip=23.2.1=pyhd8ed1ab_0
|
||||
- platformdirs=3.10.0=pyhd8ed1ab_0
|
||||
- prompt-toolkit=3.0.36=py39haa95532_0
|
||||
- psutil=5.9.0=py39h2bbff1b_0
|
||||
- pthread-stubs=0.4=hcd874cb_1001
|
||||
- pthreads-win32=2.9.1=hfa6e2cd_3
|
||||
- pure_eval=0.2.2=pyhd3eb1b0_0
|
||||
- py-opencv=4.5.3=py39h00e5391_8
|
||||
- pyarrow=12.0.1=py39hca4e8af_5_cpu
|
||||
- pycocotools=2.0.6=py39hc266a54_1
|
||||
- pygments=2.15.1=py39haa95532_1
|
||||
- pyparsing=3.1.1=pyhd8ed1ab_0
|
||||
- pysocks=1.7.1=pyh0701188_6
|
||||
- python=3.9.18=h4de0772_0_cpython
|
||||
- python-dateutil=2.8.2=pyhd8ed1ab_0
|
||||
- python-tzdata=2023.3=pyhd8ed1ab_0
|
||||
- python-xxhash=3.3.0=py39ha55989b_1
|
||||
- python_abi=3.9=4_cp39
|
||||
- pytorch=2.0.1=py3.9_cuda11.8_cudnn8_0
|
||||
- pytorch-cuda=11.8=h24eeafa_5
|
||||
- pytorch-mutex=1.0=cuda
|
||||
- pytz=2023.3.post1=pyhd8ed1ab_0
|
||||
- pywin32=305=py39h2bbff1b_0
|
||||
- pyyaml=6.0.1=py39ha55989b_1
|
||||
- pyzmq=25.1.0=py39hd77b12b_0
|
||||
- qt-main=5.15.8=h720456b_6
|
||||
- re2=2023.03.02=hd4eee63_0
|
||||
- regex=2023.8.8=py39ha55989b_1
|
||||
- requests=2.31.0=pyhd8ed1ab_0
|
||||
- sacremoses=0.0.53=pyhd8ed1ab_0
|
||||
- safetensors=0.3.3=py39hf21820d_1
|
||||
- setuptools=68.2.2=pyhd8ed1ab_0
|
||||
- six=1.16.0=pyh6c4a22f_0
|
||||
- snappy=1.1.10=hfb803bf_0
|
||||
- stack_data=0.2.0=pyhd3eb1b0_0
|
||||
- sympy=1.12=pyh04b8f61_3
|
||||
- tbb=2021.10.0=h91493d7_1
|
||||
- timm=0.9.7=pyhd8ed1ab_0
|
||||
- tk=8.6.13=hcfcfb64_0
|
||||
- tokenizers=0.13.3=py39hca44cb7_0
|
||||
- tomli=2.0.1=pyhd8ed1ab_0
|
||||
- tornado=6.3.2=py39h2bbff1b_0
|
||||
- tqdm=4.66.1=pyhd8ed1ab_0
|
||||
- traitlets=5.7.1=py39haa95532_0
|
||||
- transformers=4.33.2=pyhd8ed1ab_0
|
||||
- typing-extensions=4.8.0=hd8ed1ab_0
|
||||
- typing_extensions=4.8.0=pyha770c72_0
|
||||
- tzdata=2023c=h71feb2d_0
|
||||
- ucrt=10.0.22621.0=h57928b3_0
|
||||
- unicodedata2=15.0.0=py39ha55989b_1
|
||||
- urllib3=2.0.5=pyhd8ed1ab_0
|
||||
- vc=14.3=h64f974e_17
|
||||
- vc14_runtime=14.36.32532=hdcecf7f_17
|
||||
- vs2015_runtime=14.36.32532=h05e6639_17
|
||||
- wcwidth=0.2.5=pyhd3eb1b0_0
|
||||
- wheel=0.41.2=pyhd8ed1ab_0
|
||||
- win_inet_pton=1.1.0=pyhd8ed1ab_6
|
||||
- xorg-libxau=1.0.11=hcd874cb_0
|
||||
- xorg-libxdmcp=1.1.3=hcd874cb_0
|
||||
- xxhash=0.8.2=hcfcfb64_0
|
||||
- xz=5.2.6=h8d14728_0
|
||||
- yaml=0.2.5=h8ffe710_2
|
||||
- yapf=0.40.1=pyhd8ed1ab_0
|
||||
- yarl=1.9.2=py39ha55989b_0
|
||||
- zeromq=4.3.4=hd77b12b_0
|
||||
- zipp=3.17.0=pyhd8ed1ab_0
|
||||
- zlib=1.2.13=hcfcfb64_5
|
||||
- zstd=1.5.5=h12be248_0
|
||||
- pip:
|
||||
- opencv-python==4.8.0.76
|
||||
- supervision==0.6.0
|
||||
- torchaudio==2.0.2
|
||||
- torchvision==0.15.2
|
||||
prefix: C:\Users\Makoto\miniconda3\envs\dino
|
|
@ -0,0 +1,43 @@
|
|||
batch_size = 1
|
||||
modelname = "groundingdino"
|
||||
backbone = "swin_B_384_22k"
|
||||
position_embedding = "sine"
|
||||
pe_temperatureH = 20
|
||||
pe_temperatureW = 20
|
||||
return_interm_indices = [1, 2, 3]
|
||||
backbone_freeze_keywords = None
|
||||
enc_layers = 6
|
||||
dec_layers = 6
|
||||
pre_norm = False
|
||||
dim_feedforward = 2048
|
||||
hidden_dim = 256
|
||||
dropout = 0.0
|
||||
nheads = 8
|
||||
num_queries = 900
|
||||
query_dim = 4
|
||||
num_patterns = 0
|
||||
num_feature_levels = 4
|
||||
enc_n_points = 4
|
||||
dec_n_points = 4
|
||||
two_stage_type = "standard"
|
||||
two_stage_bbox_embed_share = False
|
||||
two_stage_class_embed_share = False
|
||||
transformer_activation = "relu"
|
||||
dec_pred_bbox_embed_share = True
|
||||
dn_box_noise_scale = 1.0
|
||||
dn_label_noise_ratio = 0.5
|
||||
dn_label_coef = 1.0
|
||||
dn_bbox_coef = 1.0
|
||||
embed_init_tgt = True
|
||||
dn_labelbook_size = 2000
|
||||
max_text_len = 256
|
||||
text_encoder_type = "bert-base-uncased"
|
||||
use_text_enhancer = True
|
||||
use_fusion_layer = True
|
||||
use_checkpoint = True
|
||||
use_transformer_ckpt = True
|
||||
use_text_cross_attention = True
|
||||
text_dropout = 0.0
|
||||
fusion_dropout = 0.0
|
||||
fusion_droppath = 0.1
|
||||
sub_sentence_present = True
|
|
@ -0,0 +1,269 @@
|
|||
# ------------------------------------------------------------------------
|
||||
# Grounding DINO. Modified by Shilong Liu.
|
||||
# url: https://github.com/IDEA-Research/GroundingDINO
|
||||
# Copyright (c) 2023 IDEA. All Rights Reserved.
|
||||
# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
|
||||
# ------------------------------------------------------------------------
|
||||
# Copyright (c) Aishwarya Kamath & Nicolas Carion. Licensed under the Apache License 2.0. All Rights Reserved
|
||||
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
|
||||
"""
|
||||
COCO evaluator that works in distributed mode.
|
||||
|
||||
Mostly copy-paste from https://github.com/pytorch/vision/blob/edfd5a7/references/detection/coco_eval.py
|
||||
The difference is that there is less copy-pasting from pycocotools
in the end of the file, as python3 can suppress prints with contextlib
"""
import contextlib
import copy
import os

import numpy as np
import pycocotools.mask as mask_util
import torch
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

from groundingdino.util.misc import all_gather


class CocoGroundingEvaluator(object):
    def __init__(self, coco_gt, iou_types, useCats=True):
        assert isinstance(iou_types, (list, tuple))
        coco_gt = copy.deepcopy(coco_gt)
        self.coco_gt = coco_gt

        self.iou_types = iou_types
        self.coco_eval = {}
        for iou_type in iou_types:
            self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
            self.coco_eval[iou_type].useCats = useCats

        self.img_ids = []
        self.eval_imgs = {k: [] for k in iou_types}
        self.useCats = useCats

    def update(self, predictions):
        img_ids = list(np.unique(list(predictions.keys())))
        self.img_ids.extend(img_ids)

        for iou_type in self.iou_types:
            results = self.prepare(predictions, iou_type)

            # suppress pycocotools prints
            with open(os.devnull, "w") as devnull:
                with contextlib.redirect_stdout(devnull):
                    coco_dt = COCO.loadRes(self.coco_gt, results) if results else COCO()

            coco_eval = self.coco_eval[iou_type]

            coco_eval.cocoDt = coco_dt
            coco_eval.params.imgIds = list(img_ids)
            coco_eval.params.useCats = self.useCats
            img_ids, eval_imgs = evaluate(coco_eval)

            self.eval_imgs[iou_type].append(eval_imgs)

    def synchronize_between_processes(self):
        for iou_type in self.iou_types:
            self.eval_imgs[iou_type] = np.concatenate(self.eval_imgs[iou_type], 2)
            create_common_coco_eval(self.coco_eval[iou_type], self.img_ids, self.eval_imgs[iou_type])

    def accumulate(self):
        for coco_eval in self.coco_eval.values():
            coco_eval.accumulate()

    def summarize(self):
        for iou_type, coco_eval in self.coco_eval.items():
            print("IoU metric: {}".format(iou_type))
            coco_eval.summarize()

    def prepare(self, predictions, iou_type):
        if iou_type == "bbox":
            return self.prepare_for_coco_detection(predictions)
        elif iou_type == "segm":
            return self.prepare_for_coco_segmentation(predictions)
        elif iou_type == "keypoints":
            return self.prepare_for_coco_keypoint(predictions)
        else:
            raise ValueError("Unknown iou type {}".format(iou_type))

    def prepare_for_coco_detection(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            boxes = prediction["boxes"]
            boxes = convert_to_xywh(boxes).tolist()
            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "bbox": box,
                        "score": scores[k],
                    }
                    for k, box in enumerate(boxes)
                ]
            )
        return coco_results

    def prepare_for_coco_segmentation(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            scores = prediction["scores"]
            labels = prediction["labels"]
            masks = prediction["masks"]

            masks = masks > 0.5

            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()

            rles = [
                mask_util.encode(np.array(mask[0, :, :, np.newaxis], dtype=np.uint8, order="F"))[0]
                for mask in masks
            ]
            for rle in rles:
                rle["counts"] = rle["counts"].decode("utf-8")

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "segmentation": rle,
                        "score": scores[k],
                    }
                    for k, rle in enumerate(rles)
                ]
            )
        return coco_results

    def prepare_for_coco_keypoint(self, predictions):
        coco_results = []
        for original_id, prediction in predictions.items():
            if len(prediction) == 0:
                continue

            boxes = prediction["boxes"]
            boxes = convert_to_xywh(boxes).tolist()
            scores = prediction["scores"].tolist()
            labels = prediction["labels"].tolist()
            keypoints = prediction["keypoints"]
            keypoints = keypoints.flatten(start_dim=1).tolist()

            coco_results.extend(
                [
                    {
                        "image_id": original_id,
                        "category_id": labels[k],
                        "keypoints": keypoint,
                        "score": scores[k],
                    }
                    for k, keypoint in enumerate(keypoints)
                ]
            )
        return coco_results


def convert_to_xywh(boxes):
    xmin, ymin, xmax, ymax = boxes.unbind(1)
    return torch.stack((xmin, ymin, xmax - xmin, ymax - ymin), dim=1)


def merge(img_ids, eval_imgs):
    all_img_ids = all_gather(img_ids)
    all_eval_imgs = all_gather(eval_imgs)

    merged_img_ids = []
    for p in all_img_ids:
        merged_img_ids.extend(p)

    merged_eval_imgs = []
    for p in all_eval_imgs:
        merged_eval_imgs.append(p)

    merged_img_ids = np.array(merged_img_ids)
    merged_eval_imgs = np.concatenate(merged_eval_imgs, 2)

    # keep only unique (and in sorted order) images
    merged_img_ids, idx = np.unique(merged_img_ids, return_index=True)
    merged_eval_imgs = merged_eval_imgs[..., idx]

    return merged_img_ids, merged_eval_imgs


def create_common_coco_eval(coco_eval, img_ids, eval_imgs):
    img_ids, eval_imgs = merge(img_ids, eval_imgs)
    img_ids = list(img_ids)
    eval_imgs = list(eval_imgs.flatten())

    coco_eval.evalImgs = eval_imgs
    coco_eval.params.imgIds = img_ids
    coco_eval._paramsEval = copy.deepcopy(coco_eval.params)


#################################################################
# From pycocotools, just removed the prints and fixed
# a Python3 bug about unicode not defined
#################################################################


def evaluate(self):
    """
    Run per image evaluation on given images and store results (a list of dict) in self.evalImgs
    :return: None
    """
    # tic = time.time()
    # print('Running per image evaluation...')
    p = self.params
    # add backward compatibility if useSegm is specified in params
    if p.useSegm is not None:
        p.iouType = "segm" if p.useSegm == 1 else "bbox"
        print("useSegm (deprecated) is not None. Running {} evaluation".format(p.iouType))
    # print('Evaluate annotation type *{}*'.format(p.iouType))
    p.imgIds = list(np.unique(p.imgIds))
    if p.useCats:
        p.catIds = list(np.unique(p.catIds))
    p.maxDets = sorted(p.maxDets)
    self.params = p

    self._prepare()
    # loop through images, area range, max detection number
    catIds = p.catIds if p.useCats else [-1]

    if p.iouType == "segm" or p.iouType == "bbox":
        computeIoU = self.computeIoU
    elif p.iouType == "keypoints":
        computeIoU = self.computeOks
    self.ious = {
        (imgId, catId): computeIoU(imgId, catId)
        for imgId in p.imgIds
        for catId in catIds}

    evaluateImg = self.evaluateImg
    maxDet = p.maxDets[-1]
    evalImgs = [
        evaluateImg(imgId, catId, areaRng, maxDet)
        for catId in catIds
        for areaRng in p.areaRng
        for imgId in p.imgIds
    ]
    # this is NOT in the pycocotools code, but could be done outside
    evalImgs = np.asarray(evalImgs).reshape(len(catIds), len(p.areaRng), len(p.imgIds))
    self._paramsEval = copy.deepcopy(self.params)
    # toc = time.time()
    # print('DONE (t={:0.2f}s).'.format(toc-tic))
    return p.imgIds, evalImgs


#################################################################
# end of straight copy from pycocotools, just removing the prints
#################################################################
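The evaluator above follows the usual pycocotools driver loop: feed per-image predictions keyed by COCO image id, then synchronize across processes, accumulate, and summarize. Below is a minimal, hedged sketch of such a loop; the function name, the postprocessor, the target fields, and the module path of the import are assumptions for illustration, not part of this file.

```python
import torch
from pycocotools.coco import COCO
from groundingdino.datasets.cocogrounding_eval import CocoGroundingEvaluator  # path assumed


def evaluate_coco(model, postprocessor, data_loader, ann_file, device="cuda"):
    """Illustrative driver for CocoGroundingEvaluator; argument names are placeholders."""
    evaluator = CocoGroundingEvaluator(COCO(ann_file), iou_types=["bbox"])
    model.eval()
    with torch.no_grad():
        for images, targets in data_loader:
            captions = [t["caption"] for t in targets]
            outputs = model(images.to(device), captions=captions)
            orig_sizes = torch.stack([t["orig_size"] for t in targets]).to(device)
            # postprocessor is assumed to return one {"boxes", "scores", "labels"} dict per image
            results = postprocessor(outputs, orig_sizes)
            evaluator.update({t["image_id"].item(): r for t, r in zip(targets, results)})
    evaluator.synchronize_between_processes()
    evaluator.accumulate()
    evaluator.summarize()
    return evaluator.coco_eval["bbox"].stats
```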
groundingdino/models/GroundingDINO/groundingdino.py

@@ -206,6 +206,21 @@ class GroundingDINO(nn.Module):
            nn.init.xavier_uniform_(proj[0].weight, gain=1)
            nn.init.constant_(proj[0].bias, 0)

    def set_image_tensor(self, samples: NestedTensor):
        if isinstance(samples, (list, torch.Tensor)):
            samples = nested_tensor_from_tensor_list(samples)
        self.features, self.poss = self.backbone(samples)

    def unset_image_tensor(self):
        if hasattr(self, 'features'):
            del self.features
        if hasattr(self,'poss'):
            del self.poss

    def set_image_features(self, features , poss):
        self.features = features
        self.poss = poss

    def init_ref_points(self, use_num_queries):
        self.refpoint_embed = nn.Embedding(use_num_queries, self.query_dim)

@@ -228,7 +243,6 @@ class GroundingDINO(nn.Module):
            captions = kw["captions"]
        else:
            captions = [t["caption"] for t in targets]
        len(captions)

        # encoder texts
        tokenized = self.tokenizer(captions, padding="longest", return_tensors="pt").to(

@@ -283,14 +297,14 @@ class GroundingDINO(nn.Module):
        }

        # import ipdb; ipdb.set_trace()

        if isinstance(samples, (list, torch.Tensor)):
            samples = nested_tensor_from_tensor_list(samples)
        features, poss = self.backbone(samples)
        if not hasattr(self, 'features') or not hasattr(self, 'poss'):
            self.set_image_tensor(samples)

        srcs = []
        masks = []
        for l, feat in enumerate(features):
        for l, feat in enumerate(self.features):
            src, mask = feat.decompose()
            srcs.append(self.input_proj[l](src))
            masks.append(mask)

@@ -299,7 +313,7 @@ class GroundingDINO(nn.Module):
            _len_srcs = len(srcs)
            for l in range(_len_srcs, self.num_feature_levels):
                if l == _len_srcs:
                    src = self.input_proj[l](features[-1].tensors)
                    src = self.input_proj[l](self.features[-1].tensors)
                else:
                    src = self.input_proj[l](srcs[-1])
                m = samples.mask

@@ -307,11 +321,11 @@ class GroundingDINO(nn.Module):
                pos_l = self.backbone[1](NestedTensor(src, mask)).to(src.dtype)
                srcs.append(src)
                masks.append(mask)
                poss.append(pos_l)
                self.poss.append(pos_l)

        input_query_bbox = input_query_label = attn_mask = dn_meta = None
        hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
            srcs, masks, input_query_bbox, poss, input_query_label, attn_mask, text_dict
            srcs, masks, input_query_bbox, self.poss, input_query_label, attn_mask, text_dict
        )

        # deformable-detr-like anchor update

@@ -345,7 +359,9 @@ class GroundingDINO(nn.Module):
        # interm_class = self.transformer.enc_out_class_embed(hs_enc[-1], text_dict)
        # out['interm_outputs'] = {'pred_logits': interm_class, 'pred_boxes': interm_coord}
        # out['interm_outputs_for_matching_pre'] = {'pred_logits': interm_class, 'pred_boxes': init_box_proposal}

        unset_image_tensor = kw.get('unset_image_tensor', True)
        if unset_image_tensor:
            self.unset_image_tensor() ## If necessary
        return out

    @torch.jit.unused

@@ -393,3 +409,4 @@ def build_groundingdino(args):
    )

    return model

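The new set_image_tensor / unset_image_tensor hooks let the backbone features be computed once and reused across several text prompts: forward() only recomputes self.features / self.poss when they are missing, and the unset_image_tensor keyword decides whether they are dropped at the end of the call. A hedged sketch of that pattern follows; the model and image variables are assumed to exist already (a loaded GroundingDINO module and a preprocessed 3xHxW tensor).

```python
# Sketch: reuse one image's backbone features for several captions.
import torch

captions = ["a dog .", "a chair .", "a person ."]
outputs = []

with torch.no_grad():
    model.set_image_tensor(image[None])        # cache self.features / self.poss once
    for caption in captions:
        # forward() skips the backbone because the cached features are present,
        # and keeps them because unset_image_tensor=False
        out = model(image[None], captions=[caption], unset_image_tensor=False)
        outputs.append((out["pred_logits"], out["pred_boxes"]))
    model.unset_image_tensor()                 # free the cached features explicitly
```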
groundingdino/util/get_tokenlizer.py

@@ -1,5 +1,5 @@
from transformers import AutoTokenizer, BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast

import os

def get_tokenlizer(text_encoder_type):
    if not isinstance(text_encoder_type, str):

@@ -8,6 +8,8 @@ def get_tokenlizer(text_encoder_type):
            text_encoder_type = text_encoder_type.text_encoder_type
        elif text_encoder_type.get("text_encoder_type", False):
            text_encoder_type = text_encoder_type.get("text_encoder_type")
        elif os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type):
            pass
        else:
            raise ValueError(
                "Unknown type of text_encoder_type: {}".format(type(text_encoder_type))

@@ -19,8 +21,9 @@ def get_tokenlizer(text_encoder_type):


def get_pretrained_language_model(text_encoder_type):
    if text_encoder_type == "bert-base-uncased":
    if text_encoder_type == "bert-base-uncased" or (os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type)):
        return BertModel.from_pretrained(text_encoder_type)
    if text_encoder_type == "roberta-base":
        return RobertaModel.from_pretrained(text_encoder_type)

    raise ValueError("Unknown text_encoder_type {}".format(text_encoder_type))
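With this change, text_encoder_type may also be a local directory holding the Hugging Face BERT files, which is convenient for offline machines. A hedged sketch, where the directory path is a placeholder for wherever the tokenizer and model files were saved with save_pretrained:

```python
# Sketch: load the text encoder from a local copy of bert-base-uncased.
from groundingdino.util.get_tokenlizer import get_pretrained_language_model, get_tokenlizer

local_bert = "/path/to/bert-base-uncased"  # placeholder: contains config.json, vocab.txt, weights

tokenizer = get_tokenlizer(local_bert)               # AutoTokenizer.from_pretrained(local_bert)
bert = get_pretrained_language_model(local_bert)     # BertModel, via the new os.path.isdir branch
```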
groundingdino/util/inference.py

@@ -6,6 +6,7 @@ import supervision as sv
import torch
from PIL import Image
from torchvision.ops import box_convert
import bisect

import groundingdino.datasets.transforms as T
from groundingdino.models import build_model

@@ -55,7 +56,8 @@ def predict(
        caption: str,
        box_threshold: float,
        text_threshold: float,
        device: str = "cuda"
        device: str = "cuda",
        remove_combined: bool = False
) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
    caption = preprocess_caption(caption=caption)

@@ -74,17 +76,40 @@ def predict(

    tokenizer = model.tokenizer
    tokenized = tokenizer(caption)

    phrases = [
        get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
        for logit
        in logits
    ]

    if remove_combined:
        sep_idx = [i for i in range(len(tokenized['input_ids'])) if tokenized['input_ids'][i] in [101, 102, 1012]]

        phrases = []
        for logit in logits:
            max_idx = logit.argmax()
            insert_idx = bisect.bisect_left(sep_idx, max_idx)
            right_idx = sep_idx[insert_idx]
            left_idx = sep_idx[insert_idx - 1]
            phrases.append(get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer, left_idx, right_idx).replace('.', ''))
    else:
        phrases = [
            get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
            for logit
            in logits
        ]

    return boxes, logits.max(dim=1)[0], phrases


def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor, phrases: List[str]) -> np.ndarray:
    """
    This function annotates an image with bounding boxes and labels.

    Parameters:
    image_source (np.ndarray): The source image to be annotated.
    boxes (torch.Tensor): A tensor containing bounding box coordinates.
    logits (torch.Tensor): A tensor containing confidence scores for each bounding box.
    phrases (List[str]): A list of labels for each bounding box.

    Returns:
    np.ndarray: The annotated image.
    """
    h, w, _ = image_source.shape
    boxes = boxes * torch.Tensor([w, h, w, h])
    xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()

@@ -96,9 +121,11 @@ def annotate(image_source: np.ndarray, boxes: torch.Tensor, logits: torch.Tensor, phrases: List[str]) -> np.ndarray:
        in zip(phrases, logits)
    ]

    box_annotator = sv.BoxAnnotator()
    bbox_annotator = sv.BoxAnnotator(color_lookup=sv.ColorLookup.INDEX)
    label_annotator = sv.LabelAnnotator(color_lookup=sv.ColorLookup.INDEX)
    annotated_frame = cv2.cvtColor(image_source, cv2.COLOR_RGB2BGR)
    annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
    annotated_frame = bbox_annotator.annotate(scene=annotated_frame, detections=detections)
    annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
    return annotated_frame


@@ -153,7 +180,8 @@ class Model:
            image=processed_image,
            caption=caption,
            box_threshold=box_threshold,
            text_threshold=text_threshold)
            text_threshold=text_threshold,
            device=self.device)
        source_h, source_w, _ = image.shape
        detections = Model.post_process_result(
            source_h=source_h,

@@ -188,14 +216,15 @@ class Model:
            box_annotator = sv.BoxAnnotator()
            annotated_image = box_annotator.annotate(scene=image, detections=detections)
        """
        caption = ", ".join(classes)
        caption = ". ".join(classes)
        processed_image = Model.preprocess_image(image_bgr=image).to(self.device)
        boxes, logits, phrases = predict(
            model=self.model,
            image=processed_image,
            caption=caption,
            box_threshold=box_threshold,
            text_threshold=text_threshold)
            text_threshold=text_threshold,
            device=self.device)
        source_h, source_w, _ = image.shape
        detections = Model.post_process_result(
            source_h=source_h,

@@ -235,8 +264,10 @@ class Model:
    def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
        class_ids = []
        for phrase in phrases:
            try:
                class_ids.append(classes.index(phrase))
            except ValueError:
            for class_ in classes:
                if class_ in phrase:
                    class_ids.append(classes.index(class_))
                    break
            else:
                class_ids.append(None)
        return np.array(class_ids)
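Two user-visible changes in inference.py: predict() gains a remove_combined flag that confines each phrase to a single segment of the caption bounded by [CLS]/[SEP]/'.' tokens (ids 101, 102, 1012), so a box is labelled with one prompt fragment rather than a concatenation; and annotate() now uses separate supervision BoxAnnotator and LabelAnnotator objects, which is why requirements.txt below moves to supervision>=0.22.0. A hedged usage sketch, with placeholder weight, config, and image paths:

```python
# Sketch: the standard predict/annotate round trip with the new flag.
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth")        # placeholder paths
image_source, image = load_image("assets/demo.jpg")              # placeholder image

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="chair . person . dog .",
    box_threshold=0.35,
    text_threshold=0.25,
    remove_combined=True,   # label each box with one '.'-separated phrase only
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```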
groundingdino/util/slconfig.py

@@ -2,6 +2,7 @@
# Modified from mmcv
# ==========================================================
import ast
import os
import os.path as osp
import shutil
import sys

@@ -80,6 +81,8 @@ class SLConfig(object):
        with tempfile.TemporaryDirectory() as temp_config_dir:
            temp_config_file = tempfile.NamedTemporaryFile(dir=temp_config_dir, suffix=".py")
            temp_config_name = osp.basename(temp_config_file.name)
            if os.name == 'nt':
                temp_config_file.close()
            shutil.copyfile(filename, osp.join(temp_config_dir, temp_config_name))
            temp_module_name = osp.splitext(temp_config_name)[0]
            sys.path.insert(0, temp_config_dir)
groundingdino/util/utils.py

@@ -597,10 +597,12 @@ def targets_to(targets: List[Dict[str, Any]], device):


def get_phrases_from_posmap(
    posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer
    posmap: torch.BoolTensor, tokenized: Dict, tokenizer: AutoTokenizer, left_idx: int = 0, right_idx: int = 255
):
    assert isinstance(posmap, torch.Tensor), "posmap must be torch.Tensor"
    if posmap.dim() == 1:
        posmap[0: left_idx + 1] = False
        posmap[right_idx:] = False
        non_zero_idx = posmap.nonzero(as_tuple=True)[0].tolist()
        token_ids = [tokenized["input_ids"][i] for i in non_zero_idx]
        return tokenizer.decode(token_ids)
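The added left_idx / right_idx window masks the position map outside one caption segment before the token ids are decoded, which is what remove_combined in predict() relies on. A small hedged illustration follows; the caption, the fake position map, and the chosen window indices are made up for the example.

```python
# Sketch: how the left_idx/right_idx window confines a phrase to one segment.
import torch
from transformers import AutoTokenizer
from groundingdino.util.utils import get_phrases_from_posmap

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = tokenizer("chair . person . dog .")
# ids: [CLS] chair . person . dog . [SEP]  ->  '.' and [SEP] are the segment boundaries

posmap = torch.zeros(len(tokenized["input_ids"]), dtype=torch.bool)
posmap[1:-1] = True  # pretend every caption token scored above text_threshold

# Default window: the decoded phrase spans the whole caption.
print(get_phrases_from_posmap(posmap.clone(), tokenized, tokenizer))
# Window around "person" (illustrative indices 2 and 4): only "person" survives.
print(get_phrases_from_posmap(posmap.clone(), tokenized, tokenizer, left_idx=2, right_idx=4))
```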
groundingdino/version.py

@@ -1 +0,0 @@
__version__ = "0.1.0"
requirements.txt

@@ -6,5 +6,5 @@ yapf
timm
numpy
opencv-python
supervision==0.4.0
pycocotools
supervision>=0.22.0
pycocotools
setup.py

@@ -24,6 +24,18 @@ import glob
import os
import subprocess

import subprocess
import sys

def install_torch():
    try:
        import torch
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "torch"])

# Call the function to ensure torch is installed
install_torch()

import torch
from setuptools import find_packages, setup
from torch.utils.cpp_extension import CUDA_HOME, CppExtension, CUDAExtension

@@ -70,7 +82,7 @@ def get_extensions():
    extra_compile_args = {"cxx": []}
    define_macros = []

    if torch.cuda.is_available() and CUDA_HOME is not None:
    if CUDA_HOME is not None and (torch.cuda.is_available() or "TORCH_CUDA_ARCH_LIST" in os.environ):
        print("Compiling with CUDA")
        extension = CUDAExtension
        sources += source_cuda
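The relaxed check means the CUDA extension can be compiled on a machine where the toolkit is installed but no GPU is visible (a common situation in container build stages), as long as CUDA_HOME is set and TORCH_CUDA_ARCH_LIST names the target architectures. A hedged sketch of driving such a build from Python; the toolkit path and architecture list are examples, not requirements of the project.

```python
# Sketch: build the CUDA op without a visible GPU by declaring targets up front.
import os
import subprocess
import sys

env = dict(os.environ)
env.setdefault("CUDA_HOME", "/usr/local/cuda")            # assumed toolkit location
env.setdefault("TORCH_CUDA_ARCH_LIST", "7.5 8.0 8.6+PTX")  # example architecture list

# With the revised condition, get_extensions() selects CUDAExtension because
# TORCH_CUDA_ARCH_LIST is present even though torch.cuda.is_available() is False.
subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "."], env=env)
```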
@@ -0,0 +1,114 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "final text_encoder_type: bert-base-uncased\n"
     ]
    },
    {
     "data": {
      "application/json": {
       "ascii": false,
       "bar_format": null,
       "colour": null,
       "elapsed": 0.014210224151611328,
       "initial": 0,
       "n": 0,
       "ncols": null,
       "nrows": null,
       "postfix": null,
       "prefix": "Downloading model.safetensors",
       "rate": null,
       "total": 440449768,
       "unit": "B",
       "unit_divisor": 1000,
       "unit_scale": true
      },
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5922f34578364d36afa13de9f01254bd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading model.safetensors: 0%| | 0.00/440M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/miniconda3/lib/python3.8/site-packages/transformers/modeling_utils.py:881: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.\n",
      " warnings.warn(\n",
      "/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None\n",
      " warnings.warn(\"None of the inputs have requires_grad=True. Gradients will be None\")\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from groundingdino.util.inference import load_model, load_image, predict, annotate\n",
    "import cv2\n",
    "\n",
    "model = load_model(\"groundingdino/config/GroundingDINO_SwinT_OGC.py\", \"../04-06-segment-anything/weights/groundingdino_swint_ogc.pth\")\n",
    "IMAGE_PATH = \".asset/cat_dog.jpeg\"\n",
    "TEXT_PROMPT = \"chair . person . dog .\"\n",
    "BOX_TRESHOLD = 0.35\n",
    "TEXT_TRESHOLD = 0.25\n",
    "\n",
    "image_source, image = load_image(IMAGE_PATH)\n",
    "\n",
    "boxes, logits, phrases = predict(\n",
    "    model=model,\n",
    "    image=image,\n",
    "    caption=TEXT_PROMPT,\n",
    "    box_threshold=BOX_TRESHOLD,\n",
    "    text_threshold=TEXT_TRESHOLD\n",
    ")\n",
    "\n",
    "annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)\n",
    "cv2.imwrite(\"annotated_image.jpg\", annotated_frame)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}